--- title: "Structure in talker variability: How much is there and how much can it help?" shorttitle: "Structure in talker variability" author: - Dave F. Kleinschmidt bibliography: talker-variability.bib abstract: > One of the persistent puzzles in understanding human speech perception is how listeners cope with talker variability. One thing that might help listeners is structure in talker variability: rather than varying randomly, talkers of the same gender, dialect, age, etc. tend to produce language in similar ways. Listeners are sensitive to this covariation between linguistic variation and socio-indexical variables. In this paper I present new techniques based on ideal observer models to quantify 1) the amount and type of structure in talker variation (_informativity_ of a grouping variable), and 2) how useful such structure can be for robust speech recognition in the face of talker variability (the _utility_ of a grouping variable). I demonstrate these techniques in two phonetic domains---word-initial stop voicing and vowel identity---and show that these domains have different amounts and types of talker variability, consistent with previous, impressionistic findings. An `R` package ([`phondisttools`](https://github.com/kleinschmidt/phondisttools)) accompanies this paper, and the source and data are available from [osf.io/zv6e3](https://osf.io/zv6e3/). authornote: > I gratefully acknowledge Cynthia Clopper, Shannon Heald, Andy Wedel, and Noah Nelson for sharing their measurements of speech production data with us. Without their generosity this work would not have been possible. The techniques proposed here were originally developed jointly with Kodi Weatherholtz. I thank Florian Jaeger for feedback on earlier versions of this work, as well as Rory Turnbull and two anonymous reviewers. I also thank the developers of the R language [@R2017] as well as the following packages: `tidyverse` [@tidyverse], `rmarkdown` [@rmarkdown], `knitr` [@knitr], `cowplot` [@cowplot], `mvtnorm` [@mvtnorm], and `ggbeeswarm` [@ggbeeswarm]. This work was partially funded by NIH NICHD R01 HD075797 and NIH NICHD F31 HD082893. The views expressed here are those of the author and not necessarily those of the funding agencies. Address correspondence about this article to Dave F. Kleinschmidt, Princeton Neuroscience Institute, Princeton University, Washington Road, Princeton NJ 08544, email output: html_document: code_folding: hide dev: png keep_md: true md_extensions: +implicit_figures+pipe_tables+table_captions pandoc_args: - --filter - pandoc-fignos - --filter - pandoc-tablenos - --filter - pandoc-eqnos - --csl - apa.csl pdf_document: md_extensions: +implicit_figures+tex_math_single_backslash+pipe_tables+table_captions keep_tex: true latex_engine: xelatex template: apa6.template.tex citation_package: biblatex dev: cairo_pdf pandoc_args: - --filter - pandoc-fignos - --filter - pandoc-tablenos - --filter - pandoc-eqnos mainfont: "CMU Serif" --- ```{r preamble, message=FALSE, warning=FALSE, error=FALSE, echo=FALSE, results='hide'} library(knitr) opts_chunk$set(message=FALSE, warning=FALSE, error=FALSE, echo=opts_knit$get("rmarkdown.pandoc.to") != 'latex', cache=TRUE, results="hide") ## Produce markdown-formatted figures so that pandoc knows what to do with ## the captions. requires pandoc-fignos to parse the IDs. 
refer to figures ## in text with {@fig:label} or just @fig:label ## ## (see https://github.com/tomduck/pandoc-fignos) knit_hooks$set(plot = function(x, options) { paste0('![', options$fig.cap, ']', '(', opts_knit$get('base.url'), paste(x, collapse='.'), ')', '{#fig:', options$label, '}') }) ## Produce markdown-formatted table captions with anchors for cross-refs. ## Requires pandoc-tablenos to parse the IDs. Refer to tables ## in text with {@tbl:label} or @tbl:label. ## Based partly on http://stackoverflow.com/a/18672268 ## ## (see https://github.com/tomduck/pandoc-tablenos) knit_hooks$set(tbl.cap = function(before, options, envir) { if(!before){ paste0('\n\nTable: ', options$tbl.cap, ' {#tbl:', options$label, '}', sep = '') } }) library(tidyverse) library(magrittr) library(rlang) library(stringr) library(forcats) library(purrrlyr) library(assertthat) library(mvtnorm) library(multidplyr) library(future) plan(multicore) library(ggbeeswarm) library(svglite) library(ggplot2) library(cowplot) library(ggrepel) ## devtools::install_github('kleinschmidt/daver') library(daver) ## devtools::install_github("kleinschmidt/phondisttools") library(phondisttools) ## devtools::install_github('kleinschmidt/nspvowels') library(nspvowels) ## devtools::install_github("kleinschmidt/healdvowels") library(healdvowels) ## devtools::install_github('kleinschmidt/votcorpora') library(votcorpora) ## cowplot theme + y axis gridlines theme_set(theme_cowplot() %+replace% theme(panel.grid.major = element_line(colour='gray90', size=0.2), panel.grid.minor = element_line(colour='gray98', size=0.5), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())) rotate_x_axis_labs <- function(by=45) theme(axis.text.x = element_text(angle=by, hjust=1)) assert_has_names <- function(x, names) { walk(names, ~assert_that(has_name(x, .))) } apply_groupings <- function(d, groupings) { assert_that(has_name(d, "data"), msg="d is not nested (no data column)") walk(d$data, ~assert_has_names(., groupings)) groupings %>% map(~ d %>% mutate(data = map(data, group_by_, .x), grouping = .x)) %>% reduce(bind_rows) } #' Train models for each combination of group and phonetic category #' #' For a variety of grouping levels. #' #' @param data_grouped a tbl with the grouped data. columns `data` (a list #' column with tbls suitable to pass to `phondisttools::train_models`, which are #' grouped according to the corresponding value of `grouping` and also have #' the columns named in category_col, and cue_cols) and `grouping` which names #' the grouping variable that defines the groups of `data`. #' @param category_col the name of the column in the elements of #' `data_grouped$data` that have the phonetic category #' @param cue_cols the name of the column(s) in the elements of #' `data_grouped$data` that have the cue values to train models on. #' #' @return a tbl with columns `grouping` and `models` (a list column of tbls, #' which hold trained models for each combination of `category_cols` and #' `group`), plus any additional columns in `data_grouped`. #' train_models_grouped <- function(data_grouped, category_col, cue_cols, ...) { ## check input format assert_that(has_name(data_grouped, 'data'), has_name(data_grouped, 'grouping')) train <- partial(phondisttools::train_models, grouping=category_col, cues=cue_cols, ...) 
data_grouped %>% mutate(models = map2(data, grouping, ~ train(.x) %>% rename_(group=.y)), data = NULL) } ``` ```{r grouping-and-colors, cache=TRUE} grouping_levels <- c('Marginal', 'Age', 'Gender', 'Dialect', 'Dialect+Gender', 'Talker') gender <- function(x) str_replace(x, "Sex", "Gender") prettier_grouping <- function(d, col=grouping) { col <- enquo(col) d %>% mutate(!!quo_name(col) := map_chr(!!col, str_replace, "_", "+") %>% map_chr(gender) %>% factor(levels=grouping_levels)) } scale_color_grouping <- function(groupings=grouping_levels, ...) { n <- length(groupings) colors <- c("#808080", "#FF8500", "#00D2F7", "#00C928", "#FF20AF", "#7A78FF") ggplot2:::manual_scale("colour", values = set_names(colors, groupings), ...) } pretty_contrast <- function(d) { mutate(d, contrast = factor(contrast, levels=c("Vowels (HN15)", "Vowels (NSP)", "Stop voicing"))) } ``` ```{r labeler} ## a small function that creates \label{}s for latex output if (opts_knit$get("rmarkdown.pandoc.to") == "latex") { label <- function(lab) paste0("\\label{", lab, "}") } else { label <- function(lab) "" } ``` # Introduction The apparent ease and robustness of spoken language understanding belie the considerable computational challenges involved in mapping speech input to linguistic categories. One of the biggest computational challenges stems from the fact that talkers differ from each other in how they pronounce the same phonetic contrast. One talker’s realization of /s/ (as in “seat”), for example, might sound like another talker’s realization of /ʃ/ (as in “sheet”) [@Newman2001]. During speech perception, such inter-talker variability contributes to the *lack of invariance* problem, creating uncertainty about the mapping between acoustic cues and linguistic categories [@Liberman1967]. A number of proposals for how listeners overcome this problem have been offered. A common theme that has emerged is that listeners seem to take advantage of statistical contingencies in the speech signal [for a recent review, see @WeatherholtzInPress]. These contingencies result in part from the fact that inter-talker variability is not random. Rather, inter-talker differences in the cue-to-category mapping are systematically conditioned by a range of factors. This includes both talker-specific anatomy of the vocal tract [@Fitch1999; @Johnson1993] and factors pertaining to a talker's social-indexical group memberships, such as age [@Lee1999], gender [@Perry2001; @Peterson1952], and dialect [@Labov2006]. Listeners seem to draw on these statistical contingencies between linguistic variability on the one hand and talker- and group-specific factors on the other. Upon encountering an unfamiliar talker, for example, the speech perception system seems to adjust the mapping of acoustic cues to linguistic categories to reflect that talker's specific distributional statistics [@Bejjanki2011; @Clayards2008; @Idemaru2011; @Kraljic2007; @McMurray2011a]. Listeners also seem to learn and draw on expectations about cue-category mappings based on a talker’s socio-indexical group memberships. For example, listeners have been found to adjust their speech recognition based on a talker's inferred regional origin [@Hay2010; @Niedzielski1999], gender [@Strand1999; @Johnson1999], age [@Walker2011], and individual identity [@Mitchel2016; @Nygaard1994]. Such talker- and group-specific knowledge is now broadly believed to be critical to speech perception [for reviews, see @Foulkes2015a; @WeatherholtzInPress]. 
An important question that has largely remained unaddressed, however, is how listeners determine _which_ socio-indexical (and other) talker properties should be used for speech perception. In other words, why do listeners group talkers by, for example, age and gender, rather than the color of their shirt? _A priori_, there is an essentially infinite number of ways for a listener to group the speech they have experienced in different situations. Intuitively, we might expect listeners to be sensitive to socio-indexical properties that are _relevant_ to speech perception. Some of the possible socio-indexical groupings will be highly informative about the future cue-category mappings that a listener can expect, while others will be uninformative, or even misleading. This paper seeks to formalize this intuition, in order to derive principled, quantifiable predictions for future work, drawing on a recently proposed computational framework, the ideal adapter [@Kleinschmidt2015].

`r label("r2-ideal-adapter-intro-ps")` The ideal adapter is a computational-level theory of human speech perception [in the sense of @Marr1982]. It seeks to explain aspects of speech perception by formalizing the _goals_ of speech perception and the _information_ available from the world. Like many computational-level models, it treats speech perception as a problem of _inference under uncertainty_, whereby listeners combine what they know about how speech is generated in order to recover (or infer) the most likely explanation for the speech sounds they hear. In this view, talker variability is a primary challenge for speech perception because the most likely explanation for a particular acoustic cue depends on the probabilistic distributions of cues for each possible explanation, and these distributions differ from talker to talker [e.g., @Allen2003; @Newman2001; @Hillenbrand1995].

The central insight of the ideal adapter is twofold. First, when talker variability is not completely random, there is a great deal of information available from _previous experience_ with other talkers about the probabilistic distribution of acoustic cues that correspond to each possible linguistic unit. Second, in order to benefit from this information listeners must actively _learn_ the underlying structure of the talker variability that they have previously experienced, and this learning can be modeled as statistical inference itself. In other words, according to the ideal adapter, robust speech perception depends on _inferring_ how talkers should be grouped together. Thus far this is just a re-statement of the original question---which groups of talkers are worth tracking together?---but the ideal adapter also provides the theoretical framework for _answering_ it. According to the ideal adapter this inference depends on two related but distinct factors. The first factor is whether there is any statistically reliable grouping to be learned in the first place: that is, whether a hypothetical grouping leads to better predictions about acoustic-phonetic cues. The second is whether grouping talkers in a particular way leads to better speech recognition.
That is, given a particular hypothesis about how previously encountered talkers might be grouped, an ideal adapter must ask themselves two questions: is this way of grouping talkers _informative_ about the acoustic-phonetic cue distributions that I have heard, and would grouping those distributions in this way be _useful_ for recognizing a future talker's linguistic intentions (e.g., phonetic categories)?[^socio-perc]

[^socio-perc]: There are other, potentially important uses for tracking group-specific distributions, even when they don't aid speech perception per se. For instance, listeners could use group-specific phonetic cue distributions to infer the age, gender, regional origin, etc. of an unfamiliar talker [@KleinschmidtInPress2017], and such inferences may play an important role in coordinating group behavior [e.g., @Cohen2012].

The answers to these questions can vary depending on the particular language, hypothetical grouping of talkers, and phonetic category, as well as each listener's idiosyncratic experience of talker variability. The goal of this paper is thus to not only show how these questions are formalized by the ideal adapter, but also to quantify the _amount_ and _structure_ of talker variability across two different phonetic domains (vowel identity and stop voicing).

`r label("sense-of-structure")` Note that there are a number of different senses in which talker variability might be structured. Here, I focus on the extent to which variability in the acoustic realization of phonetic categories _across talkers_ is predictable from socio-indexical or other grouping variables, and hence can support generalization based on previous experience. This is different from structure _across categories_, as in the covariation in talker-specific mean VOTs for /b/, /p/, /d/, etc. [e.g., @Chodroff2017], as well as structure _across cues_, within a single category [e.g. VOT and f0 for stop voicing, @Clayards2018; @Kirby2015; etc.]. All of these sorts of structure are complementary, because they mean that observations from one talker/category/cue dimension are informative about others, and I will return to this connection in the general discussion.

There are two main motivations for developing and testing these techniques. First, such quantitative assessments of the degree and structure of talker variability are a critical missing link in the research program set out by the ideal adapter. The ideal adapter makes predictions about when listeners should employ different strategies for coping with talker variation---when they should rapidly adapt, or maintain stable, long-term representations of particular talkers, or generalize from experience with one or a group of different talkers. These predictions depend in large part on how much and what kind of structure there is in talker variability. The techniques I propose here provide the necessary grounding to turn the qualitative predictions of @Kleinschmidt2015 into testable, quantitative predictions. Second, these techniques offer a general method for quantitatively assessing the structure of talker variability from speech production data in a variety of contexts, across phonetic systems, languages, and even levels of linguistic representation. A further advantage of the techniques proposed here is that they are directly, quantitatively comparable across different phonetic categories and sets of cues.
As such they are, I hope, generally useful to speech scientists and sociolinguists in a variety of theoretical frameworks, including exemplar/episodic accounts [@Johnson1997; @Goldinger1998; @Pierrehumbert2006] and normalization/cue-compensation accounts [e.g., @Cole2010; @Holt2005; @McMurray2011a]. For example, in exemplar/episodic accounts, it is sometimes assumed that speech inputs are stored along with "salient" social context [e.g., @Sumner2014]. What determines the salience of contexts is, however, left unspecified [for related discussion, see @Jaeger2016]. The informativity and utility measures explored here might serve to define and quantify salience. Additionally, the specific predictions I derive below pertain to native listeners' perception of native American English. However, this approach is more general, extending, for example, to non-native perception and native perception of foreign-accented speech. In service of this goal, I have developed an R [@R2017] package, [`phondisttools`](https://github.com/kleinschmidt/phondisttools). The code that generated this paper is available from [osf.io/zv6e3](https://osf.io/zv6e3/), in the form of an RMarkdown document, along with the datasets.

## Outline and preview of results

`r label("r2-intro-preview")` The rest of this paper is structured as follows. The following section presents the basic logic of the ideal adapter, which motivates the measures of informativity and utility. The section after that describes the general methods used to estimate phonetic cue distributions, and the data sets that are analyzed below. The section after that defines and examines the _informativity_ of socio-indexical variables about cue distributions themselves (Study 1). The results of Study 1 show that, at a broad level, vowels show more talker variability than stop voicing. This is consistent with previous, impressionistic findings but is based on a principled measure that allows _direct_ comparisons between the two phonetic systems, and serves as a proof of concept that this measure can be applied in other domains. At a more fine-grained level, these results also show that this variability is _structured_ by some socio-indexical variables, but not all, and that this structure depends on how cues themselves are represented. The fact that structure in talker variability _exists_ does not necessarily mean that it will be _useful_ in speech recognition---or, conversely, that ignoring it will be harmful---which motivates the notion of _utility_ that is defined and evaluated in Study 2. The results of Study 2 show, first, that informativity largely predicts utility: talker-specific cue distributions provide a consistent advantage over nearly every less-specific grouping of talkers, and groupings that were more informative than expected by chance also provide (often modest) improvements in successful recognition. Second, Study 2 finds that these gains in utility are often rather modest. Third, and relatedly, large differences in informativity do not always lead to similarly large differences in utility. Finally, in the general discussion I review the implications of these findings for understanding how listeners track talker variability in order to understand speech more robustly. On the one hand, these results suggest that meaningful groupings of talkers exist for listeners to learn from their experience, and that doing so can make speech perception more robust.
On the other hand, they show that not every socially-indexed way of grouping talkers is informative or useful for speech recognition per se, and that informativity and utility furthermore depend on the way that acoustic cues are represented. # The ideal adapter `r label("r2-ideal-adapter")` ```{r ideal-adapter-schematic, fig.width=10.5, fig.height=6, warning=FALSE, fig.cap="How well a listener can recognize the phonetic category [**A**, e.g., /s/ vs. /ʃ/; loosely based on @Newman2001] a talker is producing depends on what the listener knows about the underlying cue distributions (**B**). These distributions _vary_ across talkers, which results in variability in the best category boundary. Each talker's cue distributions can be characterized by their _parameters_ (**C**; e.g. the mean of /s/, mean of /ʃ/, variance of /s/, etc.; together denoted θ). Each point in **C** corresponds to a pair of distributions in **B** and one category boundary in **A**. _Groups_ of talkers are thus distributions in this high-dimensional space (**C**, ellipses); marginalizing (averaging) over a group smears out the category-specific distributions (thick lines in **B**) and thus the category boundary (**A**). Thus, Jose's /s/ and /ʃ/ are best classified using his own distributions (purple), in the sense that this leads to a steeper boundary at a different cue value compared to the boundary from the _marginal_ distributions over all talkers (gray) or other males (light blue)."} theme_schematic <- theme_cowplot() + theme(legend.position="none", plot.title= element_text(hjust=0)) theme_blank_y_axis <- theme(axis.title.y = element_blank(), axis.line.y = element_blank(), axis.ticks.y=element_blank(), axis.text.y=element_blank()) ## generate parameters n <- 30 s_sh_mu <- 0.8 s_sh_offset <- -1.5 sd_0 <- 1 sd_sd <- 0.1 s_sh_sd <- 0.02 gender_offset <- 3 who_labels = c("Judith (same lang)", "Jim (same gender+lang)", "Jose") %>% factor(., levels=.) whos = tibble(who = who_labels, who_short = c("language", "gender", "Jose")) who_colors <- set_names(c("#888888", "#00D2F7", "#7A78FF"), who_labels) set.seed(1002) params <- tibble(talker = seq_len(n), gender = rep(c("Male", "Female"), length.out=n), who = case_when( talker == 1 ~ who_labels[3], gender == "Male" ~ who_labels[2], TRUE ~ who_labels[1] ), mu_sum = rnorm(n) + gender_offset*(gender == "Female"), mu_diff = rnorm(n, mean = s_sh_offset, sd=.7), mu_s = mu_sum - mu_diff, mu_ʃ = mu_sum + mu_diff, sd_s = sd_0 + rnorm(n, sd=sd_sd), sd_ʃ = sd_s + rnorm(n, sd=s_sh_sd)) %>% mutate_at(vars(mu_s:sd_ʃ), funs( ./6 * (5900-5400) )) %>% mutate_at(vars(mu_s:mu_ʃ), funs( . + 5400)) ## PLot param space p_params <- ggplot(params, aes(x=mu_s, y=mu_ʃ, color=who)) + stat_ellipse(data = . %>% filter(gender=="Male"), type="norm", geom="polygon", fill=who_colors[2], color=NA, alpha=0.2) + stat_ellipse(data = . %>% mutate(who=NULL), type="norm", geom="polygon", fill=who_colors[1], color=NA, alpha=0.2) + geom_point(aes(size=who)) + geom_abline() + coord_equal() + geom_label_repel(data = . 
%>% group_by(who) %>% mutate(who2 = ifelse(row_number() == 1, as.character(who), "")), aes(label=who2), hjust = 0, box.padding = 2, label.size=NA, fill = gray(1, alpha=0.7)) + scale_color_manual(values = who_colors) + scale_fill_manual(values = who_colors) + scale_size_manual(values = c(1.5, 1.5, 3)) + theme_schematic + labs(x = "θ₁ = mean COG of /s/", y = "θ₂ = mean COG of /ʃ/") + annotate("text", x=5650, y=5100, label="p(θ | male)", color=who_colors[2], hjust=0) + annotate("text", x=5800, y=5250, label="p(θ | Am. Eng.)", color=who_colors[1], hjust=0) ## generate PDFs params_by_cat <- params %>% gather("param", "value", mu_s:sd_ʃ) %>% separate(param, c("param", "category"), "_") %>% spread("param", value) params_to_pdfs <- . %>% mutate(pdfs = map2(mu, sd, ~ tibble(cog = seq(min(mu-3*sd), max(mu+3*sd), length.out=1000), lhood = dnorm(cog, mean=.x, sd=.y)))) %>% unnest(pdfs) pdfs <- params_by_cat %>% params_to_pdfs param_samples <- params_by_cat %>% unnest(cog=map2(mu, sd, rnorm, n=1000)) group_pdfs <- c(quo(TRUE), quo(gender == "Male"), quo(talker == 1)) %>% map2(who_labels, ~ filter(param_samples, !!.x) %>% group_by(category) %>% summarise(mu=mean(cog), sd=sd(cog)) %>% mutate(who=.y)) %>% bind_rows() %>% params_to_pdfs ## plot PDFs pdf_labs <- group_pdfs %>% left_join(whos) %>% group_by(who, category) %>% filter(lhood == max(lhood), who_short == "Jose" | category == "s") %>% mutate(lhood_lab = map2_chr(category, who_short, ~ paste("p(COG | ", .x, ", ", .y, ")", sep=""))) p_pdfs <- ggplot(pdfs, aes(x=cog, y=lhood, linetype=category)) + geom_line(data = . %>% filter(talker != 1), aes(group=interaction(talker, category), color=who), alpha=0.2) + geom_line(data=group_pdfs, aes(group=interaction(who, category), color=who, size=who)) + geom_label(data = pdf_labs, aes(label=lhood_lab, color=who), label.size=NA, hjust=0, vjust=0, nudge_x=25, fill=gray(1, 0.8)) + scale_color_manual(values = who_colors) + scale_size_manual(values = set_names(c(1, 1, 1.5), who_labels)) + labs(x = "Spectral center of gravity (COG; Hz)") + theme_schematic + theme_blank_y_axis ## generate classification functions pdfs_to_posteriors <- . %>% select(-mu, -sd) %>% spread(category, lhood) %>% filter(!is.na(s), !is.na(ʃ), cog > quantile(cog, 0.2), cog < quantile(cog, 0.8)) %>% mutate(p_s = s / (s+ʃ)) post_labels <- group_pdfs %>% pdfs_to_posteriors %>% left_join(whos) %>% mutate(post_lab = map_chr(who_short, ~ paste0("p(s | COG, ", .x, ")"))) %>% group_by(who) %>% mutate(near80 = abs(cog - 5430), post_lab = ifelse(near80 == min(near80), post_lab, "")) ## plot classifcation functions p_posteriors <- pdfs %>% # filter(ifelse(category == "s", cog-1 <= mu, cog+1 >= mu)) %>% pdfs_to_posteriors %>% ggplot(aes(x=cog, y=p_s, color=who)) + geom_line(aes(group=talker), alpha=0.2) + geom_line(aes(size=who), data = pdfs_to_posteriors(group_pdfs)) + geom_label(data=post_labels %>% filter(post_lab != ""), aes(label=post_lab), hjust=0, vjust=1.1, label.size=NA, fill=gray(1, alpha=0.7)) + scale_size_manual(values = set_names(c(1, 1, 1.5), who_labels)) + scale_color_manual(values = who_colors) + theme_schematic ## put the whole thing together plot_grid(p_posteriors + theme_blank_y_axis + theme(axis.title.x = element_blank(), axis.text.x= element_blank()) + lims(x = range(pdfs$cog)) + ggtitle("A: Classification of categories /s/ vs. 
/ʃ/", subtitle="posterior probability p(/s/ | COG, group)"),
          p_pdfs + ggtitle("B: Cue distributions", subtitle="likelihood p(COG | category, ɡroup)"),
          ncol=1, rel_heights=c(1, 1.1)) %>%
  plot_grid(.,
            p_params + ggtitle("C: Talkers in distribution parameter space",
                               subtitle=expression(paste("p(θ = ", group("[", list(μ[s], μ[ʃ], σ[s], ...), "]"), " | group)"))),
            ncol=2)
```

This section briefly introduces the logic of the ideal adapter model [for a more detailed introduction, see @Kleinschmidt2015]. Figure @fig:ideal-adapter-schematic provides a hypothetical illustration of this logic for an /s/-/ʃ/ contrast [loosely based on @Newman2001]. Both the informativity and the utility of a particular grouping of talkers are defined based on the linguistic cue-category mappings for each implied group. In the ideal adapter, like other ideal observer models, these cue-category mappings are represented as _category-specific cue distributions_, or the probability distribution of observable cues associated with each underlying linguistic category [phoneme or phonetic category; @Clayards2008; @Feldman2009; @Norris2008]. This is a direct consequence of how these models treat perception as a process of inference under uncertainty, formalized using Bayes rule:

\[ p(\mathrm{category}=c | \mathrm{cue}=x) \propto p(\mathrm{cue}=x | \mathrm{category}=c) p(\mathrm{category}=c) \]

That is, the _posterior_ probability of category $c$ given an observed cue value $x$ is proportional to the _likelihood_ that that particular cue value would be generated if the talker intended to say $c$, $p(x | c)$, times the _prior_ probability, or how probable category $c$ is in the current context (regardless of the observed cue value). For good performance, the likelihood function $p(x|c)$ should be as close as possible to the actual distribution of cues that correspond to category $c$ in the current context. However, these cue distributions potentially _differ_ across contexts, due to talker variability (Figure @fig:ideal-adapter-schematic, B), and thus the ideal category boundaries can differ as well (Figure @fig:ideal-adapter-schematic, A). Listeners thus must also take into account their limited knowledge about the cue distributions, given what they know about who is currently talking.

The central insight of the ideal adapter [@Kleinschmidt2015] is that these uncertain beliefs can be modeled as another probability distribution, over the parameters of the category-specific distributions themselves $\theta$, given a talker of type $t$: $p(\theta | t)$ (Figure @fig:ideal-adapter-schematic, C). The type of talker could be a member of some socio-indexical group like $t=\mathrm{male}$ (blue), or a specific individual $t=\mathrm{Jose}$ (purple), or even a generic speaker $t=\mathrm{American\ English}$ (gray). In each case, the listener will have more or less uncertainty about the cue distributions that this type of talker will produce.
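To make the posterior computation concrete, the following minimal sketch computes $p(/s/ | \mathrm{COG})$ for a range of COG values under normal likelihoods, once with (hypothetical) talker-specific distributions and once with broader group-level distributions. All means, standard deviations, and priors here are illustrative values rather than estimates from the corpora analyzed below, and the chunk is not evaluated as part of the analyses.

```{r bayes-posterior-sketch, eval=FALSE}
# Minimal sketch of Bayes-rule classification with normal cue likelihoods.
# All parameter values are hypothetical illustrations, not corpus estimates.
library(tibble)

# posterior p(/s/ | x) for cue value x, given normal likelihoods for /s/ and /ʃ/
posterior_s <- function(x, mu_s, sd_s, mu_sh, sd_sh, prior_s = 0.5) {
  lik_s  <- dnorm(x, mean = mu_s,  sd = sd_s)  * prior_s        # p(x | /s/) p(/s/)
  lik_sh <- dnorm(x, mean = mu_sh, sd = sd_sh) * (1 - prior_s)  # p(x | /ʃ/) p(/ʃ/)
  lik_s / (lik_s + lik_sh)
}

cog <- seq(4500, 6500, by = 10)  # candidate spectral center-of-gravity values (Hz)
posteriors <- tibble(
  cog = cog,
  # talker-specific distributions: narrower, so a steeper category boundary
  talker = posterior_s(cog, mu_s = 5900, sd_s = 150, mu_sh = 5300, sd_sh = 150),
  # group-level distributions (e.g., all male talkers): wider, shallower boundary
  group  = posterior_s(cog, mu_s = 5800, sd_s = 350, mu_sh = 5200, sd_sh = 350)
)
```

With equal priors and equal variances, the posterior crosses 0.5 halfway between the two category means; the wider the cue distributions, the shallower the classification function, which is why the group-level boundaries in Figure @fig:ideal-adapter-schematic A are shallower than Jose's talker-specific boundary.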
Treating speech recognition as inference under uncertainty allows us to formalize how this additional uncertainty about the category-specific cue distributions affects speech recognition by _marginalizing_ over possible cue distributions in order to compute the likelihood (Figure @fig:ideal-adapter-schematic B, thick lines):

\[ p(x | c, t) = \int p(x | c, \theta) p(\theta | t) \mathrm{d}\theta \]

Marginalization is essentially a weighted average of the likelihood under each possible set of cue distributions $p(x|c,\theta)$, weighted by how likely those particular distributions are for a talker of a particular type, $p(\theta | t)$. As an example, the likelihood of a male talker producing /s/ with a spectral center of gravity (COG) of exactly 5500Hz is determined by averaging the likelihood of that COG value under a distribution with a mean of 5400Hz and a standard deviation of 80Hz, with the likelihood under every other possible combination of means and variances, each weighted by how likely it is that a male talker would produce that particular distribution for /s/. Thus, if a listener has grouped together all the male talkers they have previously encountered, they can use their knowledge of the group-level cue distributions to recognize speech from other male talkers they might encounter in the future.

The properties of these socio-indexically conditioned, category-specific cue distributions provide a natural way to measure how much a particular socio-indexical grouping variable is informative or useful with respect to a particular set of phonetic categories. As detailed below in Studies 1 and 2, _informativity_ is defined based on the group-level _distributions_ themselves (e.g., thick lines in Figure @fig:ideal-adapter-schematic B), while _utility_ is defined based on the classification functions/category boundaries those distributions imply (e.g., thick lines in Figure @fig:ideal-adapter-schematic A). These measures are derived directly from treating speech perception as a process of inference under uncertainty in the face of talker variability.

# General methods

## Measuring distributions

The socio-indexically conditioned, category-specific cue distributions were estimated in the following way. First, it is assumed that each phonetic category can be modeled as a normal distribution over cue values (stop voicing as univariate distributions over VOT, and vowels as bivariate distributions of F1 and F2). Each distribution is parameterized by its mean and covariance matrix (or, equivalently, variance in the case of VOT). Next, the mean and covariance were estimated from the samples of cue values in the corpora using the standard, unbiased estimators for the mean and covariance[^cov]. This was done separately for each group/talker, including the group of all talkers (to estimate the marginal distributions). For example, for gender, one /æ/ distribution was obtained from all the tokens from male talkers, and one from all tokens from female talkers. Likewise, for dialect, one distribution was obtained based on all tokens from talkers from the North dialect region, another one from tokens from Mid-Atlantic talkers, and so on.

[^cov]: Using the `mean` and `cov` functions in R 3.4.1 [@R2017].

Assuming that each category is a normal distribution is not a critical part of the proposed approach, but rather a standard and convenient assumption.
In particular, the normal distribution has a small number of parameters and this allows us to efficiently estimate the distribution for each category with a limited amount of data (e.g., five tokens per talker-level vowel distribution). But the proposed method is fully general, and works with _any_ distribution (including discrete or categorical distributions for phonotactics, syntax, etc.). An additional simplifying assumption here is that there is no further, talker-specific learning that occurs. In the ideal adapter, group-conditioned cue distributions reflect the _starting point_ for talker- or situation-specific distributional learning. As I discuss below, the measures I present are best thought of as a _lower-bound_ on informativity/utility that is much easier to estimate from small quantities of speech production data. ## Data sets I analyze the informativity and utility for two types of phonological contrasts, vowels (e.g., /æ/ and /ɛ/) and word-initial stop voicing (e.g., /b/ vs. /p/). I chose these two types of contrast for two reasons. First, for American English the primary acoustic-phonetic cues to vowel identity (formants) and stop voicing (voice onset timing or VOT) are broadly thought to exhibit very different patterns of variability across talkers and talker groups. Indeed, there is at least qualitative evidence in support of this assumption. For example, vowel formants in American English exhibit substantial variability conditioned on the gender and the regional background of the talker [@Peterson1952; @Hillenbrand1995; @Clopper2005; @Labov2006, among others]. On the other hand, word-initial stop VOTs appear to be less variable across talkers in American English. Specifically, cross-talker variation in voiceless word-initial stop VOT is roughly _half_ of within-category variation: visual `r label("vot-within-between-sd")` inspection of Figure 1 in @Chodroff2015 suggests that the mean standard deviation of /p/ is around 20ms, while the standard deviation of the mean VOT of /p/ is less than 10ms (based on a range of 40ms). Cross-talker variability in vowel formants is approximately _double_ the within-category variability [based on Figure 4 in @Hillenbrand1995]. This qualitative difference, and the lack of direct apples-to-apples comparisons between them, makes vowels and word initial stops an interesting combination of contrasts to compare for the present purpose. Second, while the overall level of talker variability for word-initial stop VOTs is lower, there is some evidence that it is nevertheless _structured_ by age, gender, and dialect, among other factors [@Torre2009; @Stuart-smith2015]. I thus expect to find both 1) significant differences in the overall informativity/utility of any socio-indexical variable when comparing across the two types of contrasts (vowels and word initial stops), and 2) significant differences in the informativity/utility within either contrast type when comparing across socio-indexical variables. For vowels, I further assess the consequences of _normalization_ on the informativity/utility of different socio-indexical variables. Vowel formants vary based on physiological differences between talkers (e.g., the size of the vocal tract), and there is evidence that vowel recognition draws on normalized formants---transformations of the raw formant values that adjust for physiological differences [e.g. @Lobanov1971; @Loyd1890; @Monahan2010; for review, see @WeatherholtzInPress]. 
This approach allows us to compare the informativity/utility of socio-indexical variables for raw vs. normalized vowel formants. The particular datasets I analyze here are drawn from three publicly available sources: two collections of elicited vowel productions [@Heald2015; @Clopper2005] and one of word-initial voiced and voiceless stops from unscripted speech [@Nelson2017].[^r-data] These sources were selected because they are annotated for the acoustic-phonetic cues that are standardly considered to be the primary cues to the relevant phonological contrasts (i.e., formants for the vowel productions, voice onset timing for the stop productions), measured under sufficiently controlled conditions to allow meaningful comparisons across talkers, and contain enough tokens from multiple phonetic categories produced by a sufficiently large and diverse population of talkers. The last property is particularly important for the goal of assessing the _joint_ statistical contingencies between socio-indexical variables, linguistic categories, and acoustic-phonetic cues. [^r-data]: All three are available as R packages on Github: [`nspvowels`](https://github.com/kleinschmidt/nspvowels), [`healdvowels`](https://github.com/kleinschmidt/healdvowels), and [`votcorpora`](https://github.com/kleinschmidt/votcorpora) (which contains additional VOT measurements from other sources as well). ### Vowels ```{r nsp-data, cache=TRUE, results='hide'} nsp_vows <- nspvowels::nsp_vows %>% ungroup() %>% mutate(Marginal='all', Dialect_Sex = paste(Sex, Dialect, sep='_'), Vowel=Vowel_ipa) nsp_vows_lob <- nsp_vows %>% group_by(Talker) %>% mutate_at(c("F1", "F2"), funs(. %>% scale() %>% as.numeric())) %>% ungroup() ## check normalization nsp_vows_lob %>% gather(formant, value, F1:F2) %>% group_by(Talker, formant) %>% summarise_at("value", funs(mean, sd)) %$% assert_that(all.equal(mean, rep(0, length(mean))), all.equal(sd, rep(1, length(sd)))) vowel_data <- data_frame(cues = c("F1×F2 (Hz)", "F1×F2 (Lobanov)"), contrast = "Vowels (NSP)", data = list(nsp_vows, nsp_vows_lob)) vowel_groupings <- c('Marginal', 'Sex', 'Dialect', "Dialect_Sex", 'Talker') vowel_data_grouped <- apply_groupings(vowel_data, vowel_groupings) token_per_vow <- nsp_vows %>% group_by(Talker, Vowel) %>% tally() %$% mean(n) n_talkers <- nsp_vows %>% group_by(Talker) %>% summarise() %>% tally() n_per_dialect_sex <- nsp_vows %>% group_by(Dialect, Sex, Talker) %>% summarise() %>% tally() %$% unique(n) n_dialect <- nsp_vows %$% Dialect %>% unique() %>% length() ``` For vowels, I used two datasets. The first is from the Nationwide Speech Project [NSP; @Clopper2006b]. I analyzed first and second formant frequencies (F1×F2, measured in Hertz) recorded at vowel midpoints in isolated, read "hVd" words (e.g., "head", "hid", "had", etc.). This corpus contains `r n_talkers` talkers, `r n_per_dialect_sex` male and female from each of `r n_dialect` regional varieties of American English: North, New England, Midland, Mid-Atlantic, South, and West [see map and summary of typical patterns of variation in @Clopper2005; regions based on @Labov2006]. `r label("nsp-dialect1")` Each talker provided approximately `r round(token_per_vow, 1)` repetitions of each of 11 English monophthong vowels /`r nsp_vows %>% pull(Vowel) %>% levels() %>% lift_dv(paste, sep=", ")()`/, for a total of `r nrow(nsp_vows)` observations. Talkers were recorded in the early 2000s, and were all of approximately the same age, so age-graded sound changes are not likely to be detectable from this dataset. 
`r label("age-graded")` ```{r heald-data} # gender isn't marked but we can figure it out based on the formant values and # the fact that there's three males and five females: male_speakers <- healdvowels::by_speaker %>% unnest(map(model, ~ .$mu %>% as.list() %>% as.data.frame())) %>% group_by(Vowel) %>% arrange(F1) %>% filter(row_number() <= 3) %>% group_by(Speaker) %>% summarise(n=n()) %>% arrange(n) %>% tail(3) %T>% print() %>% pull(Speaker) #' Sample from a multivariate normal model #' #' @param n number of samples #' @param model list with mu and Sigma (mean and covariance) #' #' @return a matrix of samples, with column names taken from the names of the mu #' vector. #' r_model <- function(n, model) { x <- rmvnorm(n, mean=model$mu, sigma=model$Sigma) colnames(x) <- names(model$mu) x } # create gender models heald_by_sex <- healdvowels::by_speaker %>% mutate(Gender = ifelse(Speaker %in% male_speakers, "m", "f"), samples = map(model, r_model, n=1000)) %>% unnest(map(samples, as_data_frame)) %>% nest() %>% apply_groupings("Gender") %>% train_models_grouped(category_col = "Vowel", cue_cols=c("F1", "F2")) strip_f3 <- function(models) { models %>% mutate(model = map(model, update_list, mu = ~mu[1:2], Sigma = ~Sigma[1:2, 1:2])) } heald_models <- healdvowels::models %>% mutate(models = map(models, strip_f3)) %>% bind_rows(heald_by_sex) %>% mutate(cues = "F1×F2 (Hz)", contrast = "Vowels (HN15)") %>% filter(grouping %in% grouping_levels) ``` The second is from a study by @Heald2015. Eight talkers (5 female and 3 male) produced 90 repetitions of 7 monophthong American English vowels /`r healdvowels::by_speaker %>% pull(Vowel) %>% factor(levels=levels(nsp_vows$Vowel)) %>% unique() %>% sort() %>% lift_dv(paste, sep=", ")()`/ over 9 sessions. Due to Human Subject Protocols, this dataset is only available in the form of F1×F2 means and covariance matrices for each category, conditioned on talker, gender, and the marginal distributions. Unlike the NSP, the talkers recorded by @Heald2015 are all from the same American English dialect region (Inland North), and so there is likely less talker variability overall relative to the NSP talkers. #### Vowel normalization One main goal of this paper is to assess not just the degree but the _structure_ of talker variability. Much of the variability in vowel formants is due to physiological differences between talkers' vocal tract size, which increase or decrease all resonant frequencies together [@Loyd1890]. This produces global shifts in talkers' vowel spaces, that apply relatively uniformly across all vowels. In contrast, sociolinguistic factors like dialect can affect the cue-category mapping for individual vowels. Even gender-based differences in the cue-category mappings of vowels have been found to vary cross-linguistically, suggesting that they are partially stylistic [@Johnson2006]. In order to assess how much these category-general shifts contribute to talker variability in vowel formant distributions, I analyze formant frequencies from the NSP[^healdnorm] represented in raw Hz, and also in Lobanov-normalized form. Lobanov normalization z-scores F1 and F2 separately for each talker [@Lobanov1971], which effectively aligns each talker's vowel space at its center of gravity, and scales it so they have the same size (as measured by standard deviation). This controls for overall offset in formant frequencies caused by varying vocal tract sizes (from both gender differences and individual variation). 
It does this while preserving the structure of each talker's vowel space, so that (for instance) dialect-specific vowel shifts are maintained, as we will see below. `r label("r1-norm-methodological")` Note that this is one of many possible normalization methods [see @Flynn2011; @Adank2004], and it is used here as a methodological tool, rather than a cognitive model of how normalization might work itself. The selection of this particular normalization method was driven primarily by methodological constraints: it provides good alignment of talkers' overall vowel spaces, and does not require additional cues that are not included in our data sources [like fundamental frequencies and higher formants required by vowel-intrinsic normalization methods; @Flynn2011; @WeatherholtzInPress]. Normalization and learning (adaptation) are often framed as _alternative_ models for how listeners cope with talker variability, but they are not mutually exclusive [@WeatherholtzInPress] and "hybrid models" may even be possible (as I briefly discuss in the general discussion).

[^healdnorm]: Without access to the raw data, it is not possible to normalize the @Heald2015 vowels.

### Stop voicing

```{r vot-data, cache=TRUE}
vot <- votcorpora::vot %>%
  filter(source == 'buckeye') %>%
  rename(Talker = subject, Sex = sex, Age = age_group) %>%
  group_by(phoneme, Talker) %>%
  mutate(Token = row_number(), cues = 'VOT', contrast = "Stop voicing") %>%
  ungroup() %>%
  mutate(Marginal = 'all')

vot_by_place <- vot %>% group_by(place, cues, contrast) %>% nest()

vot_groupings <- c('Marginal', 'Sex', 'Age', 'Talker')
vot_by_place_grouped <- apply_groupings(vot_by_place, vot_groupings)

n_vot_talkers <- vot %>% group_by(Talker) %>% summarise() %>% nrow()
n_vot_per_talker <- vot %>% group_by(phoneme, voicing, place, Talker) %>% tally()
```

I also analyzed data on word-initial stop consonant voicing in conversational speech from the Buckeye corpus [@Pitt2007; extracted by @Nelson2017; @Wedel2018]. @Nelson2017 manually measured VOT for `r nrow(vot)` word-initial stops with labial (/p,b/), coronal (/t,d/), or dorsal (/k,g/) places of articulation. Of these, `r vot %>% filter(voicing=='voiced') %>% nrow()` were voiced and `r vot %>% filter(voicing=='voiceless') %>% nrow()` were voiceless. Data came from `r n_vot_talkers` talkers, who were approximately balanced between male and female and between younger than 30 and older than 40 years (Table @tbl:talkers-per-group). On average, each talker produced `r round(mean(n_vot_per_talker$n))` tokens per word-initial stop phoneme (range of `r min(n_vot_per_talker$n)` -- `r max(n_vot_per_talker$n)`). @Nelson2017 excluded words with more than two syllables, function words, as well as words that began an utterance, followed a filled pause, disfluency, or another consonant. They also excluded tokens with VOT or closure length "more than 3 standard deviations from the speaker-specific mean for that stop" [@Nelson2017 p. 8]. They did not, unlike many previous studies on VOT, exclude words with complex onsets (a stop followed by a liquid or a glide).

In modeling VOT as a cue to voicing, I chose to model each place of articulation separately. This is because there is some variation in VOT as a result of place of articulation, and treating, for instance, voiceless tokens from all three places as coming from the same distribution could obscure talker-level variation and bias the results against detecting talker- or group-level variation in VOT.
Moreover, VOT in English can vary as a result of speaking rate, both at the level of the talker and of individual tokens [@Sole2007]. In principle, it would be interesting to investigate the effect of using normalized VOT. However, in order to meaningfully compare with the normalized vowel formants investigated here, a token-extrinsic (or talker-level) normalization procedure is needed, because a token-intrinsic procedure would eliminate token-to-token variation in speaking rate as well as overall talker effects, while the Lobanov normalization used for vowels eliminates only talker-level effects. Using a Lobanov-like z-scoring technique may lead to artifacts because of the large differences in the variance of voiced and voiceless distributions. As a result, investigating the effect of normalization on informativity and utility for voicing is left for future work. `r label("vot-normalization")`

## Socio-indexical grouping variables

| Grouping       | Vowels (NSP)                  | Vowels (HN15)                   | Stop voicing (VOT)                |
|----------------|-------------------------------|---------------------------------|-----------------------------------|
| Marginal       | **1** group of **48** talkers | **1** group of **8**            | **1** group of **24**             |
| Age            | N/A                           | N/A                             | **2** groups of **10** and **14** |
| Gender         | **2** groups of **24**        | **2** groups of **3** and **5** | **2** groups of **11** and **13** |
| Dialect        | **6** groups of **8**         | N/A                             | N/A                               |
| Dialect+Gender | **12** groups of **4**        | N/A                             | N/A                               |
| Talker         | **48** groups                 | **8** groups                    | **24** groups                     |

Table: Socio-indexical variables analyzed here, and distribution of talkers across groups in each corpus. See below for more detail on each of the corpora. {#tbl:talkers-per-group}

Based on the variables annotated in the available data, I consider cue distributions for each phonetic category conditioned on the following socio-indexical grouping variables, roughly in order of specificity (number of talkers in each group):

* __Marginal__: control grouping, which includes all tokens for the category from all talkers. This serves as a baseline against which more specific group distributions can be compared, and as a lower bound for speech recognition accuracy.
* __Gender__: coded as male/female for both vowels and stop voicing, allowing us to compare the role of gender-specific variation for two different contrasts.
* __Age__: coded as older than 40/younger than 30 for VOT (in the Buckeye corpus). Not applicable to vowels, because the talkers are uniformly young by this cutoff.
* __Dialect__: the NSP contains data from talkers from six dialect regions (see below for details). Not applicable to VOT or to vowels from @Heald2015.
* __Dialect+Gender__: @Clopper2006b found that gender modulates dialect differences, so I also examined cue distributions conditioned on dialect and gender together (12 levels).
* __Talker__: for all corpora, talker-specific cue distributions serve as an upper bound on informativity and utility.

Note that when considering one socio-indexical grouping like age, this method ignores other grouping variables like dialect, gender, or talker. That is, when asking how informative or useful the variable of age is, we are asking what a listener would gain by knowing *only* the age (group) of an unfamiliar talker.

Next, I present two studies that apply the two measures of structure in talker variability to these datasets. First, I show how to assess the _informativity_ of these different grouping variables about the cue distributions themselves.
Then, I assess the _utility_ of these different grouping variables, in terms of how they affect the accuracy of correct recognition. # Study 1: How _informative_ are socio-indexical groups about vowel formant and VOT distributions? ```{r overlap-figure, fig.width=7.4, fig.height=5.6, fig.cap="Gender-specific distributions of vowel formants for /i/ appear to diverge from the overall (marginal) distributions (A), whereas for VOT the gender-specific distributions are essentially indistinguishable from the marginal distributions. Intuitively, this makes gender informative for vowel formants, but not for VOT [see also vowels in @Perry2001; vs. VOT in @Morris2008]. The proposed approach formalizes this intuition in a quantitative measure that can be applied to directly compare talker variability across different cues, phonetic contrasts, and socio-indexical grouping variables. Vowel data is drawn from the Nationwide Speech Project, and VOT from the Buckeye corpus (see below for more details)."} marg_color <- "grey50" vowel_p <- nspvowels::nsp_vows %>% filter(Vowel_ipa == "i") %>% mutate(Gender = forcats::fct_recode(Sex, Male="m", Female="f")) %>% ggplot(aes(x=F1)) + stat_density(data = .%>%select(-Gender), color=marg_color, fill=NA) + stat_density(aes(color=Gender), fill=NA, position="identity") + facet_wrap(~Gender) + lims(x = c(150, 550)) + theme(legend.position="hide", strip.background=element_blank(), strip.text=element_blank(), axis.line.y=element_blank(), axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank(), panel.grid.major.y=element_blank() ) + geom_text( data=tribble( ~Gender, ~F1, ~density, ~label, "Female",400, 0.0075, "p(F1 | /i/, Female)", "Male", 330, 0.0075, "p(F1 | /i/, Male)"), aes(color=Gender, y=density, label=label), vjust=1, hjust=0) + geom_label(data=data.frame(F1=420, density=0.003, label="p(F1 | /i/)"), aes(x=F1, y=density, label=label), color=marg_color, alpha=0.75, label.size=NA, hjust=0, vjust=1) + labs(x="First formant (F1, Hz)") + ggtitle(label="", subtitle="Distribution of F1 of /i/ by gender") + scale_color_manual(values = c("#FF20AF", "#00D2F7")) ## vot_p <- votcorpora::vot %>% filter(source=="buckeye", voicing == "voiceless") %>% mutate(Gender = forcats::fct_recode(sex, Male="m", Female="f")) %>% ggplot(aes(x=vot)) + stat_density(data = .%>%select(-Gender), color=marg_color, fill=NA) + stat_density(aes(color=Gender), fill=NA, position="identity") + facet_wrap(~Gender) + lims(x=c(0, NA)) + theme(legend.position="hide", strip.background=element_blank(), strip.text=element_blank(), axis.line.y=element_blank(), axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank(), panel.grid.major.y=element_blank() ) + geom_text( data=tribble( ~Gender, ~vot, ~density, ~label, "Female",80, 0.014, "p(VOT | voiceless, Female)", "Male", 80, 0.014, "p(VOT | voiceless, Male)"), aes(color=Gender, y=density, label=label), vjust=1, hjust=0) + geom_label(data=data.frame(vot=100, density=0.007, label="p(VOT | voiceless)"), aes(x=vot, y=density, label=label), color=marg_color, alpha=0.75, label.size=NA, hjust=0, vjust=1) + labs(x="Voice-onset time (VOT, ms)") + ggtitle(label="", subtitle="Distribution of voiceless stop VOT by gender") + scale_color_manual(values = c("#FF20AF", "#00D2F7")) ## plot_grid(vowel_p, vot_p, ncol=1, align=TRUE, labels=c("A", "B")) ``` The first method I propose for assessing structure in talker variability is to measure how _informative_ socio-indexical variables are about the category-specific cue 
distributions. One way to quantify how informative a socio-indexical grouping variable is about cue distributions is by comparing the group-level cue distributions with the _marginal_ distribution of cues from all groups. The reason for this is that if a socio-indexical grouping variable (e.g., gender) is _not_ informative about cue distributions, then the cue distributions for each group (e.g., male and female talkers) will be indistinguishable from the overall "marginal" cue distribution (e.g., Figure @fig:overlap-figure B). If, on the other hand, a socio-indexical variable _is_ informative about cue distributions, then the distribution for each group will deviate substantially from those of other groups, and by extension from the marginal distribution as well (Figure @fig:overlap-figure A).

The particular measure I use to compare distributions is the Kullback–Leibler (KL) divergence. This measure is intuitively similar to the proportion of variance explained by a socio-indexical grouping variable [e.g., for gender and region in Dutch vowels @Adank2004; for various contextual variables including talker in American English fricatives @McMurray2011a]. However, it is a more general approach that does not require that we assume that the underlying distributions are normal distributions, and can be applied even to categorical variables (like distributions of words or syntactic structures). It also naturally extends to multidimensional cue spaces, taking into account the correlations between cues, and supporting comparisons to other cue spaces.

## Methods

The KL divergence is a measure of how much a probability distribution $Q$ diverges from a "true" distribution $P$. In this case, the distributions are over phonetic cues (VOT or F1×F2), and the "true" distribution is the distribution conditioned on a socio-indexical variable (e.g., gender), while the comparison distribution is the marginal distribution, which ignores any socio-indexical grouping. Intuitively, the KL divergence measures the loss of information when you use a code optimized for $Q$ to encode values from $P$. For instance, the frequencies of letters in English sentences are very different from those of French sentences. If we use a binary representation of letters that is optimized to make the representation of French sentences as short as possible (while still unambiguous), applying the same representation to English sentences will result in longer forms than a code that is optimized for the frequencies of English letters (and vice versa). That difference is the KL divergence (measured in bits) of the distribution of letters in French from that of English. Similarly, a code optimized for the marginal distribution of letters from both French and English _combined_ will result in sub-optimal encoding for _both_ English and French sentences, and the degree of sub-optimality provides a measure of how much the language matters in understanding the distribution of letters.

Here, the KL divergence is used in an analogous way to measure the _informativity_ of a socio-indexical variable (e.g., gender) with respect to phonetic cue distributions (e.g., VOT). Specifically, informativity is defined as the KL divergence of the marginal distribution of phonetic cues (e.g., $p(\mathrm{VOT} | \mathrm{category})$) from each of the socio-indexically-conditioned distributions (e.g., $p(\mathrm{VOT} | \mathrm{category}, \mathrm{gender})$).
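As a concrete illustration of this definition, the following minimal sketch computes informativity for a univariate cue (VOT) and the grouping variable of gender, using the closed-form KL divergence between two normal distributions (the multivariate version of this expression is given in the Technical details below). The function and column names (`informativity`, `kl_norm_bits`, `vot`, `voicing`, `Gender`) are illustrative, not the `phondisttools` implementation.

```{r informativity-sketch, eval=FALSE}
# Minimal sketch: informativity of a grouping variable as the average KL
# divergence (in bits) of the marginal cue distribution from each
# group-conditioned distribution, assuming univariate normal distributions.
# Function and column names are illustrative, not the phondisttools API.
library(dplyr)

# KL divergence of N(mu_q, sd_q) (e.g., the marginal) from N(mu_p, sd_p)
# (e.g., a group-conditioned distribution), converted from nats to bits
kl_norm_bits <- function(mu_p, sd_p, mu_q, sd_q) {
  (log(sd_q / sd_p) + (sd_p^2 + (mu_p - mu_q)^2) / (2 * sd_q^2) - 1/2) / log(2)
}

# d: one row per token, with columns vot (cue), voicing (category), Gender (group)
informativity <- function(d) {
  by_group <- d %>%
    group_by(voicing, Gender) %>%
    summarise(mu = mean(vot), sd = sd(vot)) %>%
    ungroup()
  marginal <- d %>%
    group_by(voicing) %>%
    summarise(mu_m = mean(vot), sd_m = sd(vot)) %>%
    ungroup()
  # average over categories and groups
  by_group %>%
    left_join(marginal, by = "voicing") %>%
    summarise(informativity = mean(kl_norm_bits(mu, sd, mu_m, sd_m))) %>%
    pull(informativity)
}
```

Note that with finite samples the estimated divergence is never exactly zero, even for an uninformative grouping, which is one reason the permutation test described below is needed to establish a chance baseline.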

### Procedure

For each phonological category (e.g., /b/), I calculate the KL divergence of each group's cue distribution (e.g., /b/-specific VOTs for male vs. female talkers) from the marginal distribution of cues from all talkers (e.g., /b/-specific VOTs regardless of the talker's gender). I then average across the category-specific KL divergences for all phonological categories (e.g., /b,p,t,d,k,g/) to calculate the average KL divergence for that phonetic cue (e.g., VOT) and group (e.g., male). Finally, for each grouping variable, I further average these group-specific divergences (e.g., male and female) to get the overall informativity for the grouping variable (gender).

I average over categories for two reasons. First, it is mathematically convenient: the KL divergence between two normal distributions can be computed in closed form, whereas for a mixture of multiple distributions it would have to be estimated through computationally costly numerical integration. Second, averaging over categories naturally adjusts for differences in the number of vowel (7--11) and stop voicing (2) categories.

The resulting informativity scores can be evaluated with a permutation test, by randomly shuffling the group labels 1000 times and repeating the calculation. The resulting distribution of shuffled scores is an estimate of the _null_ distribution of informativity for the same cues, which controls for the number of talkers in each group and the intrinsic properties of the cue distributions. For the grouping variable of Talker, the labels are permuted by token; for all other grouping variables they are shuffled at the talker level.

### Technical details

The KL divergence measures how much better the "true" distribution predicts data that is actually drawn from that distribution than the candidate distribution does. Mathematically, the KL divergence of $Q$ from $P$ is defined to be
$$D_{\mathrm{KL}}(Q \,\|\, P) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x$$ {#eq:kl}
(with density functions $q$ and $p$, respectively). The term $\log p(x)/q(x)$ measures how much more (or less, if negative) probability $P$ assigns to a point $x$ than $Q$ does. The KL divergence is the average of this quantity over all data that could be generated by $P$, weighted by the probability $p(x)$ that each $x$ would be generated by $P$. The KL divergence increases as $Q$ diverges more from $P$, and has a minimum value of zero, which is achieved only when $P=Q$, i.e., when the two distributions are identical [@Mackay2003, p. 34].

In this case, $P=\mathcal{N}_G$ is a multivariate Normal cue distribution conditioned on a socio-indexical group, with mean $\mu_G$ and covariance $\Sigma_G$, while $Q=\mathcal{N}_M$ is the marginal (not conditioned on group) cue distribution with mean $\mu_M$ and covariance $\Sigma_M$. With some simplification,[^gaus-kl] the KL divergence of the marginal from the group distribution works out to be
$$ D_{\mathrm{KL}}(\mathcal{N}_M \,\|\, \mathcal{N}_G) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_M^{-1} \Sigma_G) + (\mu_M - \mu_G)^\top \Sigma_M^{-1} (\mu_M - \mu_G) - d + \log\frac{|\Sigma_M|}{|\Sigma_G|} \right) $$ {#eq:klnorm}
where $d$ is the dimensionality of the distribution (i.e., 1 for stop VOTs and 2 for vowel F1×F2). The base of the logarithm in equation @eq:kl determines the units. For ease of interpretation, I report KL in bits, which corresponds to using base-2 logarithms in equation @eq:kl, or equivalently dividing equation @eq:klnorm by $\log(2)$.

[^gaus-kl]: See, for instance, <http://stanford.edu/~jduchi/projects/general_notes.pdf>, p. 13.
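
For concreteness, equation @eq:klnorm can be written as a small standalone function. The sketch below is illustrative only (the analyses reported here rely on the `KL_mods` helper used in `run_kl`), and the function name `kl_gaussian_bits` and the example values are my own.

```{r kl-gaussian-example, eval=FALSE, echo=TRUE}
# Illustrative implementation of equation @eq:klnorm: KL divergence (in bits)
# of a marginal Gaussian (mu_m, sigma_m) from a group-specific Gaussian
# (mu_g, sigma_g). Works for the univariate and multivariate cases alike.
kl_gaussian_bits <- function(mu_g, sigma_g, mu_m, sigma_m) {
  mu_g <- as.matrix(mu_g); sigma_g <- as.matrix(sigma_g)
  mu_m <- as.matrix(mu_m); sigma_m <- as.matrix(sigma_m)
  d <- nrow(sigma_g)
  sigma_m_inv <- solve(sigma_m)
  nats <- 0.5 * (sum(diag(sigma_m_inv %*% sigma_g)) +
                 t(mu_m - mu_g) %*% sigma_m_inv %*% (mu_m - mu_g) -
                 d + log(det(sigma_m) / det(sigma_g)))
  as.numeric(nats) / log(2)  # convert from nats to bits
}

# Hypothetical group-specific vs. marginal /i/ F1xF2 distributions
# (means and covariances invented for illustration):
kl_gaussian_bits(mu_g = c(300, 2400), sigma_g = diag(c(40, 150)^2),
                 mu_m = c(350, 2300), sigma_m = diag(c(60, 200)^2))
```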
The math is the same for the univariate special case, as with VOT. ```{r kl-helpers, cache=TRUE} #' Compute KL relative to reference grouping #' #' @param models a tbl containing models for each grouping level, with columns #' `grouping` (the grouping variable), `models` (a list column with tbls of #' trained models for each category × group combination, and any additional #' columns to be used to match up the reference grouping with the others #' (e.g., cue format, place of articulation for VOT, etc.) #' @param reference_grouping the grouping level that's used to calculate the KL #' divergence of the other levels (e.g., Marginal). #' @param category_col quoted name of the column that has the phonetic category #' in `models` #' #' @return a tbl with columns `grouping`, `group`, `reference_group`, the #' category col from the input, any other columns from the input, and `KL`. #' run_kl <- function(models_grouped, reference_grouping, category_col) { assert_that(has_name(models_grouped, "models"), has_name(models_grouped, "grouping")) assert_that(reference_grouping %in% models_grouped[['grouping']]) models_grouped %>% filter(grouping == reference_grouping) %>% mutate(reference_models = map(models, rename_, reference_group = 'group')) %>% select(-grouping, -models) %>% left_join(models_grouped %>% filter(grouping != reference_grouping)) %>% mutate(kl_from_reference = map2(models, reference_models, ~ left_join(.x, .y, by=category_col)%>% mutate(KL = map2_dbl(model.x, model.y, KL_mods)) %>% select_('group', 'reference_group', category_col, 'KL') ) ) %>% unnest(kl_from_reference) } ``` ```{r vowel-kl, cache=TRUE, dependson=c('kl-helpers', 'vowel-data')} vowel_kl <- vowel_data_grouped %>% train_models_grouped(category_col = 'Vowel', cue_cols = c('F1', 'F2')) %>% run_kl(category_col = 'Vowel', reference_grouping = 'Marginal') %>% filter(!is.na(KL)) # NAs come from one vowel ('uh') that # only has one token for one talker. ``` ```{r heald-vowel-kl, cache=TRUE, dependson=c('kl-helpers', 'heald-data')} vowels_heald_kl <- heald_models %>% run_kl(reference_grouping = "Marginal", category_col = "Vowel") ``` ```{r vot-kl, cache=TRUE, dependson=c('kl-helpers', 'vot-data')} vot_kl <- vot_by_place_grouped %>% train_models_grouped(category_col = 'voicing', cue_cols = 'vot') %>% run_kl(reference_grouping = 'Marginal', category_col = "voicing") ``` ### Hypothesis testing by permutation test To assess whether any particular KL divergence is different from chance, I re-ran the same analysis on 1000 random permutations of the dataset, where talkers are randomly re-assigned to groups (or tokens to talkers, for talker as a grouping variable). The permutation test $p$ value for a particular measure is the proportion of these randomly permuted data sets that led to a value of that measure that was as high or higher than the real assignment of talker to groups (or tokens to talkers). There are a number of advantages to this technique for directly estimating the distribution of the test statistic (informativity or KL divergence) under the null hypothesis that the assignment of talkers to groups does not matter. First, it controls for the differences in group size. For instance, in the NSP, there are 6 talkers per dialect, but 24 per gender. Fewer talkers means that there will be fewer tokens per category, which leads to more variable estimates and higher average diversion from the marginal distributions. Second, it accounts for the intrinsic asymmetry in KL divergence, which is always greater than 0. 
Third, it is flexible enough to support arbitrary test statistics, including the grouping variable-level summary score (average over groups), single-group score (averaged over phonetic categories), and individual group-category scores (e.g., particular dialect-vowel combinations). ```{r kl-permutation, cache=TRUE, dependson=c('kl-helper', 'vot-data')} ## need to control somehow for the number of talkers. use random permutations ## of groupings, by shuffling talkers #' Shuffle levels of a column in a tibble #' #' @param tbl Tibble with column to shuffle #' @param group Quoted name (string) of group column to shuffle #' @return tbl, with values of group column permuted shuffle_group <- function(tbl, group) { labels <- groups(tbl) group <- sym(group) group_labels <- tbl %>% group_by(!! group, add=TRUE) %>% summarize() %>% ungroup() %>% mutate(!! group := sample(!! group)) tbl %>% ungroup() %>% select(-one_of(map_chr(labels, as_string))) %>% left_join(group_labels, by=quo_name(group)) %>% group_by( !!! labels) } set.seed(1001) n_perm <- 1000 batch_n <- function(x, n) { split(x, ceiling(seq_along(x) * n / length(x))) } # now run the whole KL pipeline, shuffling the talkers... # can't do this easily on Heald data because don't have the underlying data... vot_kls_perm <- map( batch_n(seq_len(n_perm), 20), ~ future(map(.x, function(perm_iter) { vot_by_place_grouped %>% filter(grouping != "Talker") %>% mutate(data = map(data, shuffle_group, group="Talker")) %>% train_models_grouped(category_col = "voicing", cue_cols = "vot") %>% run_kl(reference_grouping = "Marginal", category_col = "voicing") %>% mutate(perm_iter = perm_iter) }))) %>% map(values) %>% flatten() %>% bind_rows() vowel_kls_perm <- map( seq_len(n_perm), ~ future(vowel_data_grouped %>% filter(grouping != "Talker") %>% mutate(data = map(data, shuffle_group, group="Talker")) %>% train_models_grouped(category_col = 'Vowel', cue_cols = c('F1', 'F2')) %>% run_kl(category_col = 'Vowel', reference_grouping = 'Marginal') %>% mutate(perm_iter = .x))) %>% map(values) %>% flatten() %>% bind_rows() %>% filter(!is.na(KL)) # NAs come from one vowel ('uh') that # only has one token for one talker. kl_perm_summary <- bind_rows(vot_kls_perm, vowel_kls_perm) %>% group_by(cues, contrast, grouping, perm_iter) %>% summarise(KL = mean(KL)) ``` ```{r talker-kl-perm, cache=TRUE, dependson=c('kl-helper', 'vot-data')} #' Shuffle grouping variables within groups defined by another column #' #' @param tbl Grouped tibble #' @param within Quote name (string) of column to define groups within which the #' native groups of tbl will be shuffled. #' @return a copy of \code{tbl} with all grouping columns shuffled within each unique #' value of \code{within}. #' shuffle_groups_within <- function(tbl, within) { gs <- groups(tbl) tbl %>% group_by(!! sym(within)) %>% mutate_at(map_chr(gs, as_string), sample) %>% group_by(!!! gs) } # permutate tokens within phonemes to do talker. 
vowel_marginal_models <- vowel_data_grouped %>% filter(grouping == "Marginal") %>% train_models_grouped(category_col="Vowel", cue_cols=c("F1", "F2")) plan(multicore) set.seed(1002) vowel_kls_talker_perm <- map(seq_len(n_perm), ~ future(vowel_data_grouped %>% filter(grouping == "Talker") %>% mutate(data = map(data, shuffle_groups_within, within="Vowel")) %>% train_models_grouped(category_col="Vowel", cue_cols=c("F1", "F2")) %>% bind_rows(vowel_marginal_models) %>% run_kl(reference_grouping = "Marginal", category_col = "Vowel") %>% mutate(perm_iter = .x))) %>% map(values) %>% flatten() %>% bind_rows() vot_marginal_models <- vot_by_place_grouped %>% filter(grouping == "Marginal") %>% train_models_grouped(category_col="voicing", cue_cols="vot") vot_kls_talker_perm <- map( seq_len(n_perm), ~ future(vot_by_place_grouped %>% filter(grouping == "Talker") %>% mutate(data = map(data, shuffle_groups_within, within="voicing")) %>% train_models_grouped(category_col = "voicing", cue_cols = "vot") %>% bind_rows(vot_marginal_models) %>% run_kl(reference_grouping = "Marginal", category_col = "voicing") %>% mutate(perm_iter = .x))) %>% map(values) %>% flatten() %>% bind_rows() ``` ```{r kl-perm-summary} kl_perm_summary <- bind_rows(vot_kls_perm, vowel_kls_perm, vot_kls_talker_perm, vowel_kls_talker_perm) %>% group_by(cues, contrast, grouping, perm_iter) %>% summarise(KL = mean(KL, na.rm=TRUE)) ``` ## Results {#sec:kl-results} I first report and discuss the broad patterns for the informativity of different grouping variables in the three vowel and stop voicing databases described above. In short, the results show first that there is more talker variability in vowels than stop voicing, and is reasonably consistent across two vowel corpora. Second, this talker variability is also _structured_ for vowels: grouping talkers according to gender, dialect, or the combination thereof leads to more informative groupings than random groupings of the same number of talkers. The same is not true for voicing. Finally, I illustrate how the proposed informativity measure captures well-documented dialectal variation in vowels. ```{r vowel-vot-kl-plot, dependson=c("grouping-and-colors", "heald-vowel-kl", "vot-kl", "vowel-kl", "kl-perm-summary"), fig.width=8.6, fig.height=6, fig.cap="Socio-indexical variables are more informative about cue distributions for vowel formants [HN15, @Heald2015; NSP, @Clopper2006b] than for stop voicing (VOT), even after Lobanov normalization. On top of this, more specific groupings (like Talker and Dialect+Gender) are more informative than broader groupings (Gender). Each open point shows one group (e.g., _male_ for _Gender_), while shaded points show the average over groups. 
Gray violins show the null distribution of average informativity (KL estimated from 1000 datasets with randomly permuted group labels), and stars show significance of the variable's average KL with respect to this null distribution (`*`: $p<0.05$, `**`: $p<0.01$, `***`: $p<0.001$)."} kl_by_group <- bind_rows(vot_kl, vowel_kl, vowels_heald_kl) %>% group_by(contrast, cues, grouping, group) %>% summarise(KL = mean(KL)) kl_summary <- kl_by_group %>% group_by(contrast, cues, grouping) %>% summarise(KL=mean(KL)) %>% left_join(kl_perm_summary %>% group_by(contrast, cues, grouping, perm_iter) %>% summarise(KL=mean(KL)), by=c("contrast", "cues", "grouping"), suffix = c("", "_perm")) %>% group_by(contrast, cues, grouping) %>% summarise(KL = unique(KL), perm_p = mean(KL <= KL_perm), n_perm = length(KL_perm)) %>% mutate(perm_p_stars = ifelse(is.na(perm_p), NA, p_val_to_stars(perm_p))) %>% ungroup() %>% prettier_grouping() %>% pretty_contrast() kl_by_group %<>% ungroup() %>% prettier_grouping() %>% pretty_contrast() cues_plus_contrast <- function(d) mutate(d, cues_contrast = paste(cues, contrast, sep="\n")) kl_summary %>% cues_plus_contrast() %>% ggplot(aes(x=grouping, y=KL, color=grouping)) + geom_violin(data = kl_perm_summary %>% ungroup() %>% prettier_grouping() %>% pretty_contrast() %>% cues_plus_contrast(), color=NA, fill="black", alpha=0.2, scale="width") + geom_quasirandom(data = kl_by_group %>% cues_plus_contrast, alpha=0.2, size=2) + geom_point(size=4) + geom_text(aes(label = perm_p_stars), nudge_y = 0.25, color="black") + facet_grid(.~cues_contrast, scales='free', space='free') + # facet_grid(.~cues+contrast, scales='free', space='free') + rotate_x_axis_labs() + labs(x = "", y = 'KL Divergence of cue distributions\nfrom marginal (bits)', color = "") + theme(plot.title=element_text(hjust=0)) + scale_color_grouping() + scale_y_continuous(breaks=seq(0,6)) + theme(legend.position=c(1, .95), legend.justification=c("right", "top"), legend.background=element_rect(fill=gray(1, alpha=0.5))) + guides(color = guide_legend(label.position="left", label.hjust=1)) ``` Figure {@fig:vowel-vot-kl-plot} shows the informativity of gender, dialect, and talker identity, as measured by the average KL divergence between cue distributions of each phonetic category conditioned on these factors from the overall (marginal) cue distributions. I make three observations. ```{r vowel-kl-same-dialect, cache=TRUE, dependson=c("nsp-data")} nsp_kl_same_dialect <- vowel_data_grouped %>% filter(grouping %in% c("Talker", "Dialect")) %>% train_models_grouped(category_col="Vowel", cue_cols=c("F1", "F2")) %>% run_kl(category_col="Vowel", reference_grouping="Dialect") %>% left_join(nsp_vows %>% group_by(Talker, Dialect) %>% summarise(), by=c(group="Talker")) %>% filter(reference_group == Dialect) talker_summary <- . 
%>% filter(grouping=="Talker") %>% group_by(group) %>% summarise(KL=mean(KL)) %>%
  do(daver::boot_ci(.$KL, function(d,i) mean(d[i], na.rm=TRUE))) %$%
  sprintf("(%.1f bits, 95%% CI [%.1f--%.1f])", observed, ci_lo, ci_high)

talker_kl_strs <- list(same=nsp_kl_same_dialect, marg=vowel_kl, heald=vowels_heald_kl) %>%
  map_chr(talker_summary)
```

First, there are major differences in talker variability between vowels and stop consonant voicing: talker identity is an order of magnitude more informative about vowel distributions than about VOT distributions.[^two-cues] That is, knowing a talker's identity provides substantially more information about their vowel formant frequency distributions than about their VOT distributions. This quantitatively confirms the qualitative understanding that there is less talker variability in VOT than in formant frequencies [e.g., @Allen2003; @Lisker1964; vs. @Peterson1952; @Hillenbrand1995]. Strikingly, the _most_ informative variable for VOT---talker identity---is roughly as informative as the _least_ informative variable for vowels (Gender, for Lobanov-normalized F1×F2).

[^two-cues]: This is true even when considering just F1 or F2 in isolation. The KL divergence for the joint distribution of two independent cues is the sum of the KL divergences for the individual cues. For vowels, the F1×F2 informativity is approximately equal to the sum of the individual F1 and F2 informativities.

Across the two vowel corpora, the level of talker variability appears to be lower in the HN15 data than in the NSP data, though not as low as in the VOT data. One possible explanation of this discrepancy is that the HN15 talkers are all from the same dialect region, while @Clopper2006b intentionally recruited talkers to demonstrate dialect variability. And, indeed, individual NSP talkers' distributions diverge less from the corresponding dialect distributions `r talker_kl_strs["same"]` than they do from the marginal distributions `r talker_kl_strs["marg"]`. But this divergence is still substantially larger than the average for the HN15 talkers `r talker_kl_strs["heald"]`, suggesting that this is not the only explanation. Another possibility is that the smaller number of tokens from each NSP talker means that the individual talker distribution estimates are noisier. Unfortunately, without access to the underlying single-token F1×F2 values for the HN15 talkers, it is difficult to assess this. The informativity of gender, on the other hand, is similar across the two datasets, which suggests that datasets of this size are sufficient to obtain replicable estimates of the informativity of gender.

Second, with one notable exception, I find that grouping variables with fewer talkers per group are more informative than groupings with more talkers per group: talker identity is the most informative, followed by (for NSP vowels) dialect+gender and dialect, then gender and age. The one exception is that for _un_-normalized formants, gender is substantially more informative than dialect, even though it is one of the most general grouping variables, with each group including half the talkers. This is to be expected: gender differences (whether stylistic or physiological, like vocal tract length) change formant frequencies for all vowels by large amounts [@Johnson2006], while dialect variation is limited to certain dialect-vowel combinations [@Clopper2005; @Labov2006].

My third observation is about the effect of normalization. As expected, Lobanov normalization substantially reduces the informativity of gender.
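
For reference, Lobanov normalization simply z-scores each talker's formant values within each formant dimension. A minimal sketch is shown below; the column names follow the NSP data, but this is an illustration rather than the normalization pipeline actually used for the analyses here.

```{r lobanov-example, eval=FALSE, echo=TRUE}
# Minimal sketch of Lobanov normalization: z-score F1 and F2 within each
# talker, so every talker's formants have mean 0 and SD 1 across their
# whole vowel space.
library(dplyr)

lobanov_normalize <- function(d) {
  d %>%
    group_by(Talker) %>%
    mutate(F1_lobanov = (F1 - mean(F1)) / sd(F1),
           F2_lobanov = (F2 - mean(F2)) / sd(F2)) %>%
    ungroup()
}

# e.g., lobanov_normalize(nspvowels::nsp_vows)
```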
Reducing the informativity of gender is, after all, one of the *purposes* of normalization---removing differences between male and female vowel distributions that are due to overall shifts in formant frequencies. However, gender still carries some information about Lobanov-normalized vowel distributions. This is in line with previous observations that Lobanov normalization---while among the most effective normalizations---is not perfect [e.g., @Escudero2007; @Flynn2011]. Additionally, there is still substantial talker variability even in normalized vowel distributions. This, together with the non-zero informativity of gender, supports arguments against (Lobanov) normalization as the sole mechanism by which listeners overcome talker variability [see @Johnson2005 for discussion]. Finally, even for normalized vowel distributions, the informativity of dialect and gender together is still higher than the informativity of dialect alone. This suggests that dialect differences themselves are modulated by gender [as noted by @Clopper2005].

### Informativity and dialect variation

One advantage of the proposed measure of informativity is that it can assess whether a grouping variable is equally informative about all categories, or whether a particular grouping is especially informative about specific types of categories. Figure {@fig:vowel-vot-kl-plot} illustrates this for vowel categories compared to stop categories. In this section, I show how the same approach can be used to investigate differences in the informativity of a grouping factor for different _vowels_. This provides a principled quantitative measure of, for example, vowel-specific dialectal variation. If factors like dialect are differentially informative about the distributions of some vowels versus others, then listeners may track dialect-specific distributions for some vowels but not for others.

`r label("nsp-dialect2")` As Figure {@fig:vowel-kl-by-category} shows, informativity varies considerably by vowel. Dialect (and Dialect+Gender) is particularly informative for /ɑ/, /æ/, /ɛ/, and /u/, vowels with distinctive variants in at least one of the dialect regions from the NSP [see @Clopper2005 for a summary of variation in American English vowels across these dialect regions]. These results are consistent with what has been noted in the sociolinguistic literature [e.g., @Labov2006]: /ɑ/ is merged with /ɔ/ in some regions, /æ/, /ɛ/, and /ɑ/ participate in the Northern Cities Chain Shift, and /u/ is fronted in some regions [and in others only by female talkers; @Clopper2005].

```{r vowel-kl-by-category, dependson=c("grouping-and-colors"), fig.width=7, fig.height=3.45, fig.cap='Individual vowels vary substantially in the informativity of grouping variables about their cue distributions. Only normalized F1×F2 is shown to emphasize dialect effects. Large dots show the average over dialects (+genders), while the small dots show individual dialects (+genders) (see Figure {@fig:vowel-kl-by-dialect} for detailed breakdown of individual dialect effects).
The grey violins show the vowel-specific null distributions of the averages, estimated based on 100 datasets with randomly permuted dialect (+gender) labels, and stars show permutation test p value (proportion of random permutations with the same or larger KL divergence), with false discovery rate correction for multiple comparisons [@Benjamini1995].'} vowel_kl_by_dialect_vowel <- vowel_kl %>% filter(str_detect(cues, 'Lobanov'), str_detect(grouping, "Dialect")) %>% prettier_grouping() %>% group_by(cues, Vowel, grouping, group) %>% summarise(KL = mean(KL)) vowel_kl_perm_by_dialect_vowel <- bind_rows(vowel_kls_perm, vowel_kls_talker_perm) %>% filter(str_detect(cues, "Lobanov"), str_detect(grouping, "Dialect")) %>% prettier_grouping() %>% group_by(cues, Vowel, grouping, perm_iter) %>% summarise(KL=mean(KL, na.rm=TRUE)) vowel_kl_by_dialect_vowel_summary <- vowel_kl_by_dialect_vowel %>% summarise(KL = mean(KL)) %>% left_join(vowel_kl_perm_by_dialect_vowel, by = c("cues", "Vowel", "grouping"), suffix = c("", "_perm")) %>% group_by(cues, grouping, Vowel) %>% summarise(perm_p = (sum(KL <= KL_perm) + 0.5) / (length(KL) + 1), KL = mean(KL)) %>% mutate(perm_p_fdr = p.adjust(perm_p, method="fdr"), perm_p_lt = p_val_to_less_than(perm_p_fdr, cutoffs = c(0.05, 0.01)), perm_p_star = p_val_to_stars(perm_p_fdr)) vowel_kl_by_dialect_vowel %>% ggplot(aes(x=Vowel, y=KL, color=grouping)) + geom_point(alpha=0.2, position = position_jitter(w=0.2), show.legend=FALSE) + geom_point(stat='summary', fun.data=mean_cl_boot, size=2, ## position = position_dodge(w=0.5), show.legend=FALSE) + geom_text(data= vowel_kl_by_dialect_vowel_summary, aes(label = perm_p_star), color="black", nudge_y = 0.1) + labs(x = 'Vowel', y = 'KL Divergence (bits)') + ggtitle("Informativity by vowel", subtitle="Lobanov-normalized F1×F2") + theme(plot.title=element_text(hjust=0)) + geom_violin(data=vowel_kl_perm_by_dialect_vowel, scale="width", fill="black", color=NA, alpha=0.2) + facet_grid(.~grouping) + ylim(0, NA) + scale_color_grouping() ``` Figure @fig:vowel-kl-by-dialect shows the informativity by vowel and dialect individually. This shows that dialects do indeed vary in how informative they are, both overall (left) and by vowel (right). Some of this variability corresponds to known patterns of dialect variability. In particular, talkers from the North dialect region produce vowels---/æ/ and /ɑ/ in particular---with formant distributions that deviate markedly more from the marginal distributions (across all dialects) than any of the other dialects. Both of these vowels participate in the Northern Cities Shift, and in a sense are foundation of this shift, being at the root of the Northern Cities Shift's implicational hierarchy [@Labov2006; @Clopper2005]. `r label("implicational-hiearchy")` The Mid-Atlantic /ɑ/ is, like the Northern /ɑ/, non-merged with /ɔ/ [@Clopper2005] and hence deviates from the marginal /ɑ/ substantially. New England talkers produce a low-variance /u/ distribution with a lower mean F1 than other dialects, which may reflect a lack of /u/-fronting and is consistent with a conservative /u/ in New England [@Labov2006].[^rory] [^rory]: Thanks to Rory Turnbull for suggesting this interpretation. 
```{r vowel-kl-dialect-summaries, dependson=c("grouping-and-colors")} dialect_lob_kl_perm_vals <- vowel_kls_perm %>% filter(grouping == "Dialect", str_detect(cues, "Lobanov")) %>% pull(KL) vowel_dialect_combos_kl_summary <- vowel_kl %>% filter(grouping=='Dialect', str_detect(cues, "Lobanov")) %>% select(group, Vowel, KL) %>% mutate(perm_p = map_dbl(KL, ~ mean(.x <= dialect_lob_kl_perm_vals)), stars = p_val_to_stars(perm_p), perm_p_fdr = p.adjust(perm_p, "fdr"), stars_fdr = p_val_to_stars(perm_p_fdr)) %>% group_by(Vowel) %>% mutate(min_perm_p_fdr = min(perm_p_fdr)) dialect_mean_kl_perm <- vowel_kls_perm %>% filter(grouping == "Dialect", str_detect(cues, "Lobanov")) %>% group_by(cues, contrast, grouping, group, perm_iter) %>% summarize(KL_perm = mean(KL)) dialect_mean_kl_perm_p <- dialect_mean_kl_perm %>% left_join(vowel_kl %>% group_by(cues, contrast, grouping, group) %>% summarize(KL=mean(KL))) %>% summarize(perm_p = (sum(KL <= KL_perm)+1)/(length(KL)+1), KL = mean(KL)) %>% mutate(perm_p_fdr = p.adjust(perm_p, "fdr"), stars = daver::p_val_to_stars(perm_p_fdr)) ``` `r label("r1-svs")` The only particularly high divergence identified as significant by permutation test that does not correspond to known sociolinguistic variants is Mid-Atlantic /e/, which is slightly higher and fronter than the marginal distribution. There are also well-documented dialect effects that appear to be missing from these results. For instance, none of the individual vowels involved in the Southern Vowel Shift---/i, ɪ, e, ɛ, o, u/---diverge from the marginal distributions reliably. However, as Figure @fig:vowel-kl-by-dialect (left) shows, the entire vowel space of Southern speakers _does_ diverge from the marginal distributions, suggesting that even though the individual vowels do not differ dramatically from marginal, the combination of subtle differences is in fact reliable across talkers. Moreover, the individual vowels that diverge the most for Southern speakers are /ɛ, u, ɔ, o, e/, all of which (except /ɔ/, which likely reflects the lack of the caught-cot merger) are associated with the Southern Vowel Shift by @Clopper2005 and all of which are significant before correcting for multiple comparisons (except for /ɛ/, $`r filter(vowel_dialect_combos_kl_summary, group=="South", Vowel=="ɛ") %>% pull(perm_p) %>% p_val_to_less_than()`$). Also, the lack of reliable evidence for individual Southern Vowel Shifts is consistent with the results from @Clopper2005 using the same data: mean F1 and F2 for Southern speakers for these vowels were not found to consistently differ significantly from the other dialects or the overall means (although there were some combinations of gender and dialect that did yield significant differences). ```{r vowel-kl-by-dialect, dependson=c("vowel-kl-dialect-summaries"), fig.width=10, fig.height=4.5, fig.cap='Breaking down the overall informativity of dialect by individual dialects (left) and dialect-vowel combinations (right). Some dialects are more informative about Lobanov-normalized vowel distributions than random groupings of the same number of talkers (grey violins), but some are not (at least in the current sample of talkers). Likewise for individual vowels within dialects. Moreover, dialects be informative on average but not have any individual vowels that are informative alone (e.g., South), and vice-versa (e.g., Midland). 
Stars show $p$ values from permutation test (`*`: $p<0.05$, `**`: $p<0.01$, `***`: $p<0.001$) corrected for false-discovery rate across all dialects/dialect-vowel combinations [@Benjamini1995].'} p_dialect_vowel <- vowel_dialect_combos_kl_summary %>% ggplot(., aes(x=group, y=KL)) + geom_line(aes(group=Vowel, color=Vowel, alpha=min_perm_p_fdr < 0.05)) + geom_text(data = . %>% filter(perm_p_fdr < 0.05), aes(label = Vowel, color=Vowel), show.legend=FALSE, nudge_x = 0.28, nudge_y=0.05) + geom_text(aes(label = stars_fdr, color=Vowel), nudge_y=0.05, show.legend=FALSE) + rotate_x_axis_labs() + scale_alpha_manual(values=c(0.3, 1), guide=FALSE) + labs(x = NULL, y = NULL) + ggtitle("Informativity by dialect and vowel") + theme(plot.title=element_text(hjust=0)) + geom_violin(data = vowel_kls_perm %>% filter(grouping == "Dialect", str_detect(cues, "Lobanov")), aes(group=group), color=NA, fill="black", alpha=0.2) + lims(y=c(0,2)) + theme(axis.text.y = element_blank(), axis.line.y = element_blank(), axis.title.y= element_blank(), axis.ticks.y= element_blank()) + # c11_2 from notebook: # distinguishable_colors(11, lchoices = [60, 70], cchoices = [100], hchoices = linspace(0, 300, 20)) scale_color_manual(values = c("#FF0095", "#00C928", "#00B0FF", "#FF8500", "#A39200", "#00D2C7", "#7A78FF", "#FF1D3A", "#00B37B", "#00D2F7", "#FF4D00")) p_dialect <- ggplot(data = dialect_mean_kl_perm, aes(x=group, y=KL)) + geom_violin(data = dialect_mean_kl_perm, aes(group=group, y=KL_perm), color=NA, fill="black", alpha=0.2) + geom_line(data = dialect_mean_kl_perm_p, color="black", group=1) + geom_text(data = dialect_mean_kl_perm_p, aes(label=stars), color="black", nudge_y = 0.05) + lims(y=c(0,2)) + labs(x = NULL, y = 'KL divergence (bits)') + ggtitle("Informativity by dialect", subtitle="Lobanov-normalized F1×F2") + theme(plot.title=element_text(hjust=0)) + rotate_x_axis_labs() plot_grid(p_dialect, p_dialect_vowel, rel_widths = c(1, 1.1), axis="l", align="hv") ``` This asymmetry in informativity across both dialects and vowels raises the question of how listeners adapt to variation across categories and cue dimensions. All else being equal, a listener should be more confident in their prior beliefs about a category that varies less across talkers, and hence adapt less flexibly [@Kleinschmidt2015]. But it is not clear at what level listeners track variability for the purpose of determining how quickly to adapt. For instance, as we have seen, vowels overall vary substantially more across talkers than stop categories, but there are differences in how much individual vowels vary. It remains to be seen whether listeners adapt to all vowels with the same degree of flexibility, or are sensitive to these vowel-specific differences in cross-talker variability. ## Discussion `r label("discussion1")` The measure of informativity I have proposed here quantifies the _amount_ and _structure_ of talker variability using an information-theoretic measure of how much talker- or group-specific cue distributions diverge from the overall (marginal) distributions. This measure allows talker variability for different phonetic categories, and even different cues, to be compared directly. As a proof of concept, the results here quantify previous qualitative findings[^qualitative] that in American English there is an order of magnitude less talker variability in the realization of word-initial stop voicing than in vowels. 
Moreover, there are qualitative differences in the _structure_ of this variability: gender is no more informative about VOT distributions than random groupings of talkers of the same size, while gender-specific F1×F2 distributions are reliably more informative than random groupings.

Informativity also allows fine-grained investigation of dialect variation. The same measure can be applied at the level of individual phonetic categories (e.g., vowels, Figure @fig:vowel-kl-by-category), groups (e.g., a particular dialect), or even particular combinations of the two (as in Figure @fig:vowel-kl-by-dialect). This measure takes into account the entire distribution of cues, and so it is more comprehensive than standard statistical techniques like regression or ANOVA, which usually compare the _mean_ values of particular cues across groups or categories [for a comparable analysis of the NSP data, see @Clopper2005].

The usefulness of this measure does not come at the expense of grounding in first principles: it corresponds directly to the amount of information that a listener leaves on the table if they ignore a grouping variable (including talker identity) and treat all tokens of a phonetic category as generated from the same underlying distribution. Ideal listener models [@Clayards2008; @Norris2008; @Feldman2009] identify knowledge of these distributions as a fundamental constraint on accurate and efficient speech perception. Furthermore, the ideal adapter model [@Kleinschmidt2015] motivates the tracking of talker- or group-specific cue distributions as what allows listeners to generalize effectively from previous experience: if a grouping variable like talker identity or gender is not informative about cue distributions, then there is little possible benefit to tracking group-specific distributions. However, just because a grouping variable _is_ informative about cue distributions does not necessarily mean that tracking those group-specific distributions leads to any benefit for recognizing a talker's intended category. This motivates the notion of _utility_, investigated in Study 2.

[^qualitative]: I refer to previous evidence for more talker variability in vowels than stop voicing as "qualitative" because no attempt has been made to measure talker variability in a directly comparable way across the two systems, even though there have been quantitative measurements of talker variability in each system.

# Study 2: How _useful_ are socio-indexical groups for recognizing vowels and stop voicing?

`r label("r2-intuitive-utility")` The results of Study 1 show that socio-indexical variables like age, gender, dialect, and talker identity are informative about phonetic cue distributions. That is, the category-specific distributions of acoustic-phonetic cues are reliably different for different values of at least some socio-indexical variables. However, these differences in cue distributions do not necessarily correspond to differences in the ability to recover a talker's intended phonetic category. Even if there is some structure in talker variability for listeners to learn, that learning might not be useful for speech recognition. This motivates the notion of _utility_ that I develop and explore in Study 2. Where informativity concerns how well a listener could probabilistically predict the _cues_ themselves, utility measures how well a listener could use those cue distributions to infer a talker's intended phonetic _category_.
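
To make this distinction concrete before turning to the methods, consider a toy simulation (a sketch with invented parameters, not drawn from the corpora analyzed here): if one group of talkers shifts its entire VOT range by a constant offset, the grouping is informative about the cue distributions, but classifying voicing with marginal versus group-specific distributions yields nearly identical accuracy, because the voiced/voiceless separation dwarfs the offset.

```{r utility-toy-example, eval=FALSE, echo=TRUE}
# Toy example: a grouping can be informative about cue distributions without
# adding much utility for recognition. All parameters are invented.
library(tidyverse)
set.seed(1)

simulate_group <- function(n, offset) {
  tibble(voicing = rep(c("voiced", "voiceless"), each = n),
         vot = c(rnorm(n, mean = 10 + offset, sd = 10),
                 rnorm(n, mean = 60 + offset, sd = 15)))
}

# "Male" talkers produce all VOTs 10 ms longer than "female" talkers
toy <- bind_rows(simulate_group(500, offset = 0)  %>% mutate(gender = "f"),
                 simulate_group(500, offset = 10) %>% mutate(gender = "m"))

# Informative: gender-specific category means differ from the marginal means
toy %>% group_by(voicing, gender) %>% summarise(mean_vot = mean(vot))

# ...but low utility: a simple maximum-likelihood classifier is about equally
# accurate whether it uses marginal or gender-specific category distributions
classify_acc <- function(test, train) {
  params <- train %>% group_by(voicing) %>% summarise(m = mean(vot), s = sd(vot))
  lik <- sapply(seq_len(nrow(params)),
                function(i) dnorm(test$vot, params$m[i], params$s[i]))
  mean(params$voicing[max.col(lik)] == test$voicing)
}

males <- filter(toy, gender == "m")
classify_acc(males, toy)    # trained on marginal (all-talker) distributions
classify_acc(males, males)  # trained on gender-specific distributions
```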
A socio-indexical variable must be informative in order to be more useful than the overall (marginal) distributions, but the converse is not necessarily true. For example, if talkers vary in a way that does not lead the marginal distributions of different phonetic categories to overlap much more than the talker-specific distributions do, then the inferences that an ideal listener would draw from the marginal distributions are essentially the same as those drawn from the talker-specific distributions.

## Methods

The _utility_ of a socio-indexical grouping variable is defined in terms of how often an ideal listener would successfully recognize a talker's intended category, given cue distributions estimated from a particular group of talkers ($g$). Specifically, I use the posterior probability of the talker's intended category $c_\mathrm{intended}$ given the cue value $x$ that they actually produced.[^decision-rule] This, in turn, depends on the cue distributions produced by group $g$, as described by Bayes' rule:
$$ p(c=c_\mathrm{intended} | x, g) \propto p(x|c=c_\mathrm{intended}, g)p(c=c_\mathrm{intended}) $$
Bayes' rule can, with some algebra, be restated as an equality of odds:
$$ \frac{p(c=c_\mathrm{intended} | x, g)}{p(c \ne c_\mathrm{intended} | x, g)} = \frac{p(x|c=c_\mathrm{intended}, g)}{p(x|c \ne c_\mathrm{intended}, g)} \times \frac{p(c=c_\mathrm{intended})}{p(c \ne c_\mathrm{intended})} $$
Like the standard form of Bayes' rule, this has a straightforward interpretation: the _posterior odds_ of correctly recognizing $c=c_\mathrm{intended}$ are the _prior odds_ times the likelihood ratio, which expresses how much more likely it is that $x$ was generated by the true category $c_\mathrm{intended}$ than by all the other categories combined. If the likelihood ratio is greater than 1, then we have gained evidence in favor of the true category; if it is less than 1, we have gained evidence in favor of an erroneous category. This interpretation holds regardless of whether there is contextual information that favors one category over another, which would only change the prior odds. It is also not sensitive to the number of categories, which likewise manifests only in the prior odds. Moreover, if we take the logarithm of both sides, the prior log-odds and the log-likelihood ratio _add_ together to produce the posterior log-odds.[^additional-cues]

Thus, the log-likelihood ratio
$$ \log\left(\frac{p(x|c=c_\mathrm{intended}, g)}{p(x|c \ne c_\mathrm{intended}, g)}\right) $$
provides a measure of the information gained[^information] about what a talker is trying to say by interpreting the cues $x$ using the category-specific cue distributions of group $g$, over and above what the prior alone provides. It can be calculated from the posterior probability of the correct (talker's intended) category relative to chance:
\begin{align} \log\left(\frac{p(x|c=c_\mathrm{intended}, g)}{p(x|c \ne c_\mathrm{intended}, g)}\right) &= \log\left(\frac{p(c=c_\mathrm{intended} | x, g)}{p(c \ne c_\mathrm{intended} | x, g)}\right) - \log\left( \frac{p(c=c_\mathrm{intended})}{p(c \ne c_\mathrm{intended})} \right) \\ &= \log\left(\frac{\mathrm{accuracy}}{1-\mathrm{accuracy}}\right) - \log\left( \frac{\mathrm{chance}}{1-\mathrm{chance}} \right) \end{align}

[^decision-rule]: An ideal observer's actual _responses_ (and thus its accuracy) in, e.g., a phonetic classification task additionally depend on the decision rule (or loss function).
However, any reasonable decision rule will be constrained by the amount of evidence in favor of the talker's intended category, and so the posterior probability of that category is a reasonable proxy for the current purpose. Also, note that using a winner-take-all decision rule with likelihoods derived from normal distributions is equivalent to quadratic discriminant analysis, as used, for instance, by @Adank2004 in assessing the effectiveness of various vowel normalization techniques.

[^additional-cues]: This is true even in the presence of additional (independent) cues.

[^information]: This quantity is not exactly information in the information-theoretic sense, because it is not weighted by the probability of observing cue $x$ under the true category model.

By comparing the information gained from different groups' distributions, we can estimate the _utility_ of these different groupings. For instance, we can ask how much additional information is gained by knowing that a talker is male by comparing the information gain from cue distributions estimated from other male talkers ($g=\mathrm{male}$) to that from all male and female talkers together ($g=\mathrm{all}$). The same approach can also address changes in the _prior_ probability of a category based on socio-indexical variables (e.g., a higher or lower frequency of voiced stops in a particular dialect).

Talker-specific cue distributions ought to provide the most information about a talker's own productions, and the marginal cue distributions (over all talkers) the least. The difference between them, though, depends on the amount of talker variability. I expect other groupings to yield information gains that are somewhat less than those from talker-specific distributions, but more than those from marginal distributions. Where exactly between these extremes a grouping falls is a measure of how much utility there is in tracking group-specific cue distributions: if a listener gains just as much information about what a talker was trying to say by using cue distributions based on other talkers of the same gender, age, dialect, etc., then there is little need to learn talker-specific cue distributions. Where the informativity of a particular grouping (Study 1) measures how much there is to learn about group-specific distributions, the utility of the grouping (Study 2) measures how much benefit a listener would gain from doing that learning.

For vowels, I classified vowel categories directly. For voicing, the only cue `r label("r1-only-cue-vot")` available in this dataset is VOT, which does not (reliably) distinguish place of articulation. Thus, I classified voicing separately for each place of articulation, and then averaged the resulting accuracies.

### Assumptions

Utility measures the maximum improvement in the accuracy of speech perception that is possible under the specific set of assumptions made in the ideal observer model. One particularly important assumption that this method makes is that the listener _knows_ the socio-indexical group of a talker. I make this assumption for two reasons. First, in many cases listeners do, in fact, have a good deal of socio-indexical information about a talker. This may come from non-linguistic cues (or world knowledge), or even from other linguistic features that the talker produces [@KleinschmidtInPress2017]. Moreover, this assumption is not inherent in the method I propose, and it is possible to simultaneously infer the intended category _and_ the socio-indexical group.
In preliminary simulations, defining utility in this way has surprisingly little effect on the results, but it makes the simulations substantially more computationally demanding.

Second, and more importantly, I define utility assuming that the socio-indexical group of a talker is known because this provides an estimate of the _in-principle_ benefit of tracking group-specific phonetic cue distributions.[^startingpoint] This is a defining feature of rational analyses of cognition [for the value of such clearly defined, in-principle bounds on performance, see also @Massaro1990].

[^startingpoint]: This benefit applies when first encountering a novel talker from a socio-indexical group, *prior to further adaptation*. In the general discussion, I return to this point and to why the utility measure might *underestimate* the benefit of implicit knowledge about group-specific category distributions, since this knowledge likely serves as the starting point for talker-specific adaptation [@Kleinschmidt2015].

### Procedure

The utility of a grouping variable (e.g., gender) is calculated by first computing the utility of that variable for each individual talker, as follows.

First, a training data set is constructed. For the NSP data, this was done by sampling three other talkers from the same group (e.g., three other male talkers). This subsampling is done to avoid biases in accuracy from group size, since groups with fewer talkers have less stable estimates of their cue distributions, and lower accuracy on average [see @James2013, Section 2.2.2]. The most specific grouping in the NSP is Dialect+Gender, which has four talkers per group; because including the talker's own test data in the training set would also artificially increase accuracy [@James2013, Section 5.1], three talkers are used to form the training set. For the other datasets, all other talkers from the same group were used for the training set, since the VOT data is (approximately) balanced by age and gender, and the HN15 data only groups talkers by gender. Based on this training data set, category-specific distributions are estimated in the same way as in Study 1, using the unbiased estimators of the mean and (co-)variance of the tokens from each phonetic category.

Second, the overall accuracy for the test talker is determined in the following way: Bayes' rule is used to compute the posterior probability of the talker's intended category for each of the tokens produced by the test talker, using the likelihood functions of each category from the training data. The mean of these posterior probabilities is the talker's overall accuracy.

Third, and finally, the accuracy $p$ for each talker is converted to utility by transforming to log-odds $\log(p/(1-p))$ and subtracting the log-odds of responding correctly by guessing uniformly, which is $\log(1/(n-1))$ if there are $n$ response options. The overall utility of the grouping variable is the mean of these talker-specific utilities.

Because the training sets are sampled at random for all groupings except Dialect+Gender, the whole procedure is repeated 100 times and averaged at the level of talker-specific utility to obtain more reliable results. For talker as a grouping variable, six-fold cross-validation was used instead: each talker's tokens were divided into six roughly equal partitions (within category), and the accuracy for each partition's tokens was determined using the other five as training data.

Bootstrap resampling was used to estimate the reliability of these estimates.
1000 simulated populations of talkers were sampled with replacement, and the average utility for each grouping variable, and the differences between them, were re-computed each time. The reliability of differences between, for example, the utility of dialect and gender can be estimated in this way by looking at how frequently the resampled populations result in a difference in utility between dialect and gender with the same sign as the real sample of talkers. This is similar to a paired $t$-test but does not assume that talkers' utilities are normally distributed. Because the vowel corpus from @Heald2015 only includes summary statistics, I computed utility based on a sample 100 F1×F2 pairs per category for each talker. Differences in the composition of these corpora mean that care must be taken in making comparisons _across_ corpora. The group-size bias is especially problematic when looking for subtle effects of groupings with small sample sizes, like dialect or dialect+gender (which contain 8 and 4 talkers per group, respectively). The subsampling procedure results in changes in accuracy of only a few percentage points, but doesn't change the overall order of magnitude. Thus, gross, qualitative comparisons across corpora are still reasonable, even if fine-grained comparisons are not. ```{r classification-helpers, cache=TRUE} ## 1. Likelihood of each token under each vowel for each dialect model #' Compute posterior vowel category conditional on group #' #' Applies classify_vowels to test data for each group_model. #' #' @param data_test test data to calculate posteriors for #' @param group_models named list of group models, each of which is a named list #' of vowel models #' @return data frame with one row per data_test row x group x vowel model, with #' added columns group_model (name of group model), vowel_model (name of vowel #' model), lhood p(x | vowel_model, group_model), posterior (p(vowel_model | #' x, group_model)). compute_category_post_given_group <- function(data_test, group_models) { group_models %>% map(~ unlist_models(., 'category')) %>% map(~ classify(data_test, ., 'category')) %>% data_frame(group_model=names(.), x=.) %>% unnest(x) %>% rename(category_model=model) } #' Combine category | group posteriors with group posteriors #' #' @param group_category_posteriors category posterior probabilities conditional #' on group, in the form of a data frame with at least columns category_model, #' group_model, and posterior (e.g., output of #' compute_category_post_given_group) #' @param group_posterior marginal group posterior probabilities, in the form of #' a data frame with columns group_model and log_posterior (e.g., output of #' compute_group_marginal_posterior) #' @return a data frame with the joint posterior of category category and group, #' in posterior and log_posterior. #' compute_joint_category_group_post <- function(group_category_posteriors, group_posteriors) { group_posteriors %>% select(group_model, group_log_posterior=log_posterior) %>% inner_join(group_category_posteriors, by = 'group_model') %>% mutate(log_posterior = log(posterior) + group_log_posterior, posterior = exp(log_posterior)) } #' Compute joint indexical-linguistic posterior #' #' @param trained data frame with \code{models} and \code{data_test} (as #' produced by \code{\link{train_models_indexical_with_holdout}}). 
#' @param obs_vars quoted names of columns in test data that together indentify #' a single observation (e.g., \code{c('Vowel', 'Token')}) #' @return a data frame with one observation per combination of group (e.g., #' Dialect), category (e.g. "ae"), and row in the ORIGINAL, un-nested data #' set, with new columns \code{group_model}, \code{category_model}, #' \code{lhood}, \code{posterior}, and \code{log_posterior}. Posterior #' probabilities sum to 1 within each cross-validation fold (e.g., Talker) + #' observation (e.g., Vowel+Token) combination, over all values of category #' and group. #' trained_to_joint_post <- function(trained, obs_vars) { trained %>% mutate(conditional_posteriors = map2(data_test, models, compute_category_post_given_group), group_posteriors = map(conditional_posteriors, . %>% group_by_(.dots=obs_vars) %>% mutate(log_lhood = log(lhood)) %>% marginalize_log('log_lhood', 'group_model') %>% ungroup() %>% aggregate_log_lhood('log_lhood', 'group_model') %>% normalize_log_probability('log_lhood')), joint_posteriors = map2(conditional_posteriors, group_posteriors, compute_joint_category_group_post)) %>% unnest(joint_posteriors) } #' @param d data frame #' @param holdout Column defining cross validation folds #' @param ... additional arguments passed to \code{\link{train_models}}. classify_by_talker_cv <- function(d, holdout='Token', category='Vowel', ...) { train <- partial(train_models, grouping=category, ...) d %>% phondisttools::train_test_split(holdout=holdout) %>% mutate(models_trained = map(data_train, . %>% group_by(Talker) %>% train()), models_tested = map2(data_test, models_trained, classify, category=category)) %>% unnest(models_tested) %>% mutate(grouping = 'Talker', group_is = 'Known', group = Talker) %>% rename(category_model = model) } ``` ```{r vowel-classification-models-group-known, cache=TRUE, dependson=c('vowel-data', 'classification-helpers')} ## we need posterior prob of true category to be able to compute the likelihood ## ratio metric so collect it here. acc_method <- "posterior" min_talker_per_group <- function(d) { d %>% do(n = length(unique(.$Talker))) %>% select_('n') %>% unlist() %>% min() } ## classify and get accuracy ## don't care about groups within train/test split (already restricted to same ## group) so we can just use train_models and classify directly train_test_acc <- function(data_train, data_test, category, ...) { data_train %>% train_models(grouping=category, cues=c("F1", "F2")) %>% rename_(category=category) %>% classify(data_test, ., 'category') %>% rename_(category_model = 'model') %>% get_accuracy(category_col = category, method=acc_method) } subsample_if_needed <- function(d, n_reps, holdout) { assert_that(nrow(d) == 1) if (d$group_size > d$subsample_size+1) { map(seq_len(n_reps), ~ mutate(d, data_train = map2(data_train, subsample_size, sample_n_groups, group=holdout), resamp_fold = .x)) %>% lift(bind_rows)() } else { mutate(d, resamp_fold = 1L) } } cluster <- create_cluster() %>% cluster_library(c("tidyverse", "phondisttools", "assertthat")) %>% cluster_copy(train_test_acc) %>% cluster_copy(subsample_if_needed) %>% cluster_copy(acc_method) set.seed(100) # randomized stuff happends before partition... 
subsample_sizes <- c(3) vowel_acc_same_group_rep <- vowel_data_grouped %>% mutate(group_size = map_dbl(data, min_talker_per_group)) %>% right_join(cross_d(list(group_size = unique(.$group_size), subsample_size = subsample_sizes)) %>% filter(group_size > subsample_size)) %>% # separate out data by groups prior to doing train/test splits unnest(map2(data, grouping, ~ nest(.x) %>% rename_(group=.y))) %>% unnest(map(data, train_test_split, holdout="Talker")) %>% ungroup() %>% mutate(n_ = 1:n()) %>% # can't use rowwise() because it makes do extract list elements partition(cluster=cluster) %>% group_by(n_) %>% # downsample training sets where necessary do(subsample_if_needed(., n_reps=100, holdout="Talker")) %>% mutate(acc = map2(data_train, data_test, train_test_acc, category="Vowel")) %>% collect() %>% unnest(acc) ``` ```{r vowel-classification, cache=TRUE, dependson=c('classification-helpers')} cluster <- cluster_copy(cluster, classify_by_talker_cv) vowel_talker_class <- vowel_data %>% partition(cluster=cluster) %>% mutate(acc = map(data, classify_by_talker_cv, holdout='Token', category="Vowel",cues=c("F1", "F2"))) %>% collect() %>% unnest(acc) %>% mutate(group_is = 'Known') ``` ```{r heald-vowel-classifier, cache=TRUE, dependson=c("classification-helpers", "heald-data")} heald_models n_heald_token_samples <- 100 heald_samples <- heald_models %>% filter(grouping == "Talker") %>% pull(models) %>% first() %>% unnest(model %>% map(r_model, n=n_heald_token_samples) %>% map(as_data_frame)) %>% mutate(Talker = group, Gender = ifelse(Talker %in% male_speakers, "m", "f"), Marginal = "all", Token = 1:n()) heald_vowel_class <- heald_models %>% mutate(data = grouping %>% map(~ mutate_(heald_samples, group=.x) %>% group_by(group) %>% nest()), models = models %>% map(group_by, group) %>% map(nest, .key="model"), ## map(~ mutate(.x, group_models=map(group_models, list_models, names_col="Vowel"))) %>% classified = map2(models, data, left_join) %>% map(~ map2(.x$data, .x$model, classify, category="Vowel")) %>% map(lift(bind_rows)) ) %>% unnest(classified) ``` ```{r vot-classification-models, cache=TRUE, dependson=c('vot-data', 'classification-helpers')} set.seed(101) vot_models <- vot_by_place_grouped %>% filter(grouping != 'Talker') %>% mutate(trained = map2(data, grouping, train_models_indexical_with_holdout, category = 'voicing', cues = 'vot'), joint_posteriors = map(trained, trained_to_joint_post, obs_vars = c('voicing', 'Token'))) ``` ```{r vot-classification, cache=TRUE, dependson=c('vot-classification-models', 'classification-helpers')} vot_joint_class <- vot_models %>% unnest(map2(joint_posteriors, grouping, ~ rename_(.x, group=.y))) %>% group_by(place, cues, grouping) vot_marginal_class <- vot_joint_class %>% group_by(Talker, voicing, Token, group, category_model, add=TRUE) %>% marginalize_log('log_posterior') %>% normalize_log_probability('log_posterior') %>% mutate(group_is = 'Inferred') vot_true_group_class <- vot_joint_class %>% filter(group == group_model) %>% group_by(Talker, voicing, Token, add=TRUE) %>% normalize_log_probability('log_posterior') %>% mutate(group_is = 'Known') vot_talker_class <- vot_by_place %>% unnest(map(data, . 
%>% group_by(Talker, voicing) %>% mutate(split=ntile(runif(length(Talker)), 6)) %>% classify_by_talker_cv(holdout='split', category='voicing', cues='vot') ) ) vot_class <- bind_rows(vot_marginal_class, vot_true_group_class, vot_talker_class) ``` ```{r check-classification, results='hide', cache=TRUE, dependson=c('vot-classification')} vot_class %>% group_by(cues, place, group_is, grouping, Talker, voicing, Token) %>% summarise(n_choice = sum(posterior_choice), sum_post = sum(posterior)) %$% assert_that(all(n_choice == 1), all.equal(sum_post, rep(1, length(sum_post)))) ``` ```{r classification-accuracy} chance_acc <- tribble( ~contrast, ~chance_acc, "Vowels (NSP)", 1/11, "Vowels (HN15)", 1/7, "Stop voicing", 1/2 ) vot_accuracy <- vot_class %>% get_accuracy('voicing', method=acc_method) %>% mutate(accuracy = as.double(accuracy), contrast = "Stop voicing") vowel_talker_acc <- vowel_talker_class %>% get_accuracy('Vowel', method=acc_method) %>% mutate(accuracy = as.double(accuracy)) vowel_accuracy <- vowel_acc_same_group_rep %>% filter(subsample_size==3) %>% group_by(cues, contrast, grouping, subsample_size, group, Talker, Vowel, Token) %>% summarise(accuracy = mean(accuracy)) %>% bind_rows(mutate(vowel_talker_acc, accuracy = as.double(accuracy))) %>% mutate(group_is = 'Known') heald_vowel_acc <- heald_vowel_class %>% rename(category_model = model) %>% get_accuracy("Vowel", acc_method) %>% mutate(accuracy = as.double(accuracy), group_is = "Known") logodds <- function(x) log(x) - log(1-x) accuracy <- vowel_accuracy %>% select(-Age) %>% bind_rows(vot_accuracy, heald_vowel_acc) %>% left_join(chance_acc) %>% ungroup() %>% prettier_grouping() accuracy_by_talker <- accuracy %>% filter(group_is == "Known") %>% group_by(contrast, cues, grouping, group, group_is, chance_acc, Talker) %>% summarise(accuracy = mean(accuracy), n = n()) %>% mutate(accuracy_logodds = logodds(accuracy) - logodds(chance_acc)) ## summary for each contrast/cues/grouping level, bootstrapped by talker accuracy_summary <- accuracy_by_talker %>% group_by(contrast, cues, grouping, group_is) %>% do(daver::boot_ci(.$accuracy, function(d,i) logodds(mean(d[i])))) %>% left_join(chance_acc) %>% rename(accuracy = observed, accuracy_low = ci_lo, accuracy_high = ci_high) %>% mutate_at(vars(starts_with("accuracy")), funs(.-logodds(chance_acc))) accuracy_summary_perc <- accuracy_by_talker %>% group_by(contrast, cues, grouping, group_is) %>% do(daver::boot_ci(.$accuracy, function(d,i) mean(d[i]))) %>% left_join(chance_acc) %>% rename(accuracy = observed, accuracy_low = ci_lo, accuracy_high = ci_high) ``` ```{r talker-advantage-acc, eval=FALSE} ## Talker advantage ## (TODO: incorporate) ## TODO: fix me talker_advantage_acc <- accuracy_by_talker %>% ## group_by(cues, contrast, grouping, Talker) %>% ## summarise(accuracy=mean(accuracy)) %>% rename(Talker_=Talker) %>% group_by(cues, contrast, grouping) %>% spread(grouping, accuracy) %>% gather(comparison, accuracy, -cues, -contrast, -Talker, -Talker_) %>% filter(!is.na(accuracy)) %>% mutate(talker_advantage = Talker - accuracy) %>% group_by(contrast, cues, comparison) %>% do({ boot_ci(.$talker_advantage, function(d,i) mean(d[i], na.rm=TRUE), h0=0) }) %>% filter(is.finite(observed)) ``` ```{r acc-pairwiwse-boot, cache=TRUE, dependson="classification-accuracy"} ## compute all pairwise differences in (log-odds) accuracy accuracy_by_talker_pairwise <- accuracy_by_talker %>% group_by(contrast, cues, grouping) %>% summarise() %>% nest() %>% unnest( # generate pairs of groupings data %>% map(pull, grouping) 
%>% map( ~ cross_df(list(to=., from=.))) %>% map(filter, as.numeric(from) < as.numeric(to))) %>% # join in data for from and to left_join(accuracy_by_talker, by=c(from="grouping", "cues", "contrast")) %>% left_join(accuracy_by_talker, by=c("contrast", "cues", "group_is", "chance_acc", "Talker", to="grouping"), suffix=c("_from", "_to")) accuracy_by_vowel_talker_pairwise <- accuracy_by_talker %>% group_by(contrast, cues, grouping) %>% summarise() %>% nest() %>% unnest( # generate pairs of groupings data %>% map(pull, grouping) %>% map( ~ cross_df(list(to=., from=.))) %>% map(filter, as.numeric(from) < as.numeric(to))) %>% # join in data for from and to left_join(accuracy_by_talker, by=c(from="grouping", "cues", "contrast")) %>% left_join(accuracy_by_talker, by=c("contrast", "cues", "group_is", "chance_acc", "Talker", to="grouping"), suffix=c("_from", "_to")) accuracy_diffs_boot <- accuracy_by_talker_pairwise %>% group_by(contrast, cues, from, to, group_is) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) ``` ## Results {#sec:recog-results} First, I report and discuss the overall utility of the different grouping variables for the stop voicing database and the two vowel databases I used. Second, I discuss the effect of vowel normalization on utility. Third and finally, I examine how utility varies across dialects and individual vowels. ```{r overall-accuracy-group-known, dependson=c("grouping-and-colors"), fig.width=9.8, fig.height=6.4, fig.cap='Average information gain in log-odds relative to chance (top) measures the utility of each grouping variable. Bottom shows posterior probability of correct category for comparison. Small points show individual talkers. Large points and lines show mean and bootstrapped 95% CIs over talkers (see text for details).'} # calculate layout of bars and stars min_offset <- 0.1 nudge <- 0.1 maxes <- accuracy_by_talker %>% group_by(contrast, cues, grouping) %>% summarise(max_acc = max(accuracy_logodds)) accuracy_diffs_boot %>% filter(boot_p < 0.05) %>% left_join(maxes, by=c("contrast", "cues", from="grouping")) %>% left_join(maxes, by=c("contrast", "cues", to="grouping")) %>% mutate(max_acc = max(max_acc.x, max_acc.y), max_acc.x = NULL, max_acc.y = NULL) %>% ## sort by the max of the underlying data points within each facet group_by(contrast, cues) %>% arrange(max_acc, .by_group=TRUE) %>% ## add a slight offset mutate(offset = max_acc - lag(max_acc, default=0), adj_max = cumsum(ifelse(offset > min_offset, offset, min_offset))+nudge) -> acc_diff_boot_bars p_percentage <- ggplot(accuracy_by_talker %>% cues_plus_contrast(), aes(x=grouping, y=accuracy, color=grouping)) + geom_quasirandom(alpha=0.2) + geom_pointrange(data=accuracy_summary_perc %>% cues_plus_contrast(), aes(ymin=accuracy_low, ymax=accuracy_high)) + facet_grid(.~cues_contrast, scales="free_x", space="free_x") + labs(x="Grouping", y="Accuracy\n(post.
prob.)") + rotate_x_axis_labs() + theme(strip.background = element_blank(), strip.text=element_blank()) + background_grid(major="xy") + scale_color_grouping() p_logodds <- accuracy_summary %>% filter(group_is == "Known") %>% cues_plus_contrast() %>% ggplot() + geom_quasirandom(data = accuracy_by_talker %>% cues_plus_contrast(), aes(x = grouping, y=accuracy_logodds, color = grouping), alpha=0.2)+ geom_pointrange(aes(x = grouping, y=accuracy, ymin=accuracy_low, ymax=accuracy_high, color = grouping), stat='identity')+ geom_segment(data=acc_diff_boot_bars %>% cues_plus_contrast(), aes(x=from, xend=to, y=adj_max, yend=adj_max, alpha=boot_p))+ facet_grid(. ~ cues_contrast, scales='free_x', space='free_x') + scale_alpha(range=c(1, 0), limits=c(0, 0.05)) + labs(y = 'Information gain over random guessing\n(log-odds relative to chance)', alpha = "Bootstrapped\np-value", color = "Grouping\nvariable") + lims(y=c(0,NA)) + theme(plot.title=element_text(hjust=0), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + background_grid(major="xy") + scale_color_grouping() plot_grid(p_logodds + theme(legend.position="none"), p_percentage + theme(legend.position="none"), align='v', ncol=1, rel_heights=c(1.6,1)) %>% plot_grid(., get_legend(p_logodds), rel_widths=c(0.8, 0.2)) ``` Utility can be measured with respect to a number of baselines. First, by measuring information gain relative to chance performance (random guessing), we get a measure of the absolute utility of a particular socio-indexical grouping. This measure is plotted in Figure @fig:overall-accuracy-group-known. All grouping variables---even the marginal grouping which considers all talkers together---provide _some_ information gain over random guessing, between 2 and 4 log-odds. Moreover, marginal distributions for vowels (with un-normalized F1×F2) and stop voicing show similar amounts of information gain over random guessing, despite different numbers of categories and cues and very different levels of overall accuracy (Figure @fig:overall-accuracy-group-known, bottom panel). This suggests that information gain could be a useful metric for utility across different phonetic categories and cues. Second, by comparing information gain between different grouping variables, we get a measure of _relative_ utility, or how much additional information a listener would gain about the talker's intended category by tracking (and using) these distributions. As expected, within each contrast/cue combination, the marginal cue distributions (from all talkers) provide the least information gain, while talker-specific distributions provide the most. 
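To make this measure concrete: information gain is just the difference in log-odds between observed accuracy and chance accuracy. The sketch below (purely illustrative, with made-up numbers; not part of the analysis pipeline) shows the calculation using the same `logodds` helper defined in the analysis code above.

```{r info-gain-illustration, eval=FALSE, echo=TRUE}
## Illustration only: information gain over random guessing is the difference
## between accuracy and chance accuracy on the log-odds scale.
logodds <- function(x) log(x) - log(1 - x)

## e.g., a hypothetical talker whose vowels are recognized correctly 90% of
## the time on an 11-way contrast, where chance accuracy is 1/11 (about 9%):
acc    <- 0.90
chance <- 1 / 11
logodds(acc) - logodds(chance)  ## about 4.5 log-odds of information gain
```

Differences in this quantity between grouping variables are exactly the relative-utility comparisons reported in the following paragraphs.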
```{r info-gain-summaries} format_advantage <- function(d, ci_descrip='95% CI', p=TRUE, paren=TRUE) { adv_string <- sprintf('%.2f', d$observed) ci_string <- sprintf('%s [%.2f--%.2f]', ci_descrip, d$ci_lo, d$ci_high) p_string <- paste(',', daver::p_val_to_less_than(d$boot_p)) if (paren) paste0(adv_string, ' (', ci_string, ifelse(p, p_string, ''), ')') else paste0(adv_string, ', ', ci_string, ifelse(p, p_string, '')) } hz_marg_talker <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Hz)", from == "Marginal", to == "Talker") lob_marg_talker <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Lobanov)", from == "Marginal", to == "Talker") hz_marg_gender <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Hz)", from == "Marginal", to == "Gender") vot_marg_talker <- accuracy_diffs_boot %>% filter(cues == "VOT", from == "Marginal", to == "Talker") hz_marg_dialect <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Hz)", from == "Marginal", to == "Dialect") lob_marg_dialect <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Lobanov)", from == "Marginal", to == "Dialect") ``` Despite similar levels of utility for marginal distributions, vowels and stop voicing show very different levels of utility for group- or talker-specific distributions. For voicing (VOT), there is minimal---if any---additional benefit to using cue distributions from more specific groupings; only talker-specific distributions provide any additional information gain over marginal, and this gain is small (log-odds of `r format_advantage(vot_marg_talker, p=FALSE, paren=FALSE)`). At the other extreme, for vowels, using talker-specific F1×F2 distributions increases utility over marginal distributions by log-odds of `r format_advantage(hz_marg_talker[2,], p=FALSE)` for the NSP data and `r format_advantage(hz_marg_talker[1,], p=FALSE)` for the HN15 data. Even less-specific groupings like gender still have reliable additional utility over marginal distributions for vowels (NSP: log-odds of `r format_advantage(hz_marg_gender[1,], p=FALSE, paren=FALSE)`; HN15: log-odds of `r format_advantage(hz_marg_gender[2,], p=FALSE, paren=FALSE)`). ### Normalization of vowel formants The results of study 1 showed that Lobanov normalization makes talker-specific formant distributions less informative, relative to marginal. Thus, we might expect that there will be lower _utility_ for talker-specific distributions as well. However, as Figure @fig:overall-accuracy-group-known shows, the utility of talker-specific distributions _per se_ is not lower for Lobanov vs. raw F1×F2. Nevertheless, the additional utility of talker-specific over marginal distributions goes down because the baseline utility of marginal distributions goes up. Lobanov normalization removes much of the across-talker variability, leading to less overlap between the marginal distributions for individual vowels, less confusion between categories, and higher accuracy. But for individual talkers considered alone, linear transformations like Lobanov normalization have no effect, since they leave the relative positions and sizes of the category distributions unchanged. Hence, the utility of talker-specific distributions is exactly the same for raw and Lobanov-normalized F1×F2.
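The invariance of talker-specific classification under normalization can be illustrated with a small simulation (illustrative only, not part of the analysis pipeline; all values are invented). Applying the same linear transformation (here, z-scoring by a single talker's overall formant mean and SD, as in Lobanov normalization) to all of that talker's tokens before re-fitting Gaussian category models leaves the category posteriors unchanged, because the transformation cancels out of the likelihood ratio.

```{r lobanov-invariance-sketch, eval=FALSE, echo=TRUE}
## Sketch: per-talker linear normalization leaves talker-specific
## classification untouched, since Gaussian category posteriors are invariant
## under an affine transformation applied to all of a talker's tokens.
library(mvtnorm)
set.seed(1)

## fake F1xF2 tokens from one hypothetical talker, two vowel categories
vowel_a <- rmvnorm(50, mean = c(700, 1200), sigma = diag(c(60, 90)^2))
vowel_i <- rmvnorm(50, mean = c(350, 2300), sigma = diag(c(40, 120)^2))
tokens  <- rbind(vowel_a, vowel_i)

## Lobanov-style normalization: z-score each formant by this talker's
## overall mean and SD, ignoring category
normed <- scale(tokens)

## posterior probability of /a/ under Gaussian models fit to each category
posterior_a <- function(x, a, i) {
  lik_a <- dmvnorm(x, colMeans(a), cov(a))
  lik_i <- dmvnorm(x, colMeans(i), cov(i))
  lik_a / (lik_a + lik_i)
}

post_raw    <- posterior_a(tokens, vowel_a, vowel_i)
post_normed <- posterior_a(normed, normed[1:50, ], normed[51:100, ])
all.equal(post_raw, post_normed)  ## TRUE, up to numerical precision
```

The same cancellation is why the marginal (all-talker) distributions, which pool over talkers with different means and scales, _do_ change under normalization while the talker-specific ones do not.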
While the reduction in additional utility for talker-specific distributions is predictable based on the lower informativity (study 1), the _extent_ of this reduction is surprising: using talker-specific distributions of raw F1×F2 Hz provides additional information gain of `r format_advantage(hz_marg_talker[2,], p=FALSE, paren=TRUE)`, which drops to `r format_advantage(lob_marg_talker, p=FALSE, paren=TRUE)` after Lobanov normalization. This is comparable to the additional utility of talker-specific VOT distributions (`r format_advantage(vot_marg_talker, p=FALSE, paren=FALSE)`). That is, after normalization to remove overall shifts in F1×F2, the consequences of talker variability in vowel and stop voicing distributions for _speech recognition_ may actually be more comparable than suggested by the informativity measured in study 1. As with informativity, Lobanov normalization also reveals additional _structure_ in that talker variability. For raw F1×F2, dialect provides only weakly reliable additional utility over marginal distributions (log-odds of `r format_advantage(hz_marg_dialect, p=FALSE, paren=FALSE)`). For Lobanov-normalized F1×F2, the additional utility of dialect is both larger and more reliable (log-odds of `r format_advantage(lob_marg_dialect, p=FALSE, paren=FALSE)`). ### Dialect ```{r dialect-diffs, cache=TRUE, dependson=c("grouping-and-colors"), fig.width=7.2, fig.height=5.2, fig.cap="The advantage of knowing a talker's dialect varies by dialect. Knowing that a talker comes from the North dialect region provides a consistent benefit, regardless of cues (Hz or Lobanov-normalized) or baseline (marginal or gender). Otherwise, dialect does not provide consistent information gain except when using Lobanov-normalized cue values, and even then it varies by dialect.
Each point shows one talker, the error bars bootstrapped 95% CIs by talker, and the stars bootstrapped $p$-values adjusted for false discovery rate [@Benjamini1995]."} dialect_vs <- tribble( ~from, ~to, "Marginal", "Dialect", "Gender", "Dialect+Gender" ) %>% prettier_grouping(from) %>% prettier_grouping(to) talker_dialects <- nsp_vows %>% group_by(Talker, Dialect) %>% summarise() dialect_advantage_diffs_boot <- accuracy_by_talker_pairwise %>% filter(contrast == "Vowels (NSP)") %>% inner_join(dialect_vs) %>% left_join(talker_dialects) %>% group_by(contrast, cues, from, to, Dialect) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) %>% mutate(boot_p_fdr = p.adjust(boot_p, "fdr"), boot_p_stars = daver::p_val_to_stars(boot_p_fdr), from_to = paste(to, "vs.", from)) accuracy_by_talker_pairwise %>% inner_join(dialect_vs) %>% inner_join(talker_dialects) %>% mutate(from_to = paste(to, "vs.", from)) %>% ggplot(aes(x=Dialect)) + geom_hline(yintercept=0, color="gray50") + geom_quasirandom(aes(y=logodds(accuracy_to)-logodds(accuracy_from), color=to), alpha=0.2) + geom_pointrange(data=dialect_advantage_diffs_boot, aes(y=observed, ymin=ci_lo, ymax=ci_high, color=to)) + geom_text(data = dialect_advantage_diffs_boot, aes(y=ci_high, label=boot_p_stars), nudge_y=0.05) + facet_grid(contrast+cues~from_to) + labs(x="", y="Additional information gain\n(log-odds ratio)", color="Advantage\nover") + ggtitle("Additional information gain from knowing dialect", subtitle = "By dialect") + rotate_x_axis_labs() + theme(plot.title=element_text(hjust=0)) + scale_color_grouping(labels=dialect_vs$from) ``` ```{r vowel-dialect-diffs, cache=TRUE, dependson=c("grouping-and-colors"), fig.width=7.2, fig.height=4.1, fig.cap="The information gained from knowing a talker's dialect also varies by the particular vowel. Vowels undergoing active sound change in multiple dialects of American English (like /æ/, /ɛ/, /ɑ/, and /u/) tend to benefit more from knowing dialect. (Single-talker estimates of information gain are not shown because the small sample size $n\\leq5$ for individual talkers makes them numerically unstable, while the overall log-odds ratios calculated from the mean accuracies are more stable.) CIs are 95% bootstrapped CIs for the mean over talkers.
All $p>0.01$ (corrected for false discovery rate), and whether an individual $p$ value is less than or greater than $p=0.05$ is sensitive to the bootstrap and subsampling randomization, so stars are not shown."} set.seed(1100) accuracy_by_talker_vowel <- accuracy %>% group_by(contrast, cues, grouping, group, Vowel, Talker) %>% summarise(accuracy = mean(accuracy), n = n()) accuracy_by_vowel_talker_pairwise <- accuracy_by_talker_vowel %>% filter(contrast == "Vowels (NSP)") %>% group_by(cues, grouping) %>% summarise() %>% nest() %>% unnest( # generate pairs of groupings data %>% map(pull, grouping) %>% map( ~ cross_df(list(to=., from=.))) %>% map(filter, as.numeric(from) < as.numeric(to))) %>% inner_join(dialect_vs) %>% # join in data for from and to inner_join(accuracy_by_talker_vowel, by=c(from="grouping", "cues")) %>% inner_join(accuracy_by_talker_vowel, by=c("contrast", "cues", "Talker", "Vowel", to="grouping"), suffix=c("_from", "_to")) dialect_advantage_by_vowel_diffs_boot <- accuracy_by_vowel_talker_pairwise %>% inner_join(talker_dialects) %>% group_by(cues, to, from, Vowel) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) %>% ungroup() %>% mutate(boot_p_fdr = p.adjust(boot_p, "fdr"), boot_p_stars = daver::p_val_to_stars(boot_p_fdr, cutoffs = c(1, .1, .05, .01), stars = c("", "·", "*", "**")), Vowel=factor(Vowel, levels=levels(nsp_vows$Vowel))) dialect_advantage_by_vowel_diffs_boot %>% mutate(from_to = paste(to, "vs.", from)) %>% ggplot(aes(x=Vowel, color=to)) + geom_hline(yintercept=0, color="gray50") + ## very noisy (small n): ## geom_quasirandom(data = accuracy_by_vowel_talker_pairwise %>% ## mutate(from_to = paste(to, "vs.", from)), ## aes(y=logodds(accuracy_to)-logodds(accuracy_from)), alpha=0.1) + geom_pointrange(aes(y=observed, ymin=ci_lo, ymax=ci_high)) + geom_label(aes(y=observed, label=Vowel), label.padding=unit(0.1, "lines"), show.legend=FALSE) + facet_grid(cues~from_to) + ## geom_text(aes(y=ci_high, label=boot_p_stars), nudge_y=0.1, color="black") + labs(x="", y="Additional information gain\n(log-odds ratio)", color="Advantage\nover") + ggtitle("Additional information gain from knowing dialect", subtitle = "By vowel") + theme(plot.title=element_text(hjust=0)) + scale_color_grouping(labels=dialect_vs$from) ``` ```{r dialect_and_vowel, eval=FALSE} ## The goal is to make a figure somewhat like the dialect contour plots for KL ## above.
## but it's kind of a lot to take in, and not particularly informative
accuracy_by_vowel_talker_pairwise %>% inner_join(talker_dialects) %>% group_by(Vowel, Dialect, cues, contrast, to, from) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) %>% group_by(cues, contrast) %>% mutate(boot_p_fdr = p.adjust(boot_p, "fdr")) %>% mutate(Vowel=factor(Vowel, levels=levels(nsp_vows$Vowel))) %>% { ggplot(., aes(x=Dialect, y=observed, color=Vowel, group=Vowel)) + geom_line() + geom_point(data=subset(., boot_p_fdr<0.05)) + facet_grid(from ~cues+contrast ) + rotate_x_axis_labs() } ``` ```{r kl-vs-acc, eval=FALSE} accuracy_by_vowel_talker_pairwise %>% inner_join(talker_dialects) %>% group_by(cues, to, from, contrast, Vowel, Dialect) %>% summarise(accuracy_logodds = mean(logodds(accuracy_to) - logodds(accuracy_from))) %>% left_join(vowel_kl, by=c("contrast", "cues", Dialect="group", "Vowel")) %>% ggplot(aes(x=KL, y=accuracy_logodds)) + geom_text(aes(color=Dialect, label=Vowel)) + stat_smooth(method="lm") + stat_smooth(method="lm", aes(color=Dialect), alpha=0) + facet_grid(from ~ cues, scales="free_y") ``` Study 1 found that the informativity of dialect about formant distributions depended on both the dialect and the specific vowel. Similarly, the _utility_ of using dialect-specific cue distributions (relative to marginal or gender-specific) varies by dialect (Figure @fig:dialect-diffs) and vowel (Figure @fig:vowel-dialect-diffs). Talkers from the North dialect region have a consistent additional information gain from using dialect- or dialect+gender-specific cue distributions, regardless of normalization. This likely reflects the fact that under the Northern Cities Shift the /æ/ vowel is raised, making it highly overlapping with the /ɛ/ from talkers of other dialects and leading to reduced accuracy. With un-normalized F1×F2, no other dialects show a consistent benefit from dialect-specific cue distributions (either alone or with dialect+gender). However, with Lobanov-normalized F1×F2, using dialect-specific distributions _does_ lead to better vowel recognition (on the order of log odds of 0.4) for many---but not all---dialects, especially when additionally considering gender. Somewhat surprisingly, even with normalized F1×F2, there is no consistent information gain from using dialect-specific cue distributions for Southern speakers. @Clopper2005 found that these same speakers demonstrated many of the vowel shifts that are characteristic of this dialect region [@Labov2006], and the results of study 1 (Figure @fig:vowel-kl-by-dialect, left) show that on average, Southern speakers' distributions do diverge from the marginal. But study 1 _also_ found that no _individual_ Southern vowel distributions diverged enough from the marginal to be significantly more informative than a random grouping of talkers (Figure @fig:vowel-kl-by-dialect, right), at least after correcting for multiple comparisons. As with individual dialects, individual vowels vary in the extent to which conditioning on dialect provides additional information. Figure @fig:vowel-dialect-diffs shows that for most vowels, there is little evidence that conditioning on dialect consistently provides additional information gain across dialects.
There is weak evidence that a few vowels may get a reliable boost with normalized formants, like /æ/, /ɛ/, and /ɑ/, all of which are undergoing sound change in at least one dialect, and also show high informativity across dialects (Figure @fig:vowel-kl-by-dialect).[^weak-pvals] `r label("r1-pvals")` [^weak-pvals]: I do not report on significance of individual vowel effects here because they are estimated using a randomized procedure---both at the level of subsampling talkers to estimate the accuracy, and at the level of bootstrapping to estimate statistical significance---and all $p>0.01$ after correcting for false discovery rate. I found that, even with a reasonably large number of subsampling and bootstrap iterations (100 and 1000, respectively), individual effects that are weakly significant in one run ($0.05 > p > 0.01$) are often only "marginally significant" ($0.1 > p > 0.05$) in another. Properly assessing the reliability of these effects is best left to future experiments designed to detect them. ## Discussion `r label("discussion2")` ```{r acc-changes} hz_err_perc <- accuracy_summary_perc %>% filter(str_detect(cues, "Hz"), str_detect(contrast, "NSP")) %>% { set_names(pull(., accuracy), pull(., grouping)) } %>% map_dbl(~ round(100*(1-.x))) %>% map_chr(~ sprintf("%d%%", .x)) lob_err_perc <- accuracy_summary_perc %>% filter(str_detect(cues, "Lobanov")) %>% { set_names(pull(., accuracy), pull(., grouping)) } %>% map_dbl(~ round(100*(1-.x))) %>% map_chr(~ sprintf("%d%%", .x)) vot_err_perc <- accuracy_summary_perc %>% filter(str_detect(cues, "VOT")) %>% { set_names(pull(., accuracy), pull(., grouping)) } %>% map_dbl(~ round(100*(1-.x))) %>% map_chr(~ sprintf("%d%%", .x)) ``` Despite dramatic differences between vowels and stop voicing in the informativity of talker- and group-conditioned distributions (study 1), the results of this study show that the _utility_ of conditioning phonetic category judgements on talker or group is more comparable, especially for normalized formants. Using talker-specific cue distributions improves correct recognition of stop voicing and vowels by about log-odds of 0.5, except for un-normalized formants, where the improvement is more like 1.5 log-odds. This seems like a relatively small information gain, especially since marginal distributions themselves provide more than 4-6 times that much information gain over random guessing. However, when converted back to error percentage, the information gain from talker-specific distributions corresponds to avoiding about one out of every five errors: a change in error rate from `r lob_err_perc["Marginal"]` to `r lob_err_perc["Talker"]` for (normalized) NSP vowels, and from `r vot_err_perc["Marginal"]` to `r vot_err_perc["Talker"]` for stop voicing. These errors would not always lead to high-level misunderstanding, but avoiding them nevertheless reduces the burden on the listener to reconcile conflicting lexical, contextual, or phonetic information. While helpful, these differences in error rates show that using talker- or group-specific distributions is not a make-or-break factor in recognizing vowels or stop voicing. Rather, they make comprehension more robust and efficient. One major caveat is that this is only true for _normalized_ vowel formants. For raw Hz, using talker-specific distributions eliminates nearly two out of every three errors (`r hz_err_perc["Marginal"]` vs. `r hz_err_perc["Talker"]`). Using gender-specific distributions is only moderately helpful (error rate of `r hz_err_perc["Gender"]`).
This means that listeners can benefit greatly from extracting _some_ talker-specific factor. Whether that factor is the separate means and variances of each category, or the _overall_ mean and variance of each _cue_ (as is used in Lobanov normalization), is a question that remains to be addressed in future work. As I discuss further below, either of these is compatible with Bayesian models like the ideal adapter that learn from experience. # General Discussion Recent theories of speech recognition propose that listeners deal with talker variability by taking advantage of statistical contingencies between socio-indexical variables (talker identity, gender, dialect, etc.) and acoustic-phonetic cue distributions [@Kleinschmidt2015; @McMurray2011a; @Sumner2014]. A major question that these theories raise is _which_ contingencies listeners should learn and use. Listeners cannot learn and use every possible contingency, since they are limited by finite cognitive resources. Moreover, as I discuss below, listeners _should_ not draw on every possible contingency given their finite experience. As a first step towards answering this question, I used computational methods from ideal observer/adapter models to quantify the _degree_ and _structure_ of talker variability. I measured the extent to which a range of socio-indexical variables are 1) _informative_ about category-specific cue distributions and 2) _useful_ for recognizing phonetic categories, in two phonetic domains: vowels and word-initial stop voicing. Overall, I found that there is less talker variability for VOT than for vowel formants, and talker variability for VOT is less structured, at least according to the socio-indexical grouping variables investigated here. Variability in vowel formant distributions is _structured_, and a talker's dialect, gender, or the combination thereof are each informative about vowel-specific cue distributions. Moreover, tracking group- or talker-specific cue distributions also improves vowel recognition, although the biggest gains by far come from tracking the overall mean and variance of a talker's formants (disregarding category)---that is, the information required to normalize for overall shifts in formants. In the remainder of this paper, I discuss the implications of these results. First, the ideal adapter generally predicts that listeners should track conditional distributions for groups that are informative and useful for speech recognition. By directly quantifying the utility and informativity of a number of grouping variables, these results are a step towards making more specific predictions about what group-level representations listeners should maintain if, as assumed by the ideal adapter, they are taking advantage of the structure that is actually present in cross-talker variability. Second, I argue that my results shed light on why studies on perceptual learning have obtained seemingly conflicting results for different phonetic contrasts. Third and finally, I discuss how these measures of the informativity/utility of socio-indexical variables like gender, age, and dialect correspond to a _starting point_ for talker-specific learning. ## What to track? Even without taking into account processing limitations, an ideal adapter should not track _everything_. Rather, listeners should only track the joint distributions of variables that are informative/useful.
At the level of phonetic categories themselves, this means that (for instance) there is no reason for listeners to track vowel-specific distributions of temperature or barometric pressure. Likewise for socio-indexical grouping variables: listeners get no benefit from tracking separate distributions for different groups of talkers for a cue that does not systematically vary between those groups. In fact, it can actually _hurt_ a listener to track cue distributions at a level that is not informative. The reason for this is related to one of the most central challenges to learning, the bias-variance trade-off [@James2013, Section 2.2.2]. In general, the bias-variance trade-off says that accuracy is a function of two things: the _bias_ of the model (e.g., from being too simple or having the wrong structure) and the _variance_ of the model's parameter estimates (e.g., from not having enough data). For the present purpose, this means that tracking multiple distributions, each of which must be estimated from only a fraction of the available observations, will result in noisier, less accurate estimates than lumping all the observations together in a single distribution. This price may be worth paying for a listener when there are large enough differences between groups that treating all observations as coming from the same distribution _biases_ the estimates of the underlying distribution (and hence the inferences that listeners make based on those distributions) far enough away from the true structure of the data. To take a concrete example, modeling each vowel as a single distribution of (un-normalized) formants across all talkers results in broad, overlapping distributions which have low recognition accuracy. But modeling them as two distributions---one for males, and one for females---provides more specific estimates and higher classification accuracy, as shown by Figures @fig:vowel-vot-kl-plot and @fig:overall-accuracy-group-known [and in @Hillenbrand1995; @Feldman2013a]. Thus, the ideal adapter predicts that listeners should learn separate cue distributions for levels of a socio-indexical grouping variable when that variable has high _informativity_ about some categories' cue distributions and/or high _utility_ for speech recognition. To be precise, this is the prediction if the goal of speech perception is the robust inference of linguistic categories (such as phonetic or phonological categories, words, or phrases). Listeners also extract, for example, social and emotional information from the speech signal. Sociolinguistic research has recognized that, in many cases, the communication of social information is just as important as---if not more important than---the communication of linguistic information [@Clopper2006; @Clopper2007; @Cohen2012; @Eckert2012; @Labov1972; @Remez1997; @Thomas2002]. Groupings that are _socially meaningful_ can thus be informative and useful to track with respect to the overall communicative goal, which might include the robust transmission of social identity, emotional states, and more. This means that knowledge of the joint distribution of acoustic-phonetic cues and a socio-indexical grouping can have high utility, even if ignoring that grouping has a negligible effect on speech recognition, as long as the corresponding cue distributions carry some information about relevant social variables. @KleinschmidtInPress2017 discuss this further and extend the ideal adapter to social inferences.
That work---based on the same datasets I analyze here---found two examples where a socio-indexical variable can be inferred based on cue distributions, but which I found here to provide little if any additional utility for speech recognition. The first is dialect (based on vowel F1×F2) and the second is age (older than 40/younger than 30, based on VOT distributions). An additional consideration is that listeners are not simply told which variables are informative and which are not. They must _learn_ which distributions are actually worth tracking. Moreover, every listener's experience with talker variability will be different, and so a variable that is informative in one listener's experience may be irrelevant in another's. For example, the predictions I have derived here about the relative utility of different grouping variables for speech recognition would hold for listeners whose language experience is similar to that represented in the databases I employed. This has two main consequences for the predictions that the ideal adapter makes. First, this means that listeners' response to talker variability should depend on their own particular experience with talker variability. @Clopper2006 shows some evidence that this is indeed the case. Second, in order to derive predictions for a specific listener, we would need to know more details of their own personal history with talker variability. This is a difficult task, but the ideal adapter also provides tools to probe listeners' prior beliefs _directly_ [for first steps, see @Kleinschmidt2016]. Finally, I note that listeners' associations between linguistic and socio-indexical variables do not always seem to be based on the _objective_ informativity of those variables. Rather, some variants can become disproportionately _salient_ or _enregistered_ [@Eckert2012; @Podesva2001; @Podesva2007; @Foulkes2015a; @Levon2014; @Jaeger2016]. These deviations between objective informativity and subjective salience remain to be explained and specified in more detail, as does the connection---if any---between listeners' _explicit_ social perceptions and their ability to adapt to socially-indexed linguistic variation. The methods proposed here provide a set of tools for assessing objective informativity/utility, a critical first step in understanding this relationship. ## Consequences for adapting to unfamiliar talkers The results of this study also speak to how listeners might adapt to an unfamiliar talker. The ideal adapter links informativity and utility to adaptation, and the results here allow us to make more specific predictions based on the ideal adapter, in several ways. First, the informativity of talker identity is a measure of the variability across talkers. When talker identity is highly informative, there is more variability across talkers, and the ideal adapter predicts that prior experience with other talkers will be less relevant, resulting in faster and more complete adaptation to an unfamiliar talker. I found here that talker identity is less informative about VOT distributions than it is about vowel formant distributions. Hence, the ideal adapter predicts that listeners will adapt to talker-specific VOT distributions more slowly, and be more constrained by prior experience with other talkers, compared to talker-specific formant distributions. While I am not aware of a direct quantitative test of this specific prediction, existing evidence provides indirect support for it.
@Kraljic2007 found much smaller recalibration effects for a VOT contrast (on the order of 5% changes in classification) compared to a fricative contrast (around 30%) with the same amount of exposure to each. Studies on recalibration of a word-medial /b/-/d/ contrast---which is partially cued by formant frequencies, like vowel identity---show recalibration effects of similar magnitudes to fricatives [@Kleinschmidt2015; @Vroomen2007]. This prediction is also borne out indirectly by studies that have inferred the strength of listeners' prior expectations based on their adaptation behavior [@Kleinschmidt2015; @Kleinschmidt2016]. That work finds that listeners' prior expectations are stronger---as measured by an "effective prior sample size"---when adapting to a voicing contrast (like /b/-/p/) than a stop consonant place of articulation contrast (like /b/-/d/). Second, the informativity of socio-indexical grouping variables is linked to _generalization_ across talkers: if two talkers are from groups that tend to differ, listeners should be more inclined _a priori_ to treat them separately and not generalize from experience with one talker to the other. Likewise, if two talkers are from the same group, listeners _should_ generalize. I found that talker gender is informative about vowel formant distributions, but not about VOT, which means that listeners _should_ (absent other information) generalize from a male to a female talker (and vice-versa) for a voicing contrast, but _not_ for a vowel contrast. Listeners do, in fact, tend to generalize voicing recalibration across talkers of different genders [@Kraljic2006; @Kraljic2007]. While there is to my knowledge no data on cross-talker generalization for vowel recalibration, listeners tend not to generalize across talkers for recalibration of fricatives [@Eisner2005; @Kraljic2007], which (like vowels) are cued by spectral cues that vary across talkers and by gender [@Newman2001; @Jongman2000; @McMurray2011a]. Third, and conversely, listeners should be _more_ likely to generalize between two talkers who are both members of the same informative group. In the absence of evidence that two talkers from the same group (e.g., two males) produce a contrast differently, experience with one provides an informative starting point for comprehending (and adapting to) the other. There is evidence along these lines as well: @VanderZande2014 found that listeners generalize from experience with one male talker's pronunciation of a /b/-/d/ contrast to another, unfamiliar male. Note that such generalization should depend on how informative (and variable) a grouping variable like gender is _across contexts_, since generalization from experience with one other male talker in an experimental context is very different from generalization from _all_ other male talkers across all contexts. `r label("r1-context-generalization")` Finally, these predictions are best thought of as _prior biases_ that might be overcome with enough of the right kind of evidence [@Kleinschmidt2015]. For instance, listeners can overcome their bias to generalize experience with VOT and learn talker-specific VOT distributions, but it requires hundreds of observations from talkers who produce very different VOT distributions [@Munson2011]. Likewise, listeners will generalize recalibration of a fricative contrast from a female to a male talker when test stimuli are selected to increase perceptual similarity between the two test continua [@Reinisch2014].
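To give a concrete sense of what a prior bias amounts to in the ideal adapter, the toy sketch below (not one of the analyses reported here; all numbers are invented) shows how an effective prior sample size trades off against new evidence in a simple conjugate updating scheme for a single category mean: the stronger the prior, the more exposure it takes before beliefs move toward an atypical talker.

```{r prior-bias-sketch, eval=FALSE, echo=TRUE}
## Toy illustration of prior biases in belief updating: the posterior estimate
## of a talker's /p/ VOT mean is a weighted average of the prior mean and the
## observed sample mean, weighted by the effective prior sample size (kappa)
## and the number of tokens heard so far.
update_mean <- function(prior_mean, kappa, observations) {
  n <- length(observations)
  (kappa * prior_mean + n * mean(observations)) / (kappa + n)
}

prior_mean <- 60             ## hypothetical prior expectation for /p/ VOT (ms)
new_talker <- rep(90, 20)    ## 20 tokens from a talker with unusually long VOTs

update_mean(prior_mean, kappa = 10,  new_talker)  ## weak prior:   80 ms
update_mean(prior_mean, kappa = 200, new_talker)  ## strong prior: ~63 ms
```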
## A lower bound These results constitute a _lower bound_ on the informativity or utility of different levels of socio-indexical grouping. This is the case above and beyond the limitations imposed by the database that I discussed above (which required subsampling talkers in order to meaningfully compare accuracy across grouping variables). Here, cue distributions for a particular group are modeled as a _single_ normal distribution over observed cue values. In reality, a hierarchical model is more appropriate, since different levels of grouping can nest within each other, or combine orthogonally. For instance, each dialect group is likely better modeled as a _mixture_ of talker-specific distributions, which each exhibit dialect features to a varying degree. This is especially important for _adaptation_ to an unfamiliar talker, since a group-level distribution conflates _within_ and _between_ talker variation, both of which have separate roles to play in belief updating. The approach to group-level modeling that I take here is roughly equivalent to the _posterior predictive_ distribution of a fully hierarchical model, which integrates over lower levels of grouping to provide a single distribution of cues given the group (and phonetic category). This corresponds to the best guess a listener would have _before_ hearing anything from an unfamiliar talker, if the only information they had about that talker was their group membership. As the listener hears more cue values from the talker, the hierarchical nature of grouping structure becomes more important and can provide (in principle) a significant advantage over what I measured here. But modeling this process is quite a bit more complicated and is left for future work. Nevertheless, modeling each category as a single, "flat" distribution per group may well prove a useful approximation, or even a boundedly-rational model of how listeners take advantage of different levels of grouping structure [and similar approaches have been used in, e.g., motor control; @Kording2007]. ## Consequences for perspectives on normalization `r label("normalization-gen-dis")` Normalizing vowel formants with respect to each talker's overall mean and variance substantially reduces the amount of talker variability, and also changes the _structure_ of that variability: gender matters much less, while the effects of dialect become more apparent. Much of the work on vowel normalization treats normalization as a low-level auditory adaptation or habituation process that eliminates the need for active inferences on the listener's part [e.g., @Holt2006c; @Huang2012; @Laing2012; @Nearey1989; for a review see @WeatherholtzInPress]. But low-level sensory adaptation is increasingly recognized as a sort of distributional learning, much like the ideal adapter proposes for speech recognition [for a review of these parallels, see @Kleinschmidt2015b]. I used normalization as a methodological tool, but it would be possible to treat the normalization parameters as another aspect of a talker's particular language model that must be inferred, just like the means and (co-)variances of various individual vowel distributions. That is, it is possible that an ideal adapter would do better by learning talker-/group-specific distributions in a normalized space, and additionally inferring the normalization factors (shift, scaling, etc.) for each talker they encounter. 
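As a purely speculative sketch of what such inference could look like (all values below are hypothetical, and this is not an analysis from the present study): if a listener knows roughly where each vowel falls in normalized space, then a couple of raw-Hz tokens whose categories are known (e.g., from lexical context) are enough to solve for a talker's formant means and SDs directly, rather than waiting to observe the talker's whole vowel space.

```{r normalization-inference-sketch, eval=FALSE, echo=TRUE}
## Speculative sketch: recovering a talker's Lobanov normalization factors
## (per-formant mean and SD) from two category-labeled tokens, given assumed
## category locations in normalized (z-score) space.  Each token satisfies
## x = mu + sd * z, so two labeled tokens determine mu and sd per formant.
z_means <- rbind(i = c(F1 = -1.2, F2 =  1.5),   ## assumed /i/ location (z-scores)
                 a = c(F1 =  1.4, F2 = -0.3))   ## assumed /a/ location (z-scores)

## two category-labeled tokens from an unfamiliar talker (raw Hz, hypothetical)
tokens <- rbind(i = c(F1 = 310, F2 = 2600),
                a = c(F1 = 780, F2 = 1350))

talker_sd   <- (tokens["i", ] - tokens["a", ]) / (z_means["i", ] - z_means["a", ])
talker_mean <- tokens["i", ] - talker_sd * z_means["i", ]

talker_mean  ## estimated overall formant means (Hz)
talker_sd    ## estimated overall formant SDs (Hz)
```

With only a single labeled token the same logic goes through given prior expectations about typical formant SDs, which is part of what makes group-level knowledge of normalized-space distributions potentially useful even before much of a talker's speech has been heard.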
If this parallel is appropriate, then it suggests a more complex interaction between normalization and adaptation/perceptual learning as strategies for coping with talker variability, and makes a number of predictions. For instance, instead of just taking a running average of recent spectral content [@Huang2012] or using extreme vowels as "anchors" [as in many normalization methods; @Flynn2011], normalization could be accomplished much more efficiently by leveraging category-level information (which is often provided by, e.g. lexical context) and knowledge of cue distributions in normalized space: a single token of any vowel (with the category known) can provide enough information to get a reasonably good guess about the talker's normalization factors. This in turn predicts sensitivity to _both_ the un-normalized formant frequencies _and_ the normalized ones. In this case, group-level expectations that are only informative about distributions in normalized space (e.g., dialect for vowels) could nevertheless help with adaptation, even before a talker's entire cue space is known. Furthermore, @Chodroff2017 found that talker variation in VOT could also be largely characterized in terms of overall shifts/scaling of VOT distributions (as evidenced by large, positive correlations across talkers between the means and variances of different categories). This suggests that tracking talker-specific normalization factors may be a generally useful strategy across different phonetic contrasts (or even features). That is, listeners may benefit from factoring talker variation into components that are shared across categories and components that are shared across talkers (as I've examined here). But this parallel remains to be investigated in future work. # Conclusion I have demonstrated methods to quantify the amount and structure of talker variability in phonetic category-specific cue distributions. These methods are derived directly from the ideal adapter framework [@Kleinschmidt2015] which treats speech perception as a process of inference under uncertainty and variability. The results I present here for word-initial stop voicing (cued by VOT) and vowel identity (cued by F1×F2) are a first step towards making quantitative predictions with the ideal adapter about how listeners cope with different aspects of talker variability. They also provide a way of formalizing the salience or relevance of socio-indexical information that exemplar/episodic theories propose is stored alongside acoustic traces [e.g., @Sumner2014]. Finally, together with similar work showing that socio-indexical judgements can be modeled as the same kind of inference under uncertainty [@KleinschmidtInPress2017], this work suggests a framework for unifying psycholinguistic and sociolinguistic perspectives on talker variability. ```{r session-info, results="markup", message=TRUE} if (opts_knit$get("rmarkdown.pandoc.to") != 'latex') { options(width=100) devtools::session_info() } ``` ```{r refs, results="asis", echo=FALSE} if (opts_knit$get("rmarkdown.pandoc.to") != 'latex') { cat("# References") } ```