--- title: "Structure in talker variability: How much is there and how much can it help?" shorttitle: "Structure in talker variability" author: - Dave F. Kleinschmidt bibliography: talker-variability.bib abstract: > One of the persistent puzzles in understanding human speech perception is how listeners cope with talker variability. One thing that might help listeners is structure in talker variability: rather than varying randomly, talkers of the same gender, dialect, age, etc. tend to produce language in similar ways. Listeners are sensitive to this covariation between linguistic variation and socio-indexical variables. In this paper I present new techniques based on ideal observer models to quantify 1) the amount and type of structure in talker variation (_informativity_ of a grouping variable), and 2) how useful such structure can be for robust speech recognition in the face of talker variability (the _utility_ of a grouping variable). I demonstrate these techniques in two phonetic domains---word-initial stop voicing and vowel identity---and show that these domains have different amounts and types of talker variability, consistent with previous, impressionistic findings. An `R` package ([`phondisttools`](https://github.com/kleinschmidt/phondisttools)) accompanies this paper, and the source and data are available from [osf.io/zv6e3](https://osf.io/zv6e3/). authornote: > I gratefully acknowledge Cynthia Clopper, Shannon Heald, Andy Wedel, and Noah Nelson for sharing their measurements of speech production data with us. Without their generosity this work would not have been possible. The techniques proposed here were originally developed jointly with Kodi Weatherholtz. I thank Florian Jaeger for feedback on earlier versions of this work, as well as Rory Turnbull and two anonymous reviewers. I also thank the developers of the R language [@R2017] as well as the following packages: `tidyverse` [@tidyverse], `rmarkdown` [@rmarkdown], `knitr` [@knitr], `cowplot` [@cowplot], `mvtnorm` [@mvtnorm], and `ggbeeswarm` [@ggbeeswarm]. This work was partially funded by NIH NICHD R01 HD075797 and NIH NICHD F31 HD082893. The views expressed here are those of the author and not necessarily those of the funding agencies. Address correspondence about this article to Dave F. Kleinschmidt, Princeton Neuroscience Institute, Princeton University, Washington Road, Princeton NJ 08544, email output: html_document: code_folding: hide dev: png keep_md: true md_extensions: +implicit_figures+pipe_tables+table_captions pandoc_args: - --filter - pandoc-fignos - --filter - pandoc-tablenos - --filter - pandoc-eqnos - --csl - apa.csl pdf_document: md_extensions: +implicit_figures+tex_math_single_backslash+pipe_tables+table_captions keep_tex: true latex_engine: xelatex template: apa6.template.tex citation_package: biblatex dev: cairo_pdf pandoc_args: - --filter - pandoc-fignos - --filter - pandoc-tablenos - --filter - pandoc-eqnos mainfont: "CMU Serif" --- ```{r preamble, message=FALSE, warning=FALSE, error=FALSE, echo=FALSE, results='hide'} library(knitr) opts_chunk$set(message=FALSE, warning=FALSE, error=FALSE, echo=opts_knit$get("rmarkdown.pandoc.to") != 'latex', cache=TRUE, results="hide") ## Produce markdown-formatted figures so that pandoc knows what to do with ## the captions. requires pandoc-fignos to parse the IDs. 
refer to figures ## in text with {@fig:label} or just @fig:label ## ## (see https://github.com/tomduck/pandoc-fignos) knit_hooks$set(plot = function(x, options) { paste0('![', options$fig.cap, ']', '(', opts_knit$get('base.url'), paste(x, collapse='.'), ')', '{#fig:', options$label, '}') }) ## Produce markdown-formatted table captions with anchors for cross-refs. ## Requires pandoc-tablenos to parse the IDs. Refer to tables ## in text with {@tbl:label} or @tbl:label. ## Based partly on http://stackoverflow.com/a/18672268 ## ## (see https://github.com/tomduck/pandoc-tablenos) knit_hooks$set(tbl.cap = function(before, options, envir) { if(!before){ paste0('\n\nTable: ', options$tbl.cap, ' {#tbl:', options$label, '}', sep = '') } }) library(tidyverse) library(magrittr) library(rlang) library(stringr) library(forcats) library(purrrlyr) library(assertthat) library(mvtnorm) library(multidplyr) library(future) plan(multicore) library(ggbeeswarm) library(svglite) library(ggplot2) library(cowplot) library(ggrepel) ## devtools::install_github('kleinschmidt/daver') library(daver) ## devtools::install_github("kleinschmidt/phondisttools") library(phondisttools) ## devtools::install_github('kleinschmidt/nspvowels') library(nspvowels) ## devtools::install_github("kleinschmidt/healdvowels") library(healdvowels) ## devtools::install_github('kleinschmidt/votcorpora') library(votcorpora) ## cowplot theme + y axis gridlines theme_set(theme_cowplot() %+replace% theme(panel.grid.major = element_line(colour='gray90', size=0.2), panel.grid.minor = element_line(colour='gray98', size=0.5), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())) rotate_x_axis_labs <- function(by=45) theme(axis.text.x = element_text(angle=by, hjust=1)) assert_has_names <- function(x, names) { walk(names, ~assert_that(has_name(x, .))) } apply_groupings <- function(d, groupings) { assert_that(has_name(d, "data"), msg="d is not nested (no data column)") walk(d$data, ~assert_has_names(., groupings)) groupings %>% map(~ d %>% mutate(data = map(data, group_by_, .x), grouping = .x)) %>% reduce(bind_rows) } #' Train models for each combination of group and phonetic category #' #' For a variety of grouping levels. #' #' @param data_grouped a tbl with the grouped data. columns `data` (a list #' column with tbls suitable to pass to `phondisttools::train_models`, which are #' grouped according to the corresponding value of `grouping` and also have #' the columns named in category_col, and cue_cols) and `grouping` which names #' the grouping variable that defines the groups of `data`. #' @param category_col the name of the column in the elements of #' `data_grouped$data` that have the phonetic category #' @param cue_cols the name of the column(s) in the elements of #' `data_grouped$data` that have the cue values to train models on. #' #' @return a tbl with columns `grouping` and `models` (a list column of tbls, #' which hold trained models for each combination of `category_cols` and #' `group`), plus any additional columns in `data_grouped`. #' train_models_grouped <- function(data_grouped, category_col, cue_cols, ...) { ## check input format assert_that(has_name(data_grouped, 'data'), has_name(data_grouped, 'grouping')) train <- partial(phondisttools::train_models, grouping=category_col, cues=cue_cols, ...) 
data_grouped %>% mutate(models = map2(data, grouping, ~ train(.x) %>% rename_(group=.y)), data = NULL) } ``` ```{r grouping-and-colors, cache=TRUE} grouping_levels <- c('Marginal', 'Age', 'Gender', 'Dialect', 'Dialect+Gender', 'Talker') gender <- function(x) str_replace(x, "Sex", "Gender") prettier_grouping <- function(d, col=grouping) { col <- enquo(col) d %>% mutate(!!quo_name(col) := map_chr(!!col, str_replace, "_", "+") %>% map_chr(gender) %>% factor(levels=grouping_levels)) } scale_color_grouping <- function(groupings=grouping_levels, ...) { n <- length(groupings) colors <- c("#808080", "#FF8500", "#00D2F7", "#00C928", "#FF20AF", "#7A78FF") ggplot2:::manual_scale("colour", values = set_names(colors, groupings), ...) } pretty_contrast <- function(d) { mutate(d, contrast = factor(contrast, levels=c("Vowels (HN15)", "Vowels (NSP)", "Stop voicing"))) } ``` ```{r labeler} ## a small function that creates \label{}s for latex output if (opts_knit$get("rmarkdown.pandoc.to") == "latex") { label <- function(lab) paste0("\\label{", lab, "}") } else { label <- function(lab) "" } ``` # Introduction The apparent ease and robustness of spoken language understanding belie the considerable computational challenges involved in mapping speech input to linguistic categories. One of the biggest computational challenges stems from the fact that talkers differ from each other in how they pronounce the same phonetic contrast. One talker’s realization of /s/ (as in “seat”), for example, might sound like another talker’s realization of /ʃ/ (as in “sheet”) [@Newman2001]. During speech perception, such inter-talker variability contributes to the *lack of invariance* problem, creating uncertainty about the mapping between acoustic cues and linguistic categories [@Liberman1967]. A number of proposals for how listeners overcome this problem have been offered. A common theme that has emerged is that listeners seem to take advantage of statistical contingencies in the speech signal [for a recent review, see @WeatherholtzInPress]. These contingencies result in part from the fact that inter-talker variability is not random. Rather, inter-talker differences in the cue-to-category mapping are systematically conditioned by a range of factors. This includes both talker-specific anatomy of the vocal tract [@Fitch1999; @Johnson1993] and factors pertaining to a talker's social-indexical group memberships, such as age [@Lee1999], gender [@Perry2001; @Peterson1952], and dialect [@Labov2006]. Listeners seem to draw on these statistical contingencies between linguistic variability on the one hand and talker- and group-specific factors on the other. Upon encountering an unfamiliar talker, for example, the speech perception system seems to adjust the mapping of acoustic cues to linguistic categories to reflect that talker's specific distributional statistics [@Bejjanki2011; @Clayards2008; @Idemaru2011; @Kraljic2007; @McMurray2011a]. Listeners also seem to learn and draw on expectations about cue-category mappings based on a talker’s socio-indexical group memberships. For example, listeners have been found to adjust their speech recognition based on a talker's inferred regional origin [@Hay2010; @Niedzielski1999], gender [@Strand1999; @Johnson1999], age [@Walker2011], and individual identity [@Mitchel2016; @Nygaard1994]. Such talker- and group-specific knowledge is now broadly believed to be critical to speech perception [for reviews, see @Foulkes2015a; @WeatherholtzInPress]. 
An important question that has largely remained unaddressed, however, is how listeners determine _which_ socio-indexical (and other) talker properties should be used for speech perception. In other words, why do listeners group talkers by, for example, age and gender, rather than the color of their shirt? _A priori_, there is an essentially infinite number of ways for a listener to group the speech they have experienced in different situations. Intuitively, we might expect listeners to be sensitive to socio-indexical properties that are _relevant_ to speech perception. Some of the possible socio-indexical groupings will be highly informative about the future cue-category mappings that a listener can expect, while others will be uninformative, or even misleading. This paper seeks to formalize this intuition, in order to derive principled, quantifiable predictions for future work, drawing on a recently proposed computational framework, the ideal adapter [@Kleinschmidt2015].

`r label("r2-ideal-adapter-intro-ps")` The ideal adapter is a computational-level theory of human speech perception [in the sense of @Marr1982]. It seeks to explain aspects of speech perception by formalizing the _goals_ of speech perception and the _information_ available from the world. Like many computational-level models, it treats speech perception as a problem of _inference under uncertainty_, whereby listeners combine what they know about how speech is generated in order to recover (or infer) the most likely explanation for the speech sounds they hear. In this view, talker variability is a primary challenge for speech perception because the most likely explanation for a particular acoustic cue depends on the probabilistic distributions of cues for each possible explanation, and these distributions differ from talker to talker [e.g., @Allen2003; @Newman2001; @Hillenbrand1995].

The central insight of the ideal adapter is twofold. First, when talker variability is not completely random, there is a great deal of information available from _previous experience_ with other talkers about the probabilistic distribution of acoustic cues that correspond to each possible linguistic unit. Second, in order to benefit from this information listeners must actively _learn_ the underlying structure of the talker variability that they have previously experienced, and this learning can be modeled as statistical inference itself. In other words, according to the ideal adapter, robust speech perception depends on _inferring_ how talkers should be grouped together. Thus far this is just a re-statement of the original question---which groups of talkers are worth tracking together?---but the ideal adapter also provides the theoretical framework for _answering_ it. According to the ideal adapter this inference depends on two related but distinct factors. The first factor is whether there is any statistically reliable grouping to be learned in the first place: that is, whether a hypothetical grouping leads to better predictions about acoustic-phonetic cues. The second is whether grouping talkers in a particular way leads to better speech recognition.
That is, given a particular hypothesis about how previously encountered talkers might be grouped, an ideal adapter must ask themselves two questions: is this way of grouping talkers _informative_ about the acoustic-phonetic cue distributions that I have heard, and would grouping those distributions in this way be _useful_ for recognizing a future talker's linguistic intentions (e.g., phonetic categories)?[^socio-perc]

[^socio-perc]: There are other, potentially important uses for tracking group-specific distributions, even when they don't aid speech perception per se. For instance, listeners could use group-specific phonetic cue distributions to infer the age, gender, regional origin, etc. of an unfamiliar talker [@KleinschmidtInPress2017], and such inferences may play an important role in coordinating group behavior [e.g., @Cohen2012].

The answers to these questions can vary depending on the particular language, hypothetical grouping of talkers, and phonetic category, as well as each listener's idiosyncratic experience of talker variability. The goal of this paper is thus to not only show how these questions are formalized by the ideal adapter, but also to quantify the _amount_ and _structure_ of talker variability across two different phonetic domains (vowel identity and stop voicing).

`r label("sense-of-structure")` Note that there are a number of different senses in which talker variability might be structured. Here, I focus on the extent to which variability in the acoustic realization of phonetic categories _across talkers_ is predictable from socio-indexical or other grouping variables, and hence can support generalization based on previous experience. This is different from structure _across categories_, as in the covariation in talker-specific mean VOTs for /b/, /p/, /d/, etc. [e.g., @Chodroff2017], as well as structure _across cues_, within a single category [e.g. VOT and f0 for stop voicing, @Clayards2018; @Kirby2015; etc.]. All of these sorts of structure are complementary, because they mean that observations from one talker/category/cue dimension are informative about others, and I will return to this connection in the general discussion.

There are two main motivations for developing and testing these techniques. First, such quantitative assessments of the degree and structure of talker variability are a critical missing link in the research program set out by the ideal adapter. The ideal adapter makes predictions about when listeners should employ different strategies for coping with talker variation---when they should rapidly adapt, or maintain stable, long-term representations of particular talkers, or generalize from experience with one or a group of different talkers. These predictions depend in large part on how much and what kind of structure there is in talker variability. The techniques I propose here provide the necessary grounding to turn the qualitative predictions of @Kleinschmidt2015 into testable, quantitative predictions. Second, these techniques offer a general method for quantitatively assessing the structure of talker variability from speech production data in a variety of contexts, across phonetic systems, languages, and even levels of linguistic representation. A further advantage of the techniques proposed here is that they are directly, quantitatively comparable across different phonetic categories and sets of cues.
As such they are, I hope, generally useful to speech scientists and sociolinguists in a variety of theoretical frameworks, including exemplar/episodic accounts [@Johnson1997; @Goldinger1998; @Pierrehumbert2006] and normalization/cue-compensation accounts [e.g., @Cole2010; @Holt2005; @McMurray2011a]. For example, in exemplar/episodic accounts, it is sometimes assumed that speech inputs are stored along with "salient" social context [e.g., @Sumner2014]. What determines the salience of contexts is, however, left unspecified [for related discussion, see @Jaeger2016]. The informativity and utility measures explored here might serve to define and quantify salience. Additionally, the specific predictions I derive below pertain to native listeners' perception of native American English. However, this approach is more general, extending, for example, to non-native perception and native perception of foreign-accented speech. In service of this goal, I have developed an R [@R2017] package, [`phondisttools`](https://github.com/kleinschmidt/phondisttools). The code that generated this paper is available from [osf.io/zv6e3](https://osf.io/zv6e3/), in the form of an RMarkdown document, along with the datasets.

## Outline and preview of results

`r label("r2-intro-preview")` The rest of this paper is structured as follows. The following section presents the basic logic of the ideal adapter, which motivates the measures of informativity and utility. The section after that describes the general methods used to estimate phonetic cue distributions, and the data sets that are analyzed below. The section after that defines and examines the _informativity_ of socio-indexical variables about cue distributions themselves (Study 1). The results of Study 1 show that, at a broad level, vowels show more talker variability than stop voicing. This is consistent with previous, impressionistic findings but is based on a principled measure that allows _direct_ comparisons between the two phonetic systems, and serves as a proof of concept that this measure can be applied in other domains. At a more fine-grained level, these results also show that this variability is _structured_ by some socio-indexical variables, but not all, and that this structure depends on how cues themselves are represented. The fact that structure in talker variability _exists_ does not necessarily mean that it will be _useful_ in speech recognition---or, conversely, that ignoring it will be harmful---which motivates the notion of _utility_ that is defined and evaluated in Study 2. The results of Study 2 show, first, that informativity largely predicts utility: talker-specific cue distributions provide a consistent advantage over nearly every less-specific grouping of talkers, and groupings that were more informative than expected by chance also provide (often modest) improvements in successful recognition. Second, Study 2 finds that these gains in utility are often rather modest. Third, and relatedly, large differences in informativity do not always lead to similarly large differences in utility. Finally, in the general discussion I review the implications of these findings for understanding how listeners track talker variability in order to understand speech more robustly. On the one hand, these results suggest that meaningful groupings of talkers exist for listeners to learn from their experience, and that doing so can make speech perception more robust.
On the other hand, they show that not every socially-indexed way of grouping talkers is informative or useful for speech recognition per se, and that informativity and utility furthermore depend on the way that acoustic cues are represented. # The ideal adapter `r label("r2-ideal-adapter")` ```{r ideal-adapter-schematic, fig.width=10.5, fig.height=6, warning=FALSE, fig.cap="How well a listener can recognize the phonetic category [**A**, e.g., /s/ vs. /ʃ/; loosely based on @Newman2001] a talker is producing depends on what the listener knows about the underlying cue distributions (**B**). These distributions _vary_ across talkers, which results in variability in the best category boundary. Each talker's cue distributions can be characterized by their _parameters_ (**C**; e.g. the mean of /s/, mean of /ʃ/, variance of /s/, etc.; together denoted θ). Each point in **C** corresponds to a pair of distributions in **B** and one category boundary in **A**. _Groups_ of talkers are thus distributions in this high-dimensional space (**C**, ellipses); marginalizing (averaging) over a group smears out the category-specific distributions (thick lines in **B**) and thus the category boundary (**A**). Thus, Jose's /s/ and /ʃ/ are best classified using his own distributions (purple), in the sense that this leads to a steeper boundary at a different cue value compared to the boundary from the _marginal_ distributions over all talkers (gray) or other males (light blue)."} theme_schematic <- theme_cowplot() + theme(legend.position="none", plot.title= element_text(hjust=0)) theme_blank_y_axis <- theme(axis.title.y = element_blank(), axis.line.y = element_blank(), axis.ticks.y=element_blank(), axis.text.y=element_blank()) ## generate parameters n <- 30 s_sh_mu <- 0.8 s_sh_offset <- -1.5 sd_0 <- 1 sd_sd <- 0.1 s_sh_sd <- 0.02 gender_offset <- 3 who_labels = c("Judith (same lang)", "Jim (same gender+lang)", "Jose") %>% factor(., levels=.) whos = tibble(who = who_labels, who_short = c("language", "gender", "Jose")) who_colors <- set_names(c("#888888", "#00D2F7", "#7A78FF"), who_labels) set.seed(1002) params <- tibble(talker = seq_len(n), gender = rep(c("Male", "Female"), length.out=n), who = case_when( talker == 1 ~ who_labels[3], gender == "Male" ~ who_labels[2], TRUE ~ who_labels[1] ), mu_sum = rnorm(n) + gender_offset*(gender == "Female"), mu_diff = rnorm(n, mean = s_sh_offset, sd=.7), mu_s = mu_sum - mu_diff, mu_ʃ = mu_sum + mu_diff, sd_s = sd_0 + rnorm(n, sd=sd_sd), sd_ʃ = sd_s + rnorm(n, sd=s_sh_sd)) %>% mutate_at(vars(mu_s:sd_ʃ), funs( ./6 * (5900-5400) )) %>% mutate_at(vars(mu_s:mu_ʃ), funs( . + 5400)) ## PLot param space p_params <- ggplot(params, aes(x=mu_s, y=mu_ʃ, color=who)) + stat_ellipse(data = . %>% filter(gender=="Male"), type="norm", geom="polygon", fill=who_colors[2], color=NA, alpha=0.2) + stat_ellipse(data = . %>% mutate(who=NULL), type="norm", geom="polygon", fill=who_colors[1], color=NA, alpha=0.2) + geom_point(aes(size=who)) + geom_abline() + coord_equal() + geom_label_repel(data = . 
%>% group_by(who) %>% mutate(who2 = ifelse(row_number() == 1, as.character(who), "")), aes(label=who2), hjust = 0, box.padding = 2, label.size=NA, fill = gray(1, alpha=0.7)) + scale_color_manual(values = who_colors) + scale_fill_manual(values = who_colors) + scale_size_manual(values = c(1.5, 1.5, 3)) + theme_schematic + labs(x = "θ₁ = mean COG of /s/", y = "θ₂ = mean COG of /ʃ/") + annotate("text", x=5650, y=5100, label="p(θ | male)", color=who_colors[2], hjust=0) + annotate("text", x=5800, y=5250, label="p(θ | Am. Eng.)", color=who_colors[1], hjust=0) ## generate PDFs params_by_cat <- params %>% gather("param", "value", mu_s:sd_ʃ) %>% separate(param, c("param", "category"), "_") %>% spread("param", value) params_to_pdfs <- . %>% mutate(pdfs = map2(mu, sd, ~ tibble(cog = seq(min(mu-3*sd), max(mu+3*sd), length.out=1000), lhood = dnorm(cog, mean=.x, sd=.y)))) %>% unnest(pdfs) pdfs <- params_by_cat %>% params_to_pdfs param_samples <- params_by_cat %>% unnest(cog=map2(mu, sd, rnorm, n=1000)) group_pdfs <- c(quo(TRUE), quo(gender == "Male"), quo(talker == 1)) %>% map2(who_labels, ~ filter(param_samples, !!.x) %>% group_by(category) %>% summarise(mu=mean(cog), sd=sd(cog)) %>% mutate(who=.y)) %>% bind_rows() %>% params_to_pdfs ## plot PDFs pdf_labs <- group_pdfs %>% left_join(whos) %>% group_by(who, category) %>% filter(lhood == max(lhood), who_short == "Jose" | category == "s") %>% mutate(lhood_lab = map2_chr(category, who_short, ~ paste("p(COG | ", .x, ", ", .y, ")", sep=""))) p_pdfs <- ggplot(pdfs, aes(x=cog, y=lhood, linetype=category)) + geom_line(data = . %>% filter(talker != 1), aes(group=interaction(talker, category), color=who), alpha=0.2) + geom_line(data=group_pdfs, aes(group=interaction(who, category), color=who, size=who)) + geom_label(data = pdf_labs, aes(label=lhood_lab, color=who), label.size=NA, hjust=0, vjust=0, nudge_x=25, fill=gray(1, 0.8)) + scale_color_manual(values = who_colors) + scale_size_manual(values = set_names(c(1, 1, 1.5), who_labels)) + labs(x = "Spectral center of gravity (COG; Hz)") + theme_schematic + theme_blank_y_axis ## generate classification functions pdfs_to_posteriors <- . %>% select(-mu, -sd) %>% spread(category, lhood) %>% filter(!is.na(s), !is.na(ʃ), cog > quantile(cog, 0.2), cog < quantile(cog, 0.8)) %>% mutate(p_s = s / (s+ʃ)) post_labels <- group_pdfs %>% pdfs_to_posteriors %>% left_join(whos) %>% mutate(post_lab = map_chr(who_short, ~ paste0("p(s | COG, ", .x, ")"))) %>% group_by(who) %>% mutate(near80 = abs(cog - 5430), post_lab = ifelse(near80 == min(near80), post_lab, "")) ## plot classifcation functions p_posteriors <- pdfs %>% # filter(ifelse(category == "s", cog-1 <= mu, cog+1 >= mu)) %>% pdfs_to_posteriors %>% ggplot(aes(x=cog, y=p_s, color=who)) + geom_line(aes(group=talker), alpha=0.2) + geom_line(aes(size=who), data = pdfs_to_posteriors(group_pdfs)) + geom_label(data=post_labels %>% filter(post_lab != ""), aes(label=post_lab), hjust=0, vjust=1.1, label.size=NA, fill=gray(1, alpha=0.7)) + scale_size_manual(values = set_names(c(1, 1, 1.5), who_labels)) + scale_color_manual(values = who_colors) + theme_schematic ## put the whole thing together plot_grid(p_posteriors + theme_blank_y_axis + theme(axis.title.x = element_blank(), axis.text.x= element_blank()) + lims(x = range(pdfs$cog)) + ggtitle("A: Classification of categories /s/ vs. 
/ʃ/", subtitle="posterior probability p(/s/ | COG, group)"),
          p_pdfs + ggtitle("B: Cue distributions", subtitle="likelihood p(COG | category, ɡroup)"),
          ncol=1, rel_heights=c(1, 1.1)) %>%
  plot_grid(.,
            p_params + ggtitle("C: Talkers in distribution parameter space",
                               subtitle=expression(paste("p(θ = ", group("[", list(μ[s], μ[ʃ], σ[s], ...), "]"), " | group)"))),
            ncol=2)
```

This section briefly introduces the logic of the ideal adapter model [for a more detailed introduction, see @Kleinschmidt2015]. Figure @fig:ideal-adapter-schematic provides a hypothetical illustration of this logic for an /s/-/ʃ/ contrast [loosely based on @Newman2001]. Both the informativity and the utility of a particular grouping of talkers are defined based on the linguistic cue-category mappings for each implied group. In the ideal adapter, like other ideal observer models, these cue-category mappings are represented as _category-specific cue distributions_, or the probability distribution of observable cues associated with each underlying linguistic category [phoneme or phonetic category; @Clayards2008; @Feldman2009; @Norris2008]. This is a direct consequence of how these models treat perception as a process of inference under uncertainty, formalized using Bayes rule:

\[ p(\mathrm{category}=c | \mathrm{cue}=x) \propto p(\mathrm{cue}=x | \mathrm{category}=c) p(\mathrm{category}=c) \]

That is, the _posterior_ probability of category $c$ given an observed cue value $x$ is proportional to the _likelihood_ that that particular cue value would be generated if the talker intended to say $c$, $p(x | c)$, times the _prior_ probability, or how probable category $c$ is in the current context (regardless of the observed cue value). For good performance, the likelihood function $p(x|c)$ should be as close as possible to the actual distribution of cues that correspond to category $c$ in the current context. However, these cue distributions potentially _differ_ across contexts, due to talker variability (Figure @fig:ideal-adapter-schematic, B), and thus the ideal category boundaries can differ as well (Figure @fig:ideal-adapter-schematic, A). Listeners thus must also take into account their limited knowledge about the cue distributions, given what they know about who is currently talking.

The central insight of the ideal adapter [@Kleinschmidt2015] is that these uncertain beliefs can be modeled as another probability distribution, over the parameters of the category-specific distributions themselves $\theta$, given a talker of type $t$: $p(\theta | t)$ (Figure @fig:ideal-adapter-schematic, C). The type of talker could be a member of some socio-indexical group like $t=\mathrm{male}$ (blue), or a specific individual $t=\mathrm{Jose}$ (purple), or even a generic speaker $t=\mathrm{American\ English}$ (gray). In each case, the listener will have more or less uncertainty about the cue distributions that this type of talker will produce.
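To make the posterior computation concrete, the following minimal sketch computes $p(/s/ | \mathrm{COG})$ for a range of COG values under normal likelihoods, once with (hypothetical) talker-specific distributions and once with broader group-level distributions. All means, standard deviations, and priors here are illustrative values rather than estimates from the corpora analyzed below, and the chunk is not evaluated as part of the analyses.

```{r bayes-posterior-sketch, eval=FALSE}
# Minimal sketch of Bayes-rule classification with normal cue likelihoods.
# All parameter values are hypothetical illustrations, not corpus estimates.
library(tibble)

# posterior p(/s/ | x) for cue value x, given normal likelihoods for /s/ and /ʃ/
posterior_s <- function(x, mu_s, sd_s, mu_sh, sd_sh, prior_s = 0.5) {
  lik_s  <- dnorm(x, mean = mu_s,  sd = sd_s)  * prior_s        # p(x | /s/) p(/s/)
  lik_sh <- dnorm(x, mean = mu_sh, sd = sd_sh) * (1 - prior_s)  # p(x | /ʃ/) p(/ʃ/)
  lik_s / (lik_s + lik_sh)
}

cog <- seq(4500, 6500, by = 10)  # candidate spectral center-of-gravity values (Hz)
posteriors <- tibble(
  cog = cog,
  # talker-specific distributions: narrower, so a steeper category boundary
  talker = posterior_s(cog, mu_s = 5900, sd_s = 150, mu_sh = 5300, sd_sh = 150),
  # group-level distributions (e.g., all male talkers): wider, shallower boundary
  group  = posterior_s(cog, mu_s = 5800, sd_s = 350, mu_sh = 5200, sd_sh = 350)
)
```

With equal priors and equal variances, the posterior crosses 0.5 halfway between the two category means; the wider the cue distributions, the shallower the classification function, which is why the group-level boundaries in Figure @fig:ideal-adapter-schematic A are shallower than Jose's talker-specific boundary.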
Treating speech recognition as inference under uncertainty allows us to formalize how this additional uncertainty about the category-specific cue distributions affects speech recognition by _marginalizing_ over possible cue distributions in order to compute the likelihood (Figure @fig:ideal-adapter-schematic B, thick lines):

\[ p(x | c, t) = \int p(x | c, \theta) p(\theta | t) \mathrm{d}\theta \]

Marginalization is essentially a weighted average of the likelihood under each possible set of cue distributions $p(x|c,\theta)$, weighted by how likely those particular distributions are for a talker of a particular type, $p(\theta | t)$. As an example, the likelihood of a male talker producing /s/ with a spectral center of gravity (COG) of exactly 5500Hz is determined by averaging the likelihood of that COG value under a distribution with a mean of 5400Hz and a standard deviation of 80Hz, with the likelihood under every other possible combination of means and variances, each weighted by how likely it is that a male talker would produce that particular distribution for /s/. Thus, if a listener has grouped together all the male talkers they have previously encountered, they can use their knowledge of the group-level cue distributions to recognize speech from other male talkers they might encounter in the future.

The properties of these socio-indexically conditioned, category-specific cue distributions provide a natural way to measure how much a particular socio-indexical grouping variable is informative or useful with respect to a particular set of phonetic categories. As detailed below in Studies 1 and 2, _informativity_ is defined based on the group-level _distributions_ themselves (e.g., thick lines in Figure @fig:ideal-adapter-schematic B), while _utility_ is defined based on the classification functions/category boundaries those distributions imply (e.g., thick lines in Figure @fig:ideal-adapter-schematic A). These measures are derived directly from treating speech perception as a process of inference under uncertainty in the face of talker variability.

# General methods

## Measuring distributions

The socio-indexically conditioned, category-specific cue distributions were estimated in the following way. First, it is assumed that each phonetic category can be modeled as a normal distribution over cue values (stop voicing as univariate distributions over VOT, and vowels as bivariate distributions of F1 and F2). Each distribution is parameterized by its mean and covariance matrix (or, equivalently, variance in the case of VOT). Next, the mean and covariance were estimated from the samples of cue values in the corpora using the standard, unbiased estimators for the mean and covariance[^cov]. This was done separately for each group/talker, including the group of all talkers (to estimate the marginal distributions). For example, for gender, one /æ/ distribution was obtained from all the tokens from male talkers, and one from all tokens from female talkers. Likewise, for dialect, one distribution was obtained based on all tokens from talkers from the North dialect region, another one from tokens from Mid-Atlantic talkers, and so on.

[^cov]: Using the `mean` and `cov` functions in R 3.4.1 [@R2017].

Assuming that each category is a normal distribution is not a critical part of the proposed approach, but rather a standard and convenient assumption.
In particular, the normal distribution has a small number of parameters and this allows us to efficiently estimate the distribution for each category with a limited amount of data (e.g., five tokens per talker-level vowel distribution). But the proposed method is fully general, and works with _any_ distribution (including discrete or categorical distributions for phonotactics, syntax, etc.). An additional simplifying assumption here is that there is no further, talker-specific learning that occurs. In the ideal adapter, group-conditioned cue distributions reflect the _starting point_ for talker- or situation-specific distributional learning. As I discuss below, the measures I present are best thought of as a _lower-bound_ on informativity/utility that is much easier to estimate from small quantities of speech production data. ## Data sets I analyze the informativity and utility for two types of phonological contrasts, vowels (e.g., /æ/ and /ɛ/) and word-initial stop voicing (e.g., /b/ vs. /p/). I chose these two types of contrast for two reasons. First, for American English the primary acoustic-phonetic cues to vowel identity (formants) and stop voicing (voice onset timing or VOT) are broadly thought to exhibit very different patterns of variability across talkers and talker groups. Indeed, there is at least qualitative evidence in support of this assumption. For example, vowel formants in American English exhibit substantial variability conditioned on the gender and the regional background of the talker [@Peterson1952; @Hillenbrand1995; @Clopper2005; @Labov2006, among others]. On the other hand, word-initial stop VOTs appear to be less variable across talkers in American English. Specifically, cross-talker variation in voiceless word-initial stop VOT is roughly _half_ of within-category variation: visual `r label("vot-within-between-sd")` inspection of Figure 1 in @Chodroff2015 suggests that the mean standard deviation of /p/ is around 20ms, while the standard deviation of the mean VOT of /p/ is less than 10ms (based on a range of 40ms). Cross-talker variability in vowel formants is approximately _double_ the within-category variability [based on Figure 4 in @Hillenbrand1995]. This qualitative difference, and the lack of direct apples-to-apples comparisons between them, makes vowels and word initial stops an interesting combination of contrasts to compare for the present purpose. Second, while the overall level of talker variability for word-initial stop VOTs is lower, there is some evidence that it is nevertheless _structured_ by age, gender, and dialect, among other factors [@Torre2009; @Stuart-smith2015]. I thus expect to find both 1) significant differences in the overall informativity/utility of any socio-indexical variable when comparing across the two types of contrasts (vowels and word initial stops), and 2) significant differences in the informativity/utility within either contrast type when comparing across socio-indexical variables. For vowels, I further assess the consequences of _normalization_ on the informativity/utility of different socio-indexical variables. Vowel formants vary based on physiological differences between talkers (e.g., the size of the vocal tract), and there is evidence that vowel recognition draws on normalized formants---transformations of the raw formant values that adjust for physiological differences [e.g. @Lobanov1971; @Loyd1890; @Monahan2010; for review, see @WeatherholtzInPress]. 
This approach allows us to compare the informativity/utility of socio-indexical variables for raw vs. normalized vowel formants. The particular datasets I analyze here are drawn from three publicly available sources: two collections of elicited vowel productions [@Heald2015; @Clopper2005] and one of word-initial voiced and voiceless stops from unscripted speech [@Nelson2017].[^r-data] These sources were selected because they are annotated for the acoustic-phonetic cues that are standardly considered to be the primary cues to the relevant phonological contrasts (i.e., formants for the vowel productions, voice onset timing for the stop productions), measured under sufficiently controlled conditions to allow meaningful comparisons across talkers, and contain enough tokens from multiple phonetic categories produced by a sufficiently large and diverse population of talkers. The last property is particularly important for the goal of assessing the _joint_ statistical contingencies between socio-indexical variables, linguistic categories, and acoustic-phonetic cues. [^r-data]: All three are available as R packages on Github: [`nspvowels`](https://github.com/kleinschmidt/nspvowels), [`healdvowels`](https://github.com/kleinschmidt/healdvowels), and [`votcorpora`](https://github.com/kleinschmidt/votcorpora) (which contains additional VOT measurements from other sources as well). ### Vowels ```{r nsp-data, cache=TRUE, results='hide'} nsp_vows <- nspvowels::nsp_vows %>% ungroup() %>% mutate(Marginal='all', Dialect_Sex = paste(Sex, Dialect, sep='_'), Vowel=Vowel_ipa) nsp_vows_lob <- nsp_vows %>% group_by(Talker) %>% mutate_at(c("F1", "F2"), funs(. %>% scale() %>% as.numeric())) %>% ungroup() ## check normalization nsp_vows_lob %>% gather(formant, value, F1:F2) %>% group_by(Talker, formant) %>% summarise_at("value", funs(mean, sd)) %$% assert_that(all.equal(mean, rep(0, length(mean))), all.equal(sd, rep(1, length(sd)))) vowel_data <- data_frame(cues = c("F1×F2 (Hz)", "F1×F2 (Lobanov)"), contrast = "Vowels (NSP)", data = list(nsp_vows, nsp_vows_lob)) vowel_groupings <- c('Marginal', 'Sex', 'Dialect', "Dialect_Sex", 'Talker') vowel_data_grouped <- apply_groupings(vowel_data, vowel_groupings) token_per_vow <- nsp_vows %>% group_by(Talker, Vowel) %>% tally() %$% mean(n) n_talkers <- nsp_vows %>% group_by(Talker) %>% summarise() %>% tally() n_per_dialect_sex <- nsp_vows %>% group_by(Dialect, Sex, Talker) %>% summarise() %>% tally() %$% unique(n) n_dialect <- nsp_vows %$% Dialect %>% unique() %>% length() ``` For vowels, I used two datasets. The first is from the Nationwide Speech Project [NSP; @Clopper2006b]. I analyzed first and second formant frequencies (F1×F2, measured in Hertz) recorded at vowel midpoints in isolated, read "hVd" words (e.g., "head", "hid", "had", etc.). This corpus contains `r n_talkers` talkers, `r n_per_dialect_sex` male and female from each of `r n_dialect` regional varieties of American English: North, New England, Midland, Mid-Atlantic, South, and West [see map and summary of typical patterns of variation in @Clopper2005; regions based on @Labov2006]. `r label("nsp-dialect1")` Each talker provided approximately `r round(token_per_vow, 1)` repetitions of each of 11 English monophthong vowels /`r nsp_vows %>% pull(Vowel) %>% levels() %>% lift_dv(paste, sep=", ")()`/, for a total of `r nrow(nsp_vows)` observations. Talkers were recorded in the early 2000s, and were all of approximately the same age, so age-graded sound changes are not likely to be detectable from this dataset. 
`r label("age-graded")` ```{r heald-data} # gender isn't marked but we can figure it out based on the formant values and # the fact that there's three males and five females: male_speakers <- healdvowels::by_speaker %>% unnest(map(model, ~ .$mu %>% as.list() %>% as.data.frame())) %>% group_by(Vowel) %>% arrange(F1) %>% filter(row_number() <= 3) %>% group_by(Speaker) %>% summarise(n=n()) %>% arrange(n) %>% tail(3) %T>% print() %>% pull(Speaker) #' Sample from a multivariate normal model #' #' @param n number of samples #' @param model list with mu and Sigma (mean and covariance) #' #' @return a matrix of samples, with column names taken from the names of the mu #' vector. #' r_model <- function(n, model) { x <- rmvnorm(n, mean=model$mu, sigma=model$Sigma) colnames(x) <- names(model$mu) x } # create gender models heald_by_sex <- healdvowels::by_speaker %>% mutate(Gender = ifelse(Speaker %in% male_speakers, "m", "f"), samples = map(model, r_model, n=1000)) %>% unnest(map(samples, as_data_frame)) %>% nest() %>% apply_groupings("Gender") %>% train_models_grouped(category_col = "Vowel", cue_cols=c("F1", "F2")) strip_f3 <- function(models) { models %>% mutate(model = map(model, update_list, mu = ~mu[1:2], Sigma = ~Sigma[1:2, 1:2])) } heald_models <- healdvowels::models %>% mutate(models = map(models, strip_f3)) %>% bind_rows(heald_by_sex) %>% mutate(cues = "F1×F2 (Hz)", contrast = "Vowels (HN15)") %>% filter(grouping %in% grouping_levels) ``` The second is from a study by @Heald2015. Eight talkers (5 female and 3 male) produced 90 repetitions of 7 monophthong American English vowels /`r healdvowels::by_speaker %>% pull(Vowel) %>% factor(levels=levels(nsp_vows$Vowel)) %>% unique() %>% sort() %>% lift_dv(paste, sep=", ")()`/ over 9 sessions. Due to Human Subject Protocols, this dataset is only available in the form of F1×F2 means and covariance matrices for each category, conditioned on talker, gender, and the marginal distributions. Unlike the NSP, the talkers recorded by @Heald2015 are all from the same American English dialect region (Inland North), and so there is likely less talker variability overall relative to the NSP talkers. #### Vowel normalization One main goal of this paper is to assess not just the degree but the _structure_ of talker variability. Much of the variability in vowel formants is due to physiological differences between talkers' vocal tract size, which increase or decrease all resonant frequencies together [@Loyd1890]. This produces global shifts in talkers' vowel spaces, that apply relatively uniformly across all vowels. In contrast, sociolinguistic factors like dialect can affect the cue-category mapping for individual vowels. Even gender-based differences in the cue-category mappings of vowels have been found to vary cross-linguistically, suggesting that they are partially stylistic [@Johnson2006]. In order to assess how much these category-general shifts contribute to talker variability in vowel formant distributions, I analyze formant frequencies from the NSP[^healdnorm] represented in raw Hz, and also in Lobanov-normalized form. Lobanov normalization z-scores F1 and F2 separately for each talker [@Lobanov1971], which effectively aligns each talker's vowel space at its center of gravity, and scales it so they have the same size (as measured by standard deviation). This controls for overall offset in formant frequencies caused by varying vocal tract sizes (from both gender differences and individual variation). 
It does this while preserving the structure of each talker's vowel space, so that (for instance) dialect-specific vowel shifts are maintained, as we will see below. `r label("r1-norm-methodological")` Note that this is one of many possible normalization methods [see @Flynn2011; @Adank2004], and it is used here as a methodological tool, rather than a cognitive model of how normalization might work itself. The selection of this particular normalization method was driven primarily by methodological constraints: it provides good alignment of talkers' overall vowel spaces, and does not require additional cues that are not included in our data sources [like fundamental frequencies and higher formants required by vowel-intrinsic normalization methods; @Flynn2011; @WeatherholtzInPress]. Normalization and learning (adaptation) are often framed as _alternative_ models for how listeners cope with talker variability, but they are not mutually exclusive [@WeatherholtzInPress] and "hybrid models" may even be possible (as I briefly discuss in the general discussion).

[^healdnorm]: Without access to the raw data, it is not possible to normalize the @Heald2015 vowels.

### Stop voicing

```{r vot-data, cache=TRUE}
vot <- votcorpora::vot %>%
  filter(source == 'buckeye') %>%
  rename(Talker = subject, Sex = sex, Age = age_group) %>%
  group_by(phoneme, Talker) %>%
  mutate(Token = row_number(), cues = 'VOT', contrast = "Stop voicing") %>%
  ungroup() %>%
  mutate(Marginal = 'all')

vot_by_place <- vot %>% group_by(place, cues, contrast) %>% nest()

vot_groupings <- c('Marginal', 'Sex', 'Age', 'Talker')
vot_by_place_grouped <- apply_groupings(vot_by_place, vot_groupings)

n_vot_talkers <- vot %>% group_by(Talker) %>% summarise() %>% nrow()
n_vot_per_talker <- vot %>% group_by(phoneme, voicing, place, Talker) %>% tally()
```

I also analyzed data on word-initial stop consonant voicing in conversational speech from the Buckeye corpus [@Pitt2007; extracted by @Nelson2017; @Wedel2018]. @Nelson2017 manually measured VOT for `r nrow(vot)` word-initial stops with labial (/p,b/), coronal (/t,d/), or dorsal (/k,g/) places of articulation. Of these, `r vot %>% filter(voicing=='voiced') %>% nrow()` were voiced and `r vot %>% filter(voicing=='voiceless') %>% nrow()` were voiceless. Data came from `r n_vot_talkers` talkers, who were approximately balanced between male and female and between younger than 30 and older than 40 years (Table @tbl:talkers-per-group). On average, each talker produced `r round(mean(n_vot_per_talker$n))` tokens per word-initial stop phoneme (range of `r min(n_vot_per_talker$n)` -- `r max(n_vot_per_talker$n)`). @Nelson2017 excluded words with more than two syllables, function words, as well as words that began an utterance, followed a filled pause, disfluency, or another consonant. They also excluded tokens with VOT or closure length "more than 3 standard deviations from the speaker-specific mean for that stop" [@Nelson2017 p. 8]. They did not, unlike many previous studies on VOT, exclude words with complex onsets (a stop followed by a liquid or a glide).

In modeling VOT as a cue to voicing, I chose to model each place of articulation separately. This is because there is some variation in VOT as a result of place of articulation, and treating, for instance, voiceless tokens from all three places as coming from the same distribution could obscure talker-level variation and bias the results against detecting talker- or group-level variation in VOT.
Moreover, VOT in English can vary as a result of speaking rate, both at the level of the talker and of individual tokens [@Sole2007]. In principle, it would be interesting to investigate the effect of using normalized VOT. However, in order to meaningfully compare with the normalized vowel formants investigated here, a token-extrinsic (or talker-level) normalization procedure is needed, because a token-intrinsic procedure would eliminate token-to-token variation in speaking rate as well as overall talker effects, while the Lobanov normalization used for vowels eliminates only talker-level effects. Using a Lobanov-like z-scoring technique may lead to artifacts because of the large differences in the variance of voiced and voiceless distributions. As a result, investigating the effect of normalization on informativity and utility for voicing is left for future work. `r label("vot-normalization")`

## Socio-indexical grouping variables

| Grouping       | Vowels (NSP)                  | Vowels (HN15)                   | Stop voicing (VOT)                |
|----------------|-------------------------------|---------------------------------|-----------------------------------|
| Marginal       | **1** group of **48** talkers | **1** group of **8**            | **1** group of **24**             |
| Age            | N/A                           | N/A                             | **2** groups of **10** and **14** |
| Gender         | **2** groups of **24**        | **2** groups of **3** and **5** | **2** groups of **11** and **13** |
| Dialect        | **6** groups of **8**         | N/A                             | N/A                               |
| Dialect+Gender | **12** groups of **4**        | N/A                             | N/A                               |
| Talker         | **48** groups                 | **8** groups                    | **24** groups                     |

Table: Socio-indexical variables analyzed here, and distribution of talkers across groups in each corpus. See below for more detail on each of the corpora. {#tbl:talkers-per-group}

Based on the variables annotated in the available data, I consider cue distributions for each phonetic category conditioned on the following socio-indexical grouping variables, roughly in order of specificity (number of talkers in each group):

* __Marginal__: control grouping, which includes all tokens for the category from all talkers. This serves as a baseline against which more specific group distributions can be compared, and as a lower bound for speech recognition accuracy.
* __Gender__: coded as male/female for both vowels and stop voicing, allowing us to compare the role of gender-specific variation for two different contrasts.
* __Age__: coded as older than 40/younger than 30 for VOT (in the Buckeye corpus). Not applicable to vowels, because the talkers are uniformly young by this cutoff.
* __Dialect__: the NSP contains data from talkers from six dialect regions (see below for details). Not applicable to VOT or to vowels from @Heald2015.
* __Dialect+Gender__: @Clopper2006b found that gender modulates dialect differences, so I also examined cue distributions conditioned on dialect and gender together (12 levels).
* __Talker__: for all corpora, talker-specific cue distributions serve as an upper bound on informativity and utility.

Note that when considering one socio-indexical grouping like age, this method ignores other grouping variables like dialect, gender, or talker. That is, when asking how informative or useful the variable of age is, we are asking what a listener would gain by knowing *only* the age (group) of an unfamiliar talker.

Next, I present two studies that apply the two measures of structure in talker variability to these datasets. First, I show how to assess the _informativity_ of these different grouping variables about the cue distributions themselves.
Then, I assess the _utility_ of these different grouping variables, in terms of how they affect the accuracy of correct recognition. # Study 1: How _informative_ are socio-indexical groups about vowel formant and VOT distributions? ```{r overlap-figure, fig.width=7.4, fig.height=5.6, fig.cap="Gender-specific distributions of vowel formants for /i/ appear to diverge from the overall (marginal) distributions (A), whereas for VOT the gender-specific distributions are essentially indistinguishable from the marginal distributions. Intuitively, this makes gender informative for vowel formants, but not for VOT [see also vowels in @Perry2001; vs. VOT in @Morris2008]. The proposed approach formalizes this intuition in a quantitative measure that can be applied to directly compare talker variability across different cues, phonetic contrasts, and socio-indexical grouping variables. Vowel data is drawn from the Nationwide Speech Project, and VOT from the Buckeye corpus (see below for more details)."} marg_color <- "grey50" vowel_p <- nspvowels::nsp_vows %>% filter(Vowel_ipa == "i") %>% mutate(Gender = forcats::fct_recode(Sex, Male="m", Female="f")) %>% ggplot(aes(x=F1)) + stat_density(data = .%>%select(-Gender), color=marg_color, fill=NA) + stat_density(aes(color=Gender), fill=NA, position="identity") + facet_wrap(~Gender) + lims(x = c(150, 550)) + theme(legend.position="hide", strip.background=element_blank(), strip.text=element_blank(), axis.line.y=element_blank(), axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank(), panel.grid.major.y=element_blank() ) + geom_text( data=tribble( ~Gender, ~F1, ~density, ~label, "Female",400, 0.0075, "p(F1 | /i/, Female)", "Male", 330, 0.0075, "p(F1 | /i/, Male)"), aes(color=Gender, y=density, label=label), vjust=1, hjust=0) + geom_label(data=data.frame(F1=420, density=0.003, label="p(F1 | /i/)"), aes(x=F1, y=density, label=label), color=marg_color, alpha=0.75, label.size=NA, hjust=0, vjust=1) + labs(x="First formant (F1, Hz)") + ggtitle(label="", subtitle="Distribution of F1 of /i/ by gender") + scale_color_manual(values = c("#FF20AF", "#00D2F7")) ## vot_p <- votcorpora::vot %>% filter(source=="buckeye", voicing == "voiceless") %>% mutate(Gender = forcats::fct_recode(sex, Male="m", Female="f")) %>% ggplot(aes(x=vot)) + stat_density(data = .%>%select(-Gender), color=marg_color, fill=NA) + stat_density(aes(color=Gender), fill=NA, position="identity") + facet_wrap(~Gender) + lims(x=c(0, NA)) + theme(legend.position="hide", strip.background=element_blank(), strip.text=element_blank(), axis.line.y=element_blank(), axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank(), panel.grid.major.y=element_blank() ) + geom_text( data=tribble( ~Gender, ~vot, ~density, ~label, "Female",80, 0.014, "p(VOT | voiceless, Female)", "Male", 80, 0.014, "p(VOT | voiceless, Male)"), aes(color=Gender, y=density, label=label), vjust=1, hjust=0) + geom_label(data=data.frame(vot=100, density=0.007, label="p(VOT | voiceless)"), aes(x=vot, y=density, label=label), color=marg_color, alpha=0.75, label.size=NA, hjust=0, vjust=1) + labs(x="Voice-onset time (VOT, ms)") + ggtitle(label="", subtitle="Distribution of voiceless stop VOT by gender") + scale_color_manual(values = c("#FF20AF", "#00D2F7")) ## plot_grid(vowel_p, vot_p, ncol=1, align=TRUE, labels=c("A", "B")) ``` The first method I propose for assessing structure in talker variability is to measure how _informative_ socio-indexical variables are about the category-specific cue 
distributions. One way to quantify how informative a socio-indexical grouping variable is about cue distributions is by comparing the group-level cue distributions with the _marginal_ distribution of cues from all groups. The reason for this is that if a socio-indexical grouping variable (e.g., gender) is _not_ informative about cue distributions, then the cue distributions for each group (e.g., male and female talkers) will be indistinguishable from the overall "marginal" cue distribution (e.g., Figure @fig:overlap-figure B). If, on the other hand, a socio-indexical variable _is_ informative about cue distributions, then the distribution for each group will deviate substantially from those of other groups, and by extension from the marginal distribution as well (Figure @fig:overlap-figure A).

The particular measure I use to compare distributions is the Kullback–Leibler (KL) divergence. This measure is intuitively similar to the proportion of variance explained by a socio-indexical grouping variable [e.g., for gender and region in Dutch vowels @Adank2004; for various contextual variables including talker in American English fricatives @McMurray2011a]. However, it is a more general approach that does not require that we assume that the underlying distributions are normal distributions, and can be applied even to categorical variables (like distributions of words or syntactic structures). It also naturally extends to multidimensional cue spaces, taking into account the correlations between cues, and supporting comparisons to other cue spaces.

## Methods

The KL divergence is a measure of how much a probability distribution $Q$ diverges from a "true" distribution $P$. In this case, the distributions are over phonetic cues (VOT or F1×F2), and the "true" distribution is the distribution conditioned on a socio-indexical variable (e.g., gender), while the comparison distribution is the marginal distribution, which ignores any socio-indexical grouping. Intuitively, the KL divergence measures the loss of information when you use a code optimized for $Q$ to encode values from $P$. For instance, the frequencies of letters in English sentences are very different from those of French sentences. If we use a binary representation of letters that is optimized to make the representation of French sentences as short as possible (while still unambiguous), applying the same representation to English sentences will result in longer forms than a code that is optimized for the frequencies of English letters (and vice versa). That difference is the KL divergence (measured in bits) of the distribution of letters in French from that of English. Similarly, a code optimized for the marginal distribution of letters from both French and English _combined_ will result in sub-optimal encoding for _both_ English and French sentences, and the degree of sub-optimality provides a measure of how much the language matters in understanding the distribution of letters.

Here, the KL divergence is used in an analogous way to measure the _informativity_ of a socio-indexical variable (e.g., gender) with respect to phonetic cue distributions (e.g., VOT). Specifically, informativity is defined as the KL divergence of the marginal distribution of phonetic cues (e.g., $p(\mathrm{VOT} | \mathrm{category})$) from each of the socio-indexically-conditioned distributions (e.g., $p(\mathrm{VOT} | \mathrm{category}, \mathrm{gender})$).
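As a concrete illustration of this definition, the following minimal sketch computes informativity for a univariate cue (VOT) and the grouping variable of gender, using the closed-form KL divergence between two normal distributions (the multivariate version of this expression is given in the Technical details below). The function and column names (`informativity`, `kl_norm_bits`, `vot`, `voicing`, `Gender`) are illustrative, not the `phondisttools` implementation.

```{r informativity-sketch, eval=FALSE}
# Minimal sketch: informativity of a grouping variable as the average KL
# divergence (in bits) of the marginal cue distribution from each
# group-conditioned distribution, assuming univariate normal distributions.
# Function and column names are illustrative, not the phondisttools API.
library(dplyr)

# KL divergence of N(mu_q, sd_q) (e.g., the marginal) from N(mu_p, sd_p)
# (e.g., a group-conditioned distribution), converted from nats to bits
kl_norm_bits <- function(mu_p, sd_p, mu_q, sd_q) {
  (log(sd_q / sd_p) + (sd_p^2 + (mu_p - mu_q)^2) / (2 * sd_q^2) - 1/2) / log(2)
}

# d: one row per token, with columns vot (cue), voicing (category), Gender (group)
informativity <- function(d) {
  by_group <- d %>%
    group_by(voicing, Gender) %>%
    summarise(mu = mean(vot), sd = sd(vot)) %>%
    ungroup()
  marginal <- d %>%
    group_by(voicing) %>%
    summarise(mu_m = mean(vot), sd_m = sd(vot)) %>%
    ungroup()
  # average over categories and groups
  by_group %>%
    left_join(marginal, by = "voicing") %>%
    summarise(informativity = mean(kl_norm_bits(mu, sd, mu_m, sd_m))) %>%
    pull(informativity)
}
```

Note that with finite samples the estimated divergence is never exactly zero, even for an uninformative grouping, which is one reason the permutation test described below is needed to establish a chance baseline.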

### Procedure

For each phonological category (e.g., /b/), I calculate the KL divergence of each group's cue distribution (e.g., /b/-specific VOTs for male vs. female talkers) from the marginal distribution of cues from all talkers (e.g., /b/-specific VOTs regardless of the talker's gender). I then average across the category-specific KL divergences for all phonological categories (e.g., /b,p,t,d,k,g/) to calculate the average KL divergence for that phonetic cue (e.g., VOT) and group (e.g., male). Finally, for each grouping variable, I further average these group-specific divergences (e.g., male and female) to get the overall informativity for the grouping variable (gender).

I average over categories for two reasons. First, it is mathematically convenient: the KL divergence between two normal distributions can be computed in closed form, whereas for a mixture of multiple distributions it would have to be estimated through computationally costly numerical integration. Second, averaging over categories naturally adjusts for differences in the number of vowel (7--11) and stop voicing (2) categories.

The resulting informativity scores can be evaluated with a permutation test, by randomly shuffling the group labels 1000 times and repeating the calculation. The resulting distribution of shuffled scores is an estimate of the _null_ distribution of informativity for the same cues, which controls for the number of talkers in each group and the intrinsic properties of the cue distributions. For the grouping variable of Talker, the labels are permuted by token; for all other grouping variables they are shuffled at the talker level.

### Technical details

The KL divergence measures how much better the "true" distribution predicts data that is actually drawn from that distribution than the candidate distribution does. Mathematically, the KL divergence of $Q$ from $P$ is defined to be
$$D_{\mathrm{KL}}(Q \,\|\, P) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x$$ {#eq:kl}
(with density functions $q$ and $p$, respectively). The term $\log p(x)/q(x)$ measures how much more (or less, if negative) probability $P$ assigns to a point $x$ than $Q$ does. The KL divergence is the average of this quantity over all data that could be generated by $P$, weighted by the probability $p(x)$ that each $x$ would be generated by $P$. The KL divergence increases as $Q$ diverges more from $P$, and has a minimum value of zero, which is achieved only when $P=Q$, i.e., when the two distributions are identical [@Mackay2003, p. 34].

In this case, $P=\mathcal{N}_G$ is a multivariate Normal cue distribution conditioned on a socio-indexical group, with mean $\mu_G$ and covariance $\Sigma_G$, while $Q=\mathcal{N}_M$ is the marginal (not conditioned on group) cue distribution with mean $\mu_M$ and covariance $\Sigma_M$. With some simplification,[^gaus-kl] the KL divergence of the marginal from the group distribution works out to be
$$ D_{\mathrm{KL}}(\mathcal{N}_M \,\|\, \mathcal{N}_G) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_M^{-1} \Sigma_G) + (\mu_M - \mu_G)^\top \Sigma_M^{-1} (\mu_M - \mu_G) - d + \log\frac{|\Sigma_M|}{|\Sigma_G|} \right) $$ {#eq:klnorm}
where $d$ is the dimensionality of the distribution (i.e., 1 for stop VOTs and 2 for vowel F1×F2). The base of the logarithm in equation @eq:kl determines the units. For ease of interpretation, I report KL in bits, which corresponds to using base-2 logarithms in equation @eq:kl, or equivalently dividing equation @eq:klnorm by $\log(2)$.

[^gaus-kl]: See, for instance, <http://stanford.edu/~jduchi/projects/general_notes.pdf>, p. 13.
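
For concreteness, equation @eq:klnorm can be written as a small standalone function. The sketch below is illustrative only (the analyses reported here rely on the `KL_mods` helper used in `run_kl`), and the function name `kl_gaussian_bits` and the example values are my own.

```{r kl-gaussian-example, eval=FALSE, echo=TRUE}
# Illustrative implementation of equation @eq:klnorm: KL divergence (in bits)
# of a marginal Gaussian (mu_m, sigma_m) from a group-specific Gaussian
# (mu_g, sigma_g). Works for the univariate and multivariate cases alike.
kl_gaussian_bits <- function(mu_g, sigma_g, mu_m, sigma_m) {
  mu_g <- as.matrix(mu_g); sigma_g <- as.matrix(sigma_g)
  mu_m <- as.matrix(mu_m); sigma_m <- as.matrix(sigma_m)
  d <- nrow(sigma_g)
  sigma_m_inv <- solve(sigma_m)
  nats <- 0.5 * (sum(diag(sigma_m_inv %*% sigma_g)) +
                 t(mu_m - mu_g) %*% sigma_m_inv %*% (mu_m - mu_g) -
                 d + log(det(sigma_m) / det(sigma_g)))
  as.numeric(nats) / log(2)  # convert from nats to bits
}

# Hypothetical group-specific vs. marginal /i/ F1xF2 distributions
# (means and covariances invented for illustration):
kl_gaussian_bits(mu_g = c(300, 2400), sigma_g = diag(c(40, 150)^2),
                 mu_m = c(350, 2300), sigma_m = diag(c(60, 200)^2))
```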
The math is the same for the univariate special case, as with VOT. ```{r kl-helpers, cache=TRUE} #' Compute KL relative to reference grouping #' #' @param models a tbl containing models for each grouping level, with columns #' `grouping` (the grouping variable), `models` (a list column with tbls of #' trained models for each category × group combination, and any additional #' columns to be used to match up the reference grouping with the others #' (e.g., cue format, place of articulation for VOT, etc.) #' @param reference_grouping the grouping level that's used to calculate the KL #' divergence of the other levels (e.g., Marginal). #' @param category_col quoted name of the column that has the phonetic category #' in `models` #' #' @return a tbl with columns `grouping`, `group`, `reference_group`, the #' category col from the input, any other columns from the input, and `KL`. #' run_kl <- function(models_grouped, reference_grouping, category_col) { assert_that(has_name(models_grouped, "models"), has_name(models_grouped, "grouping")) assert_that(reference_grouping %in% models_grouped[['grouping']]) models_grouped %>% filter(grouping == reference_grouping) %>% mutate(reference_models = map(models, rename_, reference_group = 'group')) %>% select(-grouping, -models) %>% left_join(models_grouped %>% filter(grouping != reference_grouping)) %>% mutate(kl_from_reference = map2(models, reference_models, ~ left_join(.x, .y, by=category_col)%>% mutate(KL = map2_dbl(model.x, model.y, KL_mods)) %>% select_('group', 'reference_group', category_col, 'KL') ) ) %>% unnest(kl_from_reference) } ``` ```{r vowel-kl, cache=TRUE, dependson=c('kl-helpers', 'vowel-data')} vowel_kl <- vowel_data_grouped %>% train_models_grouped(category_col = 'Vowel', cue_cols = c('F1', 'F2')) %>% run_kl(category_col = 'Vowel', reference_grouping = 'Marginal') %>% filter(!is.na(KL)) # NAs come from one vowel ('uh') that # only has one token for one talker. ``` ```{r heald-vowel-kl, cache=TRUE, dependson=c('kl-helpers', 'heald-data')} vowels_heald_kl <- heald_models %>% run_kl(reference_grouping = "Marginal", category_col = "Vowel") ``` ```{r vot-kl, cache=TRUE, dependson=c('kl-helpers', 'vot-data')} vot_kl <- vot_by_place_grouped %>% train_models_grouped(category_col = 'voicing', cue_cols = 'vot') %>% run_kl(reference_grouping = 'Marginal', category_col = "voicing") ``` ### Hypothesis testing by permutation test To assess whether any particular KL divergence is different from chance, I re-ran the same analysis on 1000 random permutations of the dataset, where talkers are randomly re-assigned to groups (or tokens to talkers, for talker as a grouping variable). The permutation test $p$ value for a particular measure is the proportion of these randomly permuted data sets that led to a value of that measure that was as high or higher than the real assignment of talker to groups (or tokens to talkers). There are a number of advantages to this technique for directly estimating the distribution of the test statistic (informativity or KL divergence) under the null hypothesis that the assignment of talkers to groups does not matter. First, it controls for the differences in group size. For instance, in the NSP, there are 6 talkers per dialect, but 24 per gender. Fewer talkers means that there will be fewer tokens per category, which leads to more variable estimates and higher average diversion from the marginal distributions. Second, it accounts for the intrinsic asymmetry in KL divergence, which is always greater than 0. 
Third, it is flexible enough to support arbitrary test statistics, including the grouping variable-level summary score (average over groups), single-group score (averaged over phonetic categories), and individual group-category scores (e.g., particular dialect-vowel combinations). ```{r kl-permutation, cache=TRUE, dependson=c('kl-helper', 'vot-data')} ## need to control somehow for the number of talkers. use random permutations ## of groupings, by shuffling talkers #' Shuffle levels of a column in a tibble #' #' @param tbl Tibble with column to shuffle #' @param group Quoted name (string) of group column to shuffle #' @return tbl, with values of group column permuted shuffle_group <- function(tbl, group) { labels <- groups(tbl) group <- sym(group) group_labels <- tbl %>% group_by(!! group, add=TRUE) %>% summarize() %>% ungroup() %>% mutate(!! group := sample(!! group)) tbl %>% ungroup() %>% select(-one_of(map_chr(labels, as_string))) %>% left_join(group_labels, by=quo_name(group)) %>% group_by( !!! labels) } set.seed(1001) n_perm <- 1000 batch_n <- function(x, n) { split(x, ceiling(seq_along(x) * n / length(x))) } # now run the whole KL pipeline, shuffling the talkers... # can't do this easily on Heald data because don't have the underlying data... vot_kls_perm <- map( batch_n(seq_len(n_perm), 20), ~ future(map(.x, function(perm_iter) { vot_by_place_grouped %>% filter(grouping != "Talker") %>% mutate(data = map(data, shuffle_group, group="Talker")) %>% train_models_grouped(category_col = "voicing", cue_cols = "vot") %>% run_kl(reference_grouping = "Marginal", category_col = "voicing") %>% mutate(perm_iter = perm_iter) }))) %>% map(values) %>% flatten() %>% bind_rows() vowel_kls_perm <- map( seq_len(n_perm), ~ future(vowel_data_grouped %>% filter(grouping != "Talker") %>% mutate(data = map(data, shuffle_group, group="Talker")) %>% train_models_grouped(category_col = 'Vowel', cue_cols = c('F1', 'F2')) %>% run_kl(category_col = 'Vowel', reference_grouping = 'Marginal') %>% mutate(perm_iter = .x))) %>% map(values) %>% flatten() %>% bind_rows() %>% filter(!is.na(KL)) # NAs come from one vowel ('uh') that # only has one token for one talker. kl_perm_summary <- bind_rows(vot_kls_perm, vowel_kls_perm) %>% group_by(cues, contrast, grouping, perm_iter) %>% summarise(KL = mean(KL)) ``` ```{r talker-kl-perm, cache=TRUE, dependson=c('kl-helper', 'vot-data')} #' Shuffle grouping variables within groups defined by another column #' #' @param tbl Grouped tibble #' @param within Quote name (string) of column to define groups within which the #' native groups of tbl will be shuffled. #' @return a copy of \code{tbl} with all grouping columns shuffled within each unique #' value of \code{within}. #' shuffle_groups_within <- function(tbl, within) { gs <- groups(tbl) tbl %>% group_by(!! sym(within)) %>% mutate_at(map_chr(gs, as_string), sample) %>% group_by(!!! gs) } # permutate tokens within phonemes to do talker. 
vowel_marginal_models <- vowel_data_grouped %>% filter(grouping == "Marginal") %>% train_models_grouped(category_col="Vowel", cue_cols=c("F1", "F2")) plan(multicore) set.seed(1002) vowel_kls_talker_perm <- map(seq_len(n_perm), ~ future(vowel_data_grouped %>% filter(grouping == "Talker") %>% mutate(data = map(data, shuffle_groups_within, within="Vowel")) %>% train_models_grouped(category_col="Vowel", cue_cols=c("F1", "F2")) %>% bind_rows(vowel_marginal_models) %>% run_kl(reference_grouping = "Marginal", category_col = "Vowel") %>% mutate(perm_iter = .x))) %>% map(values) %>% flatten() %>% bind_rows() vot_marginal_models <- vot_by_place_grouped %>% filter(grouping == "Marginal") %>% train_models_grouped(category_col="voicing", cue_cols="vot") vot_kls_talker_perm <- map( seq_len(n_perm), ~ future(vot_by_place_grouped %>% filter(grouping == "Talker") %>% mutate(data = map(data, shuffle_groups_within, within="voicing")) %>% train_models_grouped(category_col = "voicing", cue_cols = "vot") %>% bind_rows(vot_marginal_models) %>% run_kl(reference_grouping = "Marginal", category_col = "voicing") %>% mutate(perm_iter = .x))) %>% map(values) %>% flatten() %>% bind_rows() ``` ```{r kl-perm-summary} kl_perm_summary <- bind_rows(vot_kls_perm, vowel_kls_perm, vot_kls_talker_perm, vowel_kls_talker_perm) %>% group_by(cues, contrast, grouping, perm_iter) %>% summarise(KL = mean(KL, na.rm=TRUE)) ``` ## Results {#sec:kl-results} I first report and discuss the broad patterns for the informativity of different grouping variables in the three vowel and stop voicing databases described above. In short, the results show first that there is more talker variability in vowels than stop voicing, and is reasonably consistent across two vowel corpora. Second, this talker variability is also _structured_ for vowels: grouping talkers according to gender, dialect, or the combination thereof leads to more informative groupings than random groupings of the same number of talkers. The same is not true for voicing. Finally, I illustrate how the proposed informativity measure captures well-documented dialectal variation in vowels. ```{r vowel-vot-kl-plot, dependson=c("grouping-and-colors", "heald-vowel-kl", "vot-kl", "vowel-kl", "kl-perm-summary"), fig.width=8.6, fig.height=6, fig.cap="Socio-indexical variables are more informative about cue distributions for vowel formants [HN15, @Heald2015; NSP, @Clopper2006b] than for stop voicing (VOT), even after Lobanov normalization. On top of this, more specific groupings (like Talker and Dialect+Gender) are more informative than broader groupings (Gender). Each open point shows one group (e.g., _male_ for _Gender_), while shaded points show the average over groups. 
Gray violins show the null distribution of average informativity (KL estimated from 1000 datasets with randomly permuted group labels), and stars show significance of the variable's average KL with respect to this null distribution (`*`: $p<0.05$, `**`: $p<0.01$, `***`: $p<0.001$)."} kl_by_group <- bind_rows(vot_kl, vowel_kl, vowels_heald_kl) %>% group_by(contrast, cues, grouping, group) %>% summarise(KL = mean(KL)) kl_summary <- kl_by_group %>% group_by(contrast, cues, grouping) %>% summarise(KL=mean(KL)) %>% left_join(kl_perm_summary %>% group_by(contrast, cues, grouping, perm_iter) %>% summarise(KL=mean(KL)), by=c("contrast", "cues", "grouping"), suffix = c("", "_perm")) %>% group_by(contrast, cues, grouping) %>% summarise(KL = unique(KL), perm_p = mean(KL <= KL_perm), n_perm = length(KL_perm)) %>% mutate(perm_p_stars = ifelse(is.na(perm_p), NA, p_val_to_stars(perm_p))) %>% ungroup() %>% prettier_grouping() %>% pretty_contrast() kl_by_group %<>% ungroup() %>% prettier_grouping() %>% pretty_contrast() cues_plus_contrast <- function(d) mutate(d, cues_contrast = paste(cues, contrast, sep="\n")) kl_summary %>% cues_plus_contrast() %>% ggplot(aes(x=grouping, y=KL, color=grouping)) + geom_violin(data = kl_perm_summary %>% ungroup() %>% prettier_grouping() %>% pretty_contrast() %>% cues_plus_contrast(), color=NA, fill="black", alpha=0.2, scale="width") + geom_quasirandom(data = kl_by_group %>% cues_plus_contrast, alpha=0.2, size=2) + geom_point(size=4) + geom_text(aes(label = perm_p_stars), nudge_y = 0.25, color="black") + facet_grid(.~cues_contrast, scales='free', space='free') + # facet_grid(.~cues+contrast, scales='free', space='free') + rotate_x_axis_labs() + labs(x = "", y = 'KL Divergence of cue distributions\nfrom marginal (bits)', color = "") + theme(plot.title=element_text(hjust=0)) + scale_color_grouping() + scale_y_continuous(breaks=seq(0,6)) + theme(legend.position=c(1, .95), legend.justification=c("right", "top"), legend.background=element_rect(fill=gray(1, alpha=0.5))) + guides(color = guide_legend(label.position="left", label.hjust=1)) ``` Figure {@fig:vowel-vot-kl-plot} shows the informativity of gender, dialect, and talker identity, as measured by the average KL divergence between cue distributions of each phonetic category conditioned on these factors from the overall (marginal) cue distributions. I make three observations. ```{r vowel-kl-same-dialect, cache=TRUE, dependson=c("nsp-data")} nsp_kl_same_dialect <- vowel_data_grouped %>% filter(grouping %in% c("Talker", "Dialect")) %>% train_models_grouped(category_col="Vowel", cue_cols=c("F1", "F2")) %>% run_kl(category_col="Vowel", reference_grouping="Dialect") %>% left_join(nsp_vows %>% group_by(Talker, Dialect) %>% summarise(), by=c(group="Talker")) %>% filter(reference_group == Dialect) talker_summary <- . 
%>% filter(grouping=="Talker") %>% group_by(group) %>% summarise(KL=mean(KL)) %>%
  do(daver::boot_ci(.$KL, function(d,i) mean(d[i], na.rm=TRUE))) %$%
  sprintf("(%.1f bits, 95%% CI [%.1f--%.1f])", observed, ci_lo, ci_high)

talker_kl_strs <- list(same=nsp_kl_same_dialect, marg=vowel_kl, heald=vowels_heald_kl) %>%
  map_chr(talker_summary)
```

First, there are major differences in talker variability between vowels and stop consonant voicing: talker identity is an order of magnitude more informative about vowel distributions than about VOT distributions.[^two-cues] That is, knowing a talker's identity provides substantially more information about their vowel formant frequency distributions than about their VOT distributions. This quantitatively confirms the qualitative understanding that there is less talker variability in VOT than in formant frequencies [e.g., @Allen2003; @Lisker1964; vs. @Peterson1952; @Hillenbrand1995]. Strikingly, the _most_ informative variable for VOT---talker identity---is roughly as informative as the _least_ informative variable for vowels (Gender, for Lobanov-normalized F1×F2).

[^two-cues]: This is true even when considering just F1 or F2 in isolation. The KL divergence for the joint distribution of two independent cues is the sum of the KL divergences for the individual cues. For vowels, the F1×F2 informativity is approximately equal to the sum of the individual F1 and F2 informativities.

Across the two vowel corpora, the level of talker variability appears to be lower in the HN15 data than in the NSP data, though not as low as in the VOT data. One possible explanation of this discrepancy is that the HN15 talkers are all from the same dialect region, while @Clopper2006b intentionally recruited talkers to demonstrate dialect variability. And, indeed, individual NSP talkers' distributions diverge less from the corresponding dialect distributions `r talker_kl_strs["same"]` than they do from the marginal distributions `r talker_kl_strs["marg"]`. But this divergence is still substantially larger than the average for the HN15 talkers `r talker_kl_strs["heald"]`, suggesting that this is not the only explanation. Another possibility is that the smaller number of tokens from each NSP talker means that the individual talker distribution estimates are noisier. Unfortunately, without access to the underlying single-token F1×F2 values for the HN15 talkers, it is difficult to assess this. The informativity of gender, on the other hand, is similar across the two datasets, which suggests that datasets of this size are sufficient to obtain replicable estimates of the informativity of gender.

Second, with one notable exception, I find that grouping variables with fewer talkers per group are more informative than groupings with more talkers per group: talker identity is the most informative, followed by (for NSP vowels) dialect+gender and dialect, then gender and age. The one exception is that for _un_-normalized formants, gender is substantially more informative than dialect, even though it is one of the most general grouping variables, with each group including half the talkers. This is to be expected: gender differences (whether stylistic or physiological, like vocal tract length) change formant frequencies for all vowels by large amounts [@Johnson2006], while dialect variation is limited to certain dialect-vowel combinations [@Clopper2005; @Labov2006].

My third observation is about the effect of normalization. As expected, Lobanov normalization substantially reduces the informativity of gender.
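
For reference, Lobanov normalization simply z-scores each talker's formant values within each formant dimension. A minimal sketch is shown below; the column names follow the NSP data, but this is an illustration rather than the normalization pipeline actually used for the analyses here.

```{r lobanov-example, eval=FALSE, echo=TRUE}
# Minimal sketch of Lobanov normalization: z-score F1 and F2 within each
# talker, so every talker's formants have mean 0 and SD 1 across their
# whole vowel space.
library(dplyr)

lobanov_normalize <- function(d) {
  d %>%
    group_by(Talker) %>%
    mutate(F1_lobanov = (F1 - mean(F1)) / sd(F1),
           F2_lobanov = (F2 - mean(F2)) / sd(F2)) %>%
    ungroup()
}

# e.g., lobanov_normalize(nspvowels::nsp_vows)
```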
Reducing the informativity of gender is, after all, one of the *purposes* of normalization---removing differences between male and female vowel distributions that are due to overall shifts in formant frequencies. However, gender still carries some information about Lobanov-normalized vowel distributions. This is in line with previous observations that Lobanov normalization---while among the most effective normalizations---is not perfect [e.g., @Escudero2007; @Flynn2011]. Additionally, there is still substantial talker variability even in normalized vowel distributions. This, together with the non-zero informativity of gender, supports arguments against (Lobanov) normalization as the sole mechanism by which listeners overcome talker variability [see @Johnson2005 for discussion]. Finally, even for normalized vowel distributions, the informativity of dialect and gender together is still higher than the informativity of dialect alone. This suggests that dialect differences themselves are modulated by gender [as noted by @Clopper2005].

### Informativity and dialect variation

One advantage of the proposed measure of informativity is that it can assess whether a grouping variable is equally informative about all categories, or whether a particular grouping is especially informative about specific types of categories. Figure {@fig:vowel-vot-kl-plot} illustrates this for vowel categories compared to stop categories. In this section, I show how the same approach can be used to investigate differences in the informativity of a grouping factor for different _vowels_. This provides a principled quantitative measure of, for example, vowel-specific dialectal variation. If factors like dialect are differentially informative about the distributions of some vowels versus others, then listeners may track dialect-specific distributions for some vowels but not for others.

`r label("nsp-dialect2")` As Figure {@fig:vowel-kl-by-category} shows, informativity varies considerably by vowel. Dialect (and Dialect+Gender) is particularly informative for /ɑ/, /æ/, /ɛ/, and /u/, vowels with distinctive variants in at least one of the dialect regions from the NSP [see @Clopper2005 for a summary of variation in American English vowels across these dialect regions]. These results are consistent with what has been noted in the sociolinguistic literature [e.g., @Labov2006]: /ɑ/ is merged with /ɔ/ in some regions, /æ/, /ɛ/, and /ɑ/ participate in the Northern Cities Chain Shift, and /u/ is fronted in some regions [and in others only by female talkers; @Clopper2005].

```{r vowel-kl-by-category, dependson=c("grouping-and-colors"), fig.width=7, fig.height=3.45, fig.cap='Individual vowels vary substantially in the informativity of grouping variables about their cue distributions. Only normalized F1×F2 is shown to emphasize dialect effects. Large dots show the average over dialects (+genders), while the small dots show individual dialects (+genders) (see Figure {@fig:vowel-kl-by-dialect} for detailed breakdown of individual dialect effects).
The grey violins show the vowel-specific null distributions of the averages, estimated based on 100 datasets with randomly permuted dialect (+gender) labels, and stars show permutation test p value (proportion of random permutations with the same or larger KL divergence), with false discovery rate correction for multiple comparisons [@Benjamini1995].'} vowel_kl_by_dialect_vowel <- vowel_kl %>% filter(str_detect(cues, 'Lobanov'), str_detect(grouping, "Dialect")) %>% prettier_grouping() %>% group_by(cues, Vowel, grouping, group) %>% summarise(KL = mean(KL)) vowel_kl_perm_by_dialect_vowel <- bind_rows(vowel_kls_perm, vowel_kls_talker_perm) %>% filter(str_detect(cues, "Lobanov"), str_detect(grouping, "Dialect")) %>% prettier_grouping() %>% group_by(cues, Vowel, grouping, perm_iter) %>% summarise(KL=mean(KL, na.rm=TRUE)) vowel_kl_by_dialect_vowel_summary <- vowel_kl_by_dialect_vowel %>% summarise(KL = mean(KL)) %>% left_join(vowel_kl_perm_by_dialect_vowel, by = c("cues", "Vowel", "grouping"), suffix = c("", "_perm")) %>% group_by(cues, grouping, Vowel) %>% summarise(perm_p = (sum(KL <= KL_perm) + 0.5) / (length(KL) + 1), KL = mean(KL)) %>% mutate(perm_p_fdr = p.adjust(perm_p, method="fdr"), perm_p_lt = p_val_to_less_than(perm_p_fdr, cutoffs = c(0.05, 0.01)), perm_p_star = p_val_to_stars(perm_p_fdr)) vowel_kl_by_dialect_vowel %>% ggplot(aes(x=Vowel, y=KL, color=grouping)) + geom_point(alpha=0.2, position = position_jitter(w=0.2), show.legend=FALSE) + geom_point(stat='summary', fun.data=mean_cl_boot, size=2, ## position = position_dodge(w=0.5), show.legend=FALSE) + geom_text(data= vowel_kl_by_dialect_vowel_summary, aes(label = perm_p_star), color="black", nudge_y = 0.1) + labs(x = 'Vowel', y = 'KL Divergence (bits)') + ggtitle("Informativity by vowel", subtitle="Lobanov-normalized F1×F2") + theme(plot.title=element_text(hjust=0)) + geom_violin(data=vowel_kl_perm_by_dialect_vowel, scale="width", fill="black", color=NA, alpha=0.2) + facet_grid(.~grouping) + ylim(0, NA) + scale_color_grouping() ``` Figure @fig:vowel-kl-by-dialect shows the informativity by vowel and dialect individually. This shows that dialects do indeed vary in how informative they are, both overall (left) and by vowel (right). Some of this variability corresponds to known patterns of dialect variability. In particular, talkers from the North dialect region produce vowels---/æ/ and /ɑ/ in particular---with formant distributions that deviate markedly more from the marginal distributions (across all dialects) than any of the other dialects. Both of these vowels participate in the Northern Cities Shift, and in a sense are foundation of this shift, being at the root of the Northern Cities Shift's implicational hierarchy [@Labov2006; @Clopper2005]. `r label("implicational-hiearchy")` The Mid-Atlantic /ɑ/ is, like the Northern /ɑ/, non-merged with /ɔ/ [@Clopper2005] and hence deviates from the marginal /ɑ/ substantially. New England talkers produce a low-variance /u/ distribution with a lower mean F1 than other dialects, which may reflect a lack of /u/-fronting and is consistent with a conservative /u/ in New England [@Labov2006].[^rory] [^rory]: Thanks to Rory Turnbull for suggesting this interpretation. 
```{r vowel-kl-dialect-summaries, dependson=c("grouping-and-colors")} dialect_lob_kl_perm_vals <- vowel_kls_perm %>% filter(grouping == "Dialect", str_detect(cues, "Lobanov")) %>% pull(KL) vowel_dialect_combos_kl_summary <- vowel_kl %>% filter(grouping=='Dialect', str_detect(cues, "Lobanov")) %>% select(group, Vowel, KL) %>% mutate(perm_p = map_dbl(KL, ~ mean(.x <= dialect_lob_kl_perm_vals)), stars = p_val_to_stars(perm_p), perm_p_fdr = p.adjust(perm_p, "fdr"), stars_fdr = p_val_to_stars(perm_p_fdr)) %>% group_by(Vowel) %>% mutate(min_perm_p_fdr = min(perm_p_fdr)) dialect_mean_kl_perm <- vowel_kls_perm %>% filter(grouping == "Dialect", str_detect(cues, "Lobanov")) %>% group_by(cues, contrast, grouping, group, perm_iter) %>% summarize(KL_perm = mean(KL)) dialect_mean_kl_perm_p <- dialect_mean_kl_perm %>% left_join(vowel_kl %>% group_by(cues, contrast, grouping, group) %>% summarize(KL=mean(KL))) %>% summarize(perm_p = (sum(KL <= KL_perm)+1)/(length(KL)+1), KL = mean(KL)) %>% mutate(perm_p_fdr = p.adjust(perm_p, "fdr"), stars = daver::p_val_to_stars(perm_p_fdr)) ``` `r label("r1-svs")` The only particularly high divergence identified as significant by permutation test that does not correspond to known sociolinguistic variants is Mid-Atlantic /e/, which is slightly higher and fronter than the marginal distribution. There are also well-documented dialect effects that appear to be missing from these results. For instance, none of the individual vowels involved in the Southern Vowel Shift---/i, ɪ, e, ɛ, o, u/---diverge from the marginal distributions reliably. However, as Figure @fig:vowel-kl-by-dialect (left) shows, the entire vowel space of Southern speakers _does_ diverge from the marginal distributions, suggesting that even though the individual vowels do not differ dramatically from marginal, the combination of subtle differences is in fact reliable across talkers. Moreover, the individual vowels that diverge the most for Southern speakers are /ɛ, u, ɔ, o, e/, all of which (except /ɔ/, which likely reflects the lack of the caught-cot merger) are associated with the Southern Vowel Shift by @Clopper2005 and all of which are significant before correcting for multiple comparisons (except for /ɛ/, $`r filter(vowel_dialect_combos_kl_summary, group=="South", Vowel=="ɛ") %>% pull(perm_p) %>% p_val_to_less_than()`$). Also, the lack of reliable evidence for individual Southern Vowel Shifts is consistent with the results from @Clopper2005 using the same data: mean F1 and F2 for Southern speakers for these vowels were not found to consistently differ significantly from the other dialects or the overall means (although there were some combinations of gender and dialect that did yield significant differences). ```{r vowel-kl-by-dialect, dependson=c("vowel-kl-dialect-summaries"), fig.width=10, fig.height=4.5, fig.cap='Breaking down the overall informativity of dialect by individual dialects (left) and dialect-vowel combinations (right). Some dialects are more informative about Lobanov-normalized vowel distributions than random groupings of the same number of talkers (grey violins), but some are not (at least in the current sample of talkers). Likewise for individual vowels within dialects. Moreover, dialects be informative on average but not have any individual vowels that are informative alone (e.g., South), and vice-versa (e.g., Midland). 
Stars show $p$ values from permutation test (`*`: $p<0.05$, `**`: $p<0.01$, `***`: $p<0.001$) corrected for false-discovery rate across all dialects/dialect-vowel combinations [@Benjamini1995].'} p_dialect_vowel <- vowel_dialect_combos_kl_summary %>% ggplot(., aes(x=group, y=KL)) + geom_line(aes(group=Vowel, color=Vowel, alpha=min_perm_p_fdr < 0.05)) + geom_text(data = . %>% filter(perm_p_fdr < 0.05), aes(label = Vowel, color=Vowel), show.legend=FALSE, nudge_x = 0.28, nudge_y=0.05) + geom_text(aes(label = stars_fdr, color=Vowel), nudge_y=0.05, show.legend=FALSE) + rotate_x_axis_labs() + scale_alpha_manual(values=c(0.3, 1), guide=FALSE) + labs(x = NULL, y = NULL) + ggtitle("Informativity by dialect and vowel") + theme(plot.title=element_text(hjust=0)) + geom_violin(data = vowel_kls_perm %>% filter(grouping == "Dialect", str_detect(cues, "Lobanov")), aes(group=group), color=NA, fill="black", alpha=0.2) + lims(y=c(0,2)) + theme(axis.text.y = element_blank(), axis.line.y = element_blank(), axis.title.y= element_blank(), axis.ticks.y= element_blank()) + # c11_2 from notebook: # distinguishable_colors(11, lchoices = [60, 70], cchoices = [100], hchoices = linspace(0, 300, 20)) scale_color_manual(values = c("#FF0095", "#00C928", "#00B0FF", "#FF8500", "#A39200", "#00D2C7", "#7A78FF", "#FF1D3A", "#00B37B", "#00D2F7", "#FF4D00")) p_dialect <- ggplot(data = dialect_mean_kl_perm, aes(x=group, y=KL)) + geom_violin(data = dialect_mean_kl_perm, aes(group=group, y=KL_perm), color=NA, fill="black", alpha=0.2) + geom_line(data = dialect_mean_kl_perm_p, color="black", group=1) + geom_text(data = dialect_mean_kl_perm_p, aes(label=stars), color="black", nudge_y = 0.05) + lims(y=c(0,2)) + labs(x = NULL, y = 'KL divergence (bits)') + ggtitle("Informativity by dialect", subtitle="Lobanov-normalized F1×F2") + theme(plot.title=element_text(hjust=0)) + rotate_x_axis_labs() plot_grid(p_dialect, p_dialect_vowel, rel_widths = c(1, 1.1), axis="l", align="hv") ``` This asymmetry in informativity across both dialects and vowels raises the question of how listeners adapt to variation across categories and cue dimensions. All else being equal, a listener should be more confident in their prior beliefs about a category that varies less across talkers, and hence adapt less flexibly [@Kleinschmidt2015]. But it is not clear at what level listeners track variability for the purpose of determining how quickly to adapt. For instance, as we have seen, vowels overall vary substantially more across talkers than stop categories, but there are differences in how much individual vowels vary. It remains to be seen whether listeners adapt to all vowels with the same degree of flexibility, or are sensitive to these vowel-specific differences in cross-talker variability. ## Discussion `r label("discussion1")` The measure of informativity I have proposed here quantifies the _amount_ and _structure_ of talker variability using an information-theoretic measure of how much talker- or group-specific cue distributions diverge from the overall (marginal) distributions. This measure allows talker variability for different phonetic categories, and even different cues, to be compared directly. As a proof of concept, the results here quantify previous qualitative findings[^qualitative] that in American English there is an order of magnitude less talker variability in the realization of word-initial stop voicing than in vowels. 
Moreover, there are qualitative differences in the _structure_ of this variability: gender is no more informative about VOT distributions than random groupings of talkers of the same size, while gender-specific F1×F2 distributions are reliably more informative than random groupings.

Informativity also allows fine-grained investigation of dialect variation. The same measure can be applied at the level of individual phonetic categories (e.g., vowels, Figure @fig:vowel-kl-by-category), groups (e.g., a particular dialect), or even particular combinations of the two (as in Figure @fig:vowel-kl-by-dialect). This measure takes into account the entire distribution of cues, and so it is more comprehensive than standard statistical techniques like regression or ANOVA, which usually compare the _mean_ values of particular cues across groups or categories [for a comparable analysis of the NSP data, see @Clopper2005].

The usefulness of this measure does not come at the expense of grounding in first principles: it corresponds directly to the amount of information that a listener leaves on the table if they ignore a grouping variable (including talker identity) and treat all tokens of a phonetic category as generated from the same underlying distribution. Ideal listener models [@Clayards2008; @Norris2008; @Feldman2009] identify knowledge of these distributions as a fundamental constraint on accurate and efficient speech perception. Furthermore, the ideal adapter model [@Kleinschmidt2015] motivates the tracking of talker- or group-specific cue distributions as what allows listeners to generalize effectively from previous experience: if a grouping variable like talker identity or gender is not informative about cue distributions, then there is little possible benefit to tracking group-specific distributions. However, just because a grouping variable _is_ informative about cue distributions does not necessarily mean that tracking those group-specific distributions leads to any benefit for recognizing a talker's intended category. This motivates the notion of _utility_, investigated in Study 2.

[^qualitative]: I refer to previous evidence for more talker variability in vowels than stop voicing as "qualitative" because no attempt has been made to measure talker variability in a directly comparable way across the two systems, even though there have been quantitative measurements of talker variability in each system.

# Study 2: How _useful_ are socio-indexical groups for recognizing vowels and stop voicing?

`r label("r2-intuitive-utility")` The results of Study 1 show that socio-indexical variables like age, gender, dialect, and talker identity are informative about phonetic cue distributions. That is, the category-specific distributions of acoustic-phonetic cues are reliably different for different values of at least some socio-indexical variables. However, these differences in cue distributions do not necessarily correspond to differences in the ability to recover a talker's intended phonetic category. Even if there is some structure in talker variability for listeners to learn, that learning might not be useful for speech recognition. This motivates the notion of _utility_ that I develop and explore in Study 2. Where informativity concerns how well a listener could probabilistically predict the _cues_ themselves, utility measures how well a listener could use those cue distributions to infer a talker's intended phonetic _category_.
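
To make this distinction concrete before turning to the methods, consider a toy simulation (a sketch with invented parameters, not drawn from the corpora analyzed here): if one group of talkers shifts its entire VOT range by a constant offset, the grouping is informative about the cue distributions, but classifying voicing with marginal versus group-specific distributions yields nearly identical accuracy, because the voiced/voiceless separation dwarfs the offset.

```{r utility-toy-example, eval=FALSE, echo=TRUE}
# Toy example: a grouping can be informative about cue distributions without
# adding much utility for recognition. All parameters are invented.
library(tidyverse)
set.seed(1)

simulate_group <- function(n, offset) {
  tibble(voicing = rep(c("voiced", "voiceless"), each = n),
         vot = c(rnorm(n, mean = 10 + offset, sd = 10),
                 rnorm(n, mean = 60 + offset, sd = 15)))
}

# "Male" talkers produce all VOTs 10 ms longer than "female" talkers
toy <- bind_rows(simulate_group(500, offset = 0)  %>% mutate(gender = "f"),
                 simulate_group(500, offset = 10) %>% mutate(gender = "m"))

# Informative: gender-specific category means differ from the marginal means
toy %>% group_by(voicing, gender) %>% summarise(mean_vot = mean(vot))

# ...but low utility: a simple maximum-likelihood classifier is about equally
# accurate whether it uses marginal or gender-specific category distributions
classify_acc <- function(test, train) {
  params <- train %>% group_by(voicing) %>% summarise(m = mean(vot), s = sd(vot))
  lik <- sapply(seq_len(nrow(params)),
                function(i) dnorm(test$vot, params$m[i], params$s[i]))
  mean(params$voicing[max.col(lik)] == test$voicing)
}

males <- filter(toy, gender == "m")
classify_acc(males, toy)    # trained on marginal (all-talker) distributions
classify_acc(males, males)  # trained on gender-specific distributions
```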
A socio-indexical variable must be informative in order to be more useful than the overall (marginal) distributions, but the converse is not necessarily true. For example, if talkers vary in a way that does not lead the marginal distributions of different phonetic categories to overlap much more than the talker-specific distributions do, then the inferences that an ideal listener would draw from the marginal distributions are essentially the same as those drawn from the talker-specific distributions.

## Methods

The _utility_ of a socio-indexical grouping variable is defined in terms of how often an ideal listener would successfully recognize a talker's intended category, given cue distributions estimated from a particular group of talkers ($g$). Specifically, I use the posterior probability of the talker's intended category $c_\mathrm{intended}$ given the cue value $x$ that they actually produced.[^decision-rule] This, in turn, depends on the cue distributions produced by group $g$, as described by Bayes' rule:
$$ p(c=c_\mathrm{intended} | x, g) \propto p(x|c=c_\mathrm{intended}, g)p(c=c_\mathrm{intended}) $$
Bayes' rule can, with some algebra, be restated as an equality of odds:
$$ \frac{p(c=c_\mathrm{intended} | x, g)}{p(c \ne c_\mathrm{intended} | x, g)} = \frac{p(x|c=c_\mathrm{intended}, g)}{p(x|c \ne c_\mathrm{intended}, g)} \times \frac{p(c=c_\mathrm{intended})}{p(c \ne c_\mathrm{intended})} $$
Like the standard form of Bayes' rule, this has a straightforward interpretation: the _posterior odds_ of correctly recognizing $c=c_\mathrm{intended}$ are the _prior odds_ times the likelihood ratio, which expresses how much more likely it is that $x$ was generated by the true category $c_\mathrm{intended}$ than by all the other categories combined. If the likelihood ratio is greater than 1, then we have gained evidence in favor of the true category; if it is less than 1, we have gained evidence in favor of an erroneous category. This interpretation holds regardless of whether there is contextual information that favors one category over another, which would only change the prior odds. It is also not sensitive to the number of categories, which likewise manifests only in the prior odds. Moreover, if we take the logarithm of both sides, the prior log-odds and the log-likelihood ratio _add_ together to produce the posterior log-odds.[^additional-cues]

Thus, the log-likelihood ratio
$$ \log\left(\frac{p(x|c=c_\mathrm{intended}, g)}{p(x|c \ne c_\mathrm{intended}, g)}\right) $$
provides a measure of the information gained[^information] about what a talker is trying to say by interpreting the cues $x$ using the category-specific cue distributions of group $g$, over and above what the prior alone provides. It can be calculated from the posterior probability of the correct (talker's intended) category relative to chance:
\begin{align} \log\left(\frac{p(x|c=c_\mathrm{intended}, g)}{p(x|c \ne c_\mathrm{intended}, g)}\right) &= \log\left(\frac{p(c=c_\mathrm{intended} | x, g)}{p(c \ne c_\mathrm{intended} | x, g)}\right) - \log\left( \frac{p(c=c_\mathrm{intended})}{p(c \ne c_\mathrm{intended})} \right) \\ &= \log\left(\frac{\mathrm{accuracy}}{1-\mathrm{accuracy}}\right) - \log\left( \frac{\mathrm{chance}}{1-\mathrm{chance}} \right) \end{align}

[^decision-rule]: An ideal observer's actual _responses_ (and thus its accuracy) in, e.g., a phonetic classification task additionally depend on the decision rule (or loss function).
However, any reasonable decision rule will be constrained by the amount of evidence in favor of the talker's intended category, and so the posterior probability of that category is a reasonable proxy for the current purpose. Also, note that using a winner-take-all decision rule with likelihoods derived from normal distributions is equivalent to quadratic discriminant analysis, as used, for instance, by @Adank2004 in assessing the effectiveness of various vowel normalization techniques.

[^additional-cues]: This is true even in the presence of additional (independent) cues.

[^information]: This quantity is not exactly information in the information-theoretic sense, because it is not weighted by the probability of observing cue $x$ under the true category model.

By comparing the information gained from different groups' distributions, we can estimate the _utility_ of these different groupings. For instance, we can ask how much additional information is gained by knowing that a talker is male by comparing the information gain from cue distributions estimated from other male talkers ($g=\mathrm{male}$) to that from all male and female talkers together ($g=\mathrm{all}$). The same approach can also address changes in the _prior_ probability of a category based on socio-indexical variables (e.g., a higher or lower frequency of voiced stops in a particular dialect).

Talker-specific cue distributions ought to provide the most information about a talker's own productions, and the marginal cue distributions (over all talkers) the least. The difference between them, though, depends on the amount of talker variability. I expect other groupings to yield information gains that are somewhat less than those from talker-specific distributions, but more than those from marginal distributions. Where exactly between these extremes a grouping falls is a measure of how much utility there is in tracking group-specific cue distributions: if a listener gains just as much information about what a talker was trying to say by using cue distributions based on other talkers of the same gender, age, dialect, etc., then there is little need to learn talker-specific cue distributions. Where the informativity of a particular grouping (Study 1) measures how much there is to learn about group-specific distributions, the utility of the grouping (Study 2) measures how much benefit a listener would gain from doing that learning.

For vowels, I classified vowel categories directly. For voicing, the only cue `r label("r1-only-cue-vot")` available in this dataset is VOT, which does not (reliably) distinguish place of articulation. Thus, I classified voicing separately for each place of articulation, and then averaged the resulting accuracies.

### Assumptions

Utility measures the maximum improvement in the accuracy of speech perception that is possible under the specific set of assumptions made in the ideal observer model. One particularly important assumption that this method makes is that the listener _knows_ the socio-indexical group of a talker. I make this assumption for two reasons. First, in many cases listeners do, in fact, have a good deal of socio-indexical information about a talker. This may come from non-linguistic cues (or world knowledge), or even from other linguistic features that the talker produces [@KleinschmidtInPress2017]. Moreover, this assumption is not inherent in the method I propose, and it is possible to simultaneously infer the intended category _and_ the socio-indexical group.
In preliminary simulations, defining utility in this way has surprisingly little effect on the results, but it makes the simulations substantially more computationally demanding.

Second, and more importantly, I define utility assuming that the socio-indexical group of a talker is known because this provides an estimate of the _in-principle_ benefit of tracking group-specific phonetic cue distributions.[^startingpoint] This is a defining feature of rational analyses of cognition [for the value of such clearly defined, in-principle bounds on performance, see also @Massaro1990].

[^startingpoint]: This benefit applies when first encountering a novel talker from a socio-indexical group, *prior to further adaptation*. In the general discussion, I return to this point and to why the utility measure might *underestimate* the benefit of implicit knowledge about group-specific category distributions, since this knowledge likely serves as the starting point for talker-specific adaptation [@Kleinschmidt2015].

### Procedure

The utility of a grouping variable (e.g., gender) is calculated by first computing the utility of that variable for each individual talker, as follows.

First, a training data set is constructed. For the NSP data, this was done by sampling three other talkers from the same group (e.g., three other male talkers). This subsampling is done to avoid biases in accuracy from group size, since groups with fewer talkers have less stable estimates of their cue distributions, and lower accuracy on average [see @James2013, Section 2.2.2]. The most specific grouping in the NSP is Dialect+Gender, which has four talkers per group; because including the talker's own test data in the training set would also artificially increase accuracy [@James2013, Section 5.1], three talkers are used to form the training set. For the other datasets, all other talkers from the same group were used for the training set, since the VOT data is (approximately) balanced by age and gender, and the HN15 data only groups talkers by gender. Based on this training data set, category-specific distributions are estimated in the same way as in Study 1, using the unbiased estimators of the mean and (co-)variance of the tokens from each phonetic category.

Second, the overall accuracy for the test talker is determined in the following way: Bayes' rule is used to compute the posterior probability of the talker's intended category for each of the tokens produced by the test talker, using the likelihood functions of each category from the training data. The mean of these posterior probabilities is the talker's overall accuracy.

Third, and finally, the accuracy $p$ for each talker is converted to utility by transforming to log-odds $\log(p/(1-p))$ and subtracting the log-odds of responding correctly by guessing uniformly, which is $\log(1/(n-1))$ if there are $n$ response options. The overall utility of the grouping variable is the mean of these talker-specific utilities.

Because the training sets are sampled at random for all groupings except Dialect+Gender, the whole procedure is repeated 100 times and averaged at the level of talker-specific utility to obtain more reliable results. For talker as a grouping variable, six-fold cross-validation was used instead: each talker's tokens were divided into six roughly equal partitions (within category), and the accuracy for each partition's tokens was determined using the other five as training data.

Bootstrap resampling was used to estimate the reliability of these estimates.
1000 simulated populations of talkers were sampled with replacement, and the average utility for each grouping variable, and the differences between them, were re-computed each time. The reliability of differences between, for example, the utility of dialect and gender can be estimated in this way by looking at how frequently the resampled populations result in a difference in utility between dialect and gender with the same sign as the real sample of talkers. This is similar to a paired $t$-test but does not assume that talkers' utilities are normally distributed. Because the vowel corpus from @Heald2015 only includes summary statistics, I computed utility based on a sample 100 F1×F2 pairs per category for each talker. Differences in the composition of these corpora mean that care must be taken in making comparisons _across_ corpora. The group-size bias is especially problematic when looking for subtle effects of groupings with small sample sizes, like dialect or dialect+gender (which contain 8 and 4 talkers per group, respectively). The subsampling procedure results in changes in accuracy of only a few percentage points, but doesn't change the overall order of magnitude. Thus, gross, qualitative comparisons across corpora are still reasonable, even if fine-grained comparisons are not. ```{r classification-helpers, cache=TRUE} ## 1. Likelihood of each token under each vowel for each dialect model #' Compute posterior vowel category conditional on group #' #' Applies classify_vowels to test data for each group_model. #' #' @param data_test test data to calculate posteriors for #' @param group_models named list of group models, each of which is a named list #' of vowel models #' @return data frame with one row per data_test row x group x vowel model, with #' added columns group_model (name of group model), vowel_model (name of vowel #' model), lhood p(x | vowel_model, group_model), posterior (p(vowel_model | #' x, group_model)). compute_category_post_given_group <- function(data_test, group_models) { group_models %>% map(~ unlist_models(., 'category')) %>% map(~ classify(data_test, ., 'category')) %>% data_frame(group_model=names(.), x=.) %>% unnest(x) %>% rename(category_model=model) } #' Combine category | group posteriors with group posteriors #' #' @param group_category_posteriors category posterior probabilities conditional #' on group, in the form of a data frame with at least columns category_model, #' group_model, and posterior (e.g., output of #' compute_category_post_given_group) #' @param group_posterior marginal group posterior probabilities, in the form of #' a data frame with columns group_model and log_posterior (e.g., output of #' compute_group_marginal_posterior) #' @return a data frame with the joint posterior of category category and group, #' in posterior and log_posterior. #' compute_joint_category_group_post <- function(group_category_posteriors, group_posteriors) { group_posteriors %>% select(group_model, group_log_posterior=log_posterior) %>% inner_join(group_category_posteriors, by = 'group_model') %>% mutate(log_posterior = log(posterior) + group_log_posterior, posterior = exp(log_posterior)) } #' Compute joint indexical-linguistic posterior #' #' @param trained data frame with \code{models} and \code{data_test} (as #' produced by \code{\link{train_models_indexical_with_holdout}}). 
#' @param obs_vars quoted names of columns in test data that together indentify #' a single observation (e.g., \code{c('Vowel', 'Token')}) #' @return a data frame with one observation per combination of group (e.g., #' Dialect), category (e.g. "ae"), and row in the ORIGINAL, un-nested data #' set, with new columns \code{group_model}, \code{category_model}, #' \code{lhood}, \code{posterior}, and \code{log_posterior}. Posterior #' probabilities sum to 1 within each cross-validation fold (e.g., Talker) + #' observation (e.g., Vowel+Token) combination, over all values of category #' and group. #' trained_to_joint_post <- function(trained, obs_vars) { trained %>% mutate(conditional_posteriors = map2(data_test, models, compute_category_post_given_group), group_posteriors = map(conditional_posteriors, . %>% group_by_(.dots=obs_vars) %>% mutate(log_lhood = log(lhood)) %>% marginalize_log('log_lhood', 'group_model') %>% ungroup() %>% aggregate_log_lhood('log_lhood', 'group_model') %>% normalize_log_probability('log_lhood')), joint_posteriors = map2(conditional_posteriors, group_posteriors, compute_joint_category_group_post)) %>% unnest(joint_posteriors) } #' @param d data frame #' @param holdout Column defining cross validation folds #' @param ... additional arguments passed to \code{\link{train_models}}. classify_by_talker_cv <- function(d, holdout='Token', category='Vowel', ...) { train <- partial(train_models, grouping=category, ...) d %>% phondisttools::train_test_split(holdout=holdout) %>% mutate(models_trained = map(data_train, . %>% group_by(Talker) %>% train()), models_tested = map2(data_test, models_trained, classify, category=category)) %>% unnest(models_tested) %>% mutate(grouping = 'Talker', group_is = 'Known', group = Talker) %>% rename(category_model = model) } ``` ```{r vowel-classification-models-group-known, cache=TRUE, dependson=c('vowel-data', 'classification-helpers')} ## we need posterior prob of true category to be able to compute the likelihood ## ratio metric so collect it here. acc_method <- "posterior" min_talker_per_group <- function(d) { d %>% do(n = length(unique(.$Talker))) %>% select_('n') %>% unlist() %>% min() } ## classify and get accuracy ## don't care about groups within train/test split (already restricted to same ## group) so we can just use train_models and classify directly train_test_acc <- function(data_train, data_test, category, ...) { data_train %>% train_models(grouping=category, cues=c("F1", "F2")) %>% rename_(category=category) %>% classify(data_test, ., 'category') %>% rename_(category_model = 'model') %>% get_accuracy(category_col = category, method=acc_method) } subsample_if_needed <- function(d, n_reps, holdout) { assert_that(nrow(d) == 1) if (d$group_size > d$subsample_size+1) { map(seq_len(n_reps), ~ mutate(d, data_train = map2(data_train, subsample_size, sample_n_groups, group=holdout), resamp_fold = .x)) %>% lift(bind_rows)() } else { mutate(d, resamp_fold = 1L) } } cluster <- create_cluster() %>% cluster_library(c("tidyverse", "phondisttools", "assertthat")) %>% cluster_copy(train_test_acc) %>% cluster_copy(subsample_if_needed) %>% cluster_copy(acc_method) set.seed(100) # randomized stuff happends before partition... 
subsample_sizes <- c(3) vowel_acc_same_group_rep <- vowel_data_grouped %>% mutate(group_size = map_dbl(data, min_talker_per_group)) %>% right_join(cross_d(list(group_size = unique(.$group_size), subsample_size = subsample_sizes)) %>% filter(group_size > subsample_size)) %>% # separate out data by groups prior to doing train/test splits unnest(map2(data, grouping, ~ nest(.x) %>% rename_(group=.y))) %>% unnest(map(data, train_test_split, holdout="Talker")) %>% ungroup() %>% mutate(n_ = 1:n()) %>% # can't use rowwise() because it makes do extract list elements partition(cluster=cluster) %>% group_by(n_) %>% # downsample training sets where necessary do(subsample_if_needed(., n_reps=100, holdout="Talker")) %>% mutate(acc = map2(data_train, data_test, train_test_acc, category="Vowel")) %>% collect() %>% unnest(acc) ``` ```{r vowel-classification, cache=TRUE, dependson=c('classification-helpers')} cluster <- cluster_copy(cluster, classify_by_talker_cv) vowel_talker_class <- vowel_data %>% partition(cluster=cluster) %>% mutate(acc = map(data, classify_by_talker_cv, holdout='Token', category="Vowel",cues=c("F1", "F2"))) %>% collect() %>% unnest(acc) %>% mutate(group_is = 'Known') ``` ```{r heald-vowel-classifier, cache=TRUE, dependson=c("classification-helpers", "heald-data")} heald_models n_heald_token_samples <- 100 heald_samples <- heald_models %>% filter(grouping == "Talker") %>% pull(models) %>% first() %>% unnest(model %>% map(r_model, n=n_heald_token_samples) %>% map(as_data_frame)) %>% mutate(Talker = group, Gender = ifelse(Talker %in% male_speakers, "m", "f"), Marginal = "all", Token = 1:n()) heald_vowel_class <- heald_models %>% mutate(data = grouping %>% map(~ mutate_(heald_samples, group=.x) %>% group_by(group) %>% nest()), models = models %>% map(group_by, group) %>% map(nest, .key="model"), ## map(~ mutate(.x, group_models=map(group_models, list_models, names_col="Vowel"))) %>% classified = map2(models, data, left_join) %>% map(~ map2(.x$data, .x$model, classify, category="Vowel")) %>% map(lift(bind_rows)) ) %>% unnest(classified) ``` ```{r vot-classification-models, cache=TRUE, dependson=c('vot-data', 'classification-helpers')} set.seed(101) vot_models <- vot_by_place_grouped %>% filter(grouping != 'Talker') %>% mutate(trained = map2(data, grouping, train_models_indexical_with_holdout, category = 'voicing', cues = 'vot'), joint_posteriors = map(trained, trained_to_joint_post, obs_vars = c('voicing', 'Token'))) ``` ```{r vot-classification, cache=TRUE, dependson=c('vot-classification-models', 'classification-helpers')} vot_joint_class <- vot_models %>% unnest(map2(joint_posteriors, grouping, ~ rename_(.x, group=.y))) %>% group_by(place, cues, grouping) vot_marginal_class <- vot_joint_class %>% group_by(Talker, voicing, Token, group, category_model, add=TRUE) %>% marginalize_log('log_posterior') %>% normalize_log_probability('log_posterior') %>% mutate(group_is = 'Inferred') vot_true_group_class <- vot_joint_class %>% filter(group == group_model) %>% group_by(Talker, voicing, Token, add=TRUE) %>% normalize_log_probability('log_posterior') %>% mutate(group_is = 'Known') vot_talker_class <- vot_by_place %>% unnest(map(data, . 
%>% group_by(Talker, voicing) %>% mutate(split=ntile(runif(length(Talker)), 6)) %>% classify_by_talker_cv(holdout='split', category='voicing', cues='vot') ) ) vot_class <- bind_rows(vot_marginal_class, vot_true_group_class, vot_talker_class) ``` ```{r check-classification, results='hide', cache=TRUE, dependson=c('vot-classification')} vot_class %>% group_by(cues, place, group_is, grouping, Talker, voicing, Token) %>% summarise(n_choice = sum(posterior_choice), sum_post = sum(posterior)) %$% assert_that(all(n_choice == 1), all.equal(sum_post, rep(1, length(sum_post)))) ``` ```{r classification-accuracy} chance_acc <- tribble( ~contrast, ~chance_acc, "Vowels (NSP)", 1/11, "Vowels (HN15)", 1/7, "Stop voicing", 1/2 ) vot_accuracy <- vot_class %>% get_accuracy('voicing', method=acc_method) %>% mutate(accuracy = as.double(accuracy), contrast = "Stop voicing") vowel_talker_acc <- vowel_talker_class %>% get_accuracy('Vowel', method=acc_method) %>% mutate(accuracy = as.double(accuracy)) vowel_accuracy <- vowel_acc_same_group_rep %>% filter(subsample_size==3) %>% group_by(cues, contrast, grouping, subsample_size, group, Talker, Vowel, Token) %>% summarise(accuracy = mean(accuracy)) %>% bind_rows(mutate(vowel_talker_acc, accuracy = as.double(accuracy))) %>% mutate(group_is = 'Known') heald_vowel_acc <- heald_vowel_class %>% rename(category_model = model) %>% get_accuracy("Vowel", acc_method) %>% mutate(accuracy = as.double(accuracy), group_is = "Known") logodds <- function(x) log(x) - log(1-x) accuracy <- vowel_accuracy %>% select(-Age) %>% bind_rows(vot_accuracy, heald_vowel_acc) %>% left_join(chance_acc) %>% ungroup() %>% prettier_grouping() accuracy_by_talker <- accuracy %>% filter(group_is == "Known") %>% group_by(contrast, cues, grouping, group, group_is, chance_acc, Talker) %>% summarise(accuracy = mean(accuracy), n = n()) %>% mutate(accuracy_logodds = logodds(accuracy) - logodds(chance_acc)) ## summary for each contrast/cues/grouping level, bootstrapped by talker accuracy_summary <- accuracy_by_talker %>% group_by(contrast, cues, grouping, group_is) %>% do(daver::boot_ci(.$accuracy, function(d,i) logodds(mean(d[i])))) %>% left_join(chance_acc) %>% rename(accuracy = observed, accuracy_low = ci_lo, accuracy_high = ci_high) %>% mutate_at(vars(starts_with("accuracy")), funs(.-logodds(chance_acc))) accuracy_summary_perc <- accuracy_by_talker %>% group_by(contrast, cues, grouping, group_is) %>% do(daver::boot_ci(.$accuracy, function(d,i) mean(d[i]))) %>% left_join(chance_acc) %>% rename(accuracy = observed, accuracy_low = ci_lo, accuracy_high = ci_high) ``` ```{r talker-advantage-acc, eval=FALSE} ## Talker advantage ## (TODO: incorporate) ## TODO: fix me talker_advantage_acc <- accuracy_by_talker %>% ## group_by(cues, contrast, grouping, Talker) %>% ## summarise(accuracy=mean(accuracy)) %>% rename(Talker_=Talker) %>% group_by(cues, contrast, grouping) %>% spread(grouping, accuracy) %>% gather(comparison, accuracy, -cues, -contrast, -Talker, -Talker_) %>% filter(!is.na(accuracy)) %>% mutate(talker_advantage = Talker - accuracy) %>% group_by(contrast, cues, comparison) %>% do({ boot_ci(.$talker_advantage, function(d,i) mean(d[i], na.rm=TRUE), h0=0) }) %>% filter(is.finite(observed)) ``` ```{r acc-pairwiwse-boot, cache=TRUE, dependson="classification-accuracy"} ## compute all pairwise differences in (log-odds) accuracy accuracy_by_talker_pairwise <- accuracy_by_talker %>% group_by(contrast, cues, grouping) %>% summarise() %>% nest() %>% unnest( # generate pairs of groupings data %>% map(pull, grouping) 
%>% map( ~ cross_df(list(to=., from=.))) %>% map(filter, as.numeric(from) < as.numeric(to))) %>% # join in data for from and to left_join(accuracy_by_talker, by=c(from="grouping", "cues", "contrast")) %>% left_join(accuracy_by_talker, by=c("contrast", "cues", "group_is", "chance_acc", "Talker", to="grouping"), suffix=c("_from", "_to")) accuracy_by_vowel_talker_pairwise <- accuracy_by_talker %>% group_by(contrast, cues, grouping) %>% summarise() %>% nest() %>% unnest( # generate pairs of groupings data %>% map(pull, grouping) %>% map( ~ cross_df(list(to=., from=.))) %>% map(filter, as.numeric(from) < as.numeric(to))) %>% # join in data for from and to left_join(accuracy_by_talker, by=c(from="grouping", "cues", "contrast")) %>% left_join(accuracy_by_talker, by=c("contrast", "cues", "group_is", "chance_acc", "Talker", to="grouping"), suffix=c("_from", "_to")) accuracy_diffs_boot <- accuracy_by_talker_pairwise %>% group_by(contrast, cues, from, to, group_is) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) ``` ## Results {#sec:recog-results} First, I report and discuss the overall utility of the different grouping variables for the stop voicing database and the two vowel databases I used. Second, I discuss the effect of vowel normalization on utility. Third and finally, I examine how utility varies across dialects and individual vowels. ```{r overall-accuracy-group-known, dependson=c("grouping-and-colors"), fig.width=9.8, fig.height=6.4, fig.cap='Average information gain in log-odds relative to chance (top) measures the utility of each grouping variable. Bottom shows posterior probability of correct category for comparison. Small points show individual talkers. Large points and lines show mean and bootstrapped 95% CIs over talkers (see text for details).'} # calculate layout of bars and stars min_offset <- 0.1 nudge <- 0.1 maxes <- accuracy_by_talker %>% group_by(contrast, cues, grouping) %>% summarise(max_acc = max(accuracy_logodds)) accuracy_diffs_boot %>% filter(boot_p < 0.05) %>% left_join(maxes, by=c("contrast", "cues", from="grouping")) %>% left_join(maxes, by=c("contrast", "cues", to="grouping")) %>% mutate(max_acc = max(max_acc.x, max_acc.y), max_acc.x = NULL, max_acc.y = NULL) %>% ## sort by the max of the underlying data points within each facet group_by(contrast, cues) %>% arrange(max_acc, .by_group=TRUE) %>% ## add a slight offset mutate(offset = max_acc - lag(max_acc, default=0), adj_max = cumsum(ifelse(offset > min_offset, offset, min_offset))+nudge) -> acc_diff_boot_bars p_percentage <- ggplot(accuracy_by_talker %>% cues_plus_contrast(), aes(x=grouping, y=accuracy, color=grouping)) + geom_quasirandom(alpha=0.2) + geom_pointrange(data=accuracy_summary_perc %>% cues_plus_contrast(), aes(ymin=accuracy_low, ymax=accuracy_high)) + facet_grid(.~cues_contrast, scales="free_x", space="free_x") + labs(x="Grouping", y="Accuracy\n(post.
prob.)") + rotate_x_axis_labs() + theme(strip.background = element_blank(), strip.text=element_blank()) + background_grid(major="xy") + scale_color_grouping() p_logodds <- accuracy_summary %>% filter(group_is == "Known") %>% cues_plus_contrast() %>% ggplot() + geom_quasirandom(data = accuracy_by_talker %>% cues_plus_contrast(), aes(x = grouping, y=accuracy_logodds, color = grouping), alpha=0.2)+ geom_pointrange(aes(x = grouping, y=accuracy, ymin=accuracy_low, ymax=accuracy_high, color = grouping), stat='identity')+ geom_segment(data=acc_diff_boot_bars %>% cues_plus_contrast(), aes(x=from, xend=to, y=adj_max, yend=adj_max, alpha=boot_p))+ facet_grid(. ~ cues_contrast, scales='free_x', space='free_x') + scale_alpha(range=c(1, 0), limits=c(0, 0.05)) + labs(y = 'Information gain over random guessing\n(log-odds relative to chance)', alpha = "Bootstrapped\np-value", color = "Grouping\nvariable") + lims(y=c(0,NA)) + theme(plot.title=element_text(hjust=0), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + background_grid(major="xy") + scale_color_grouping() plot_grid(p_logodds + theme(legend.position="none"), p_percentage + theme(legend.position="none"), align='v', ncol=1, rel_heights=c(1.6,1)) %>% plot_grid(., get_legend(p_logodds), rel_widths=c(0.8, 0.2)) ``` Utility can be measured with respect to a number of baselines. First, by measuring information gain relative to chance performance (random guessing), we get a measure of the absolute utility of a particular socio-indexical grouping. This measure is plotted in Figure @fig:overall-accuracy-group-known. All grouping variables---even the marginal grouping which considers all talkers together---provide _some_ information gain over random guessing, between 2 and 4 log-odds. Moreover, marginal distributions for vowels (with un-normalized F1×F2) and stop voicing show similar amounts of information gain over random guessing, despite different numbers of categories and cues and very different levels of overall accuracy (Figure @fig:overall-accuracy-group-known, bottom panel). This suggests that information gain could be a useful metric for utility across different phonetic categories and cues. Second, by comparing information gain between different grouping variables, we get a measure of _relative_ utility, or how much additional information a listener would gain about the talker's intended category by tracking (and using) these distributions. As expected, within each contrast/cue combination, the marginal cue distributions (from all talkers) provide the least information gain, while talker-specific distributions provide the most. 
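To make this measure concrete: information gain is just the difference in log-odds between observed accuracy and chance accuracy. The sketch below (purely illustrative, with made-up numbers; not part of the analysis pipeline) shows the calculation using the same `logodds` helper defined in the analysis code above.

```{r info-gain-illustration, eval=FALSE, echo=TRUE}
## Illustration only: information gain over random guessing is the difference
## between accuracy and chance accuracy on the log-odds scale.
logodds <- function(x) log(x) - log(1 - x)

## e.g., a hypothetical talker whose vowels are recognized correctly 90% of
## the time on an 11-way contrast, where chance accuracy is 1/11 (about 9%):
acc    <- 0.90
chance <- 1 / 11
logodds(acc) - logodds(chance)  ## about 4.5 log-odds of information gain
```

Differences in this quantity between grouping variables are exactly the relative-utility comparisons reported in the following paragraphs.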
```{r info-gain-summaries} format_advantage <- function(d, ci_descrip='95% CI', p=TRUE, paren=TRUE) { adv_string <- sprintf('%.2f', d$observed) ci_string <- sprintf('%s [%.2f--%.2f]', ci_descrip, d$ci_lo, d$ci_high) p_string <- paste(',', daver::p_val_to_less_than(d$boot_p)) if (paren) paste0(adv_string, ' (', ci_string, ifelse(p, p_string, ''), ')') else paste0(adv_string, ', ', ci_string, ifelse(p, p_string, '')) } hz_marg_talker <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Hz)", from == "Marginal", to == "Talker") lob_marg_talker <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Lobanov)", from == "Marginal", to == "Talker") hz_marg_gender <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Hz)", from == "Marginal", to == "Gender") vot_marg_talker <- accuracy_diffs_boot %>% filter(cues == "VOT", from == "Marginal", to == "Talker") hz_marg_dialect <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Hz)", from == "Marginal", to == "Dialect") lob_marg_dialect <- accuracy_diffs_boot %>% filter(cues == "F1×F2 (Lobanov)", from == "Marginal", to == "Dialect") ``` Despite similar levels of utility for marginal distributions, vowels and stop voicing show very different levels of utility for group- or talker-specific distributions. For voicing (VOT), there is minimal---if any---additional benefit to using cue distributions from more specific groupings; only talker-specific distributions provide any additional information gain over marginal, and this gain is small (log-odds of `r format_advantage(vot_marg_talker, p=FALSE, paren=FALSE)`). At the other extreme, for vowels, using talker-specific F1×F2 distributions increases utility over marginal distributions by log-odds of `r format_advantage(hz_marg_talker[2,], p=FALSE)` for the NSP data and `r format_advantage(hz_marg_talker[1,], p=FALSE)` for the HN15 data. Even less-specific groupings like gender still have reliable additional utility over marginal distributions for vowels (NSP: log-odds of `r format_advantage(hz_marg_gender[1,], p=FALSE, paren=FALSE)`; HN15: log-odds of `r format_advantage(hz_marg_gender[2,], p=FALSE, paren=FALSE)`). ### Normalization of vowel formants The results of study 1 showed that Lobanov normalization makes talker-specific formant distributions less informative, relative to marginal. Thus, we might expect that there will be lower _utility_ for talker-specific distributions as well. However, as Figure @fig:overall-accuracy-group-known shows, the utility of talker-specific distributions _per se_ is not lower for Lobanov vs. raw F1×F2. Nevertheless, the additional utility of talker-specific over marginal distributions goes down because the baseline utility of marginal distributions goes up. Lobanov normalization removes much of the across-talker variability, leading to less overlap between the marginal distributions for individual vowels, less confusion between categories, and higher accuracy. But for individual talkers considered alone, linear transformations like Lobanov normalization have no effect, since they leave the relative positions and sizes of the category distributions unchanged. Hence, the utility of talker-specific distributions is exactly the same for raw and Lobanov-normalized F1×F2.
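The invariance of talker-specific classification under normalization can be illustrated with a small simulation (illustrative only, not part of the analysis pipeline; all values are invented). Applying the same linear transformation (here, z-scoring by a single talker's overall formant mean and SD, as in Lobanov normalization) to all of that talker's tokens before re-fitting Gaussian category models leaves the category posteriors unchanged, because the transformation cancels out of the likelihood ratio.

```{r lobanov-invariance-sketch, eval=FALSE, echo=TRUE}
## Sketch: per-talker linear normalization leaves talker-specific
## classification untouched, since Gaussian category posteriors are invariant
## under an affine transformation applied to all of a talker's tokens.
library(mvtnorm)
set.seed(1)

## fake F1xF2 tokens from one hypothetical talker, two vowel categories
vowel_a <- rmvnorm(50, mean = c(700, 1200), sigma = diag(c(60, 90)^2))
vowel_i <- rmvnorm(50, mean = c(350, 2300), sigma = diag(c(40, 120)^2))
tokens  <- rbind(vowel_a, vowel_i)

## Lobanov-style normalization: z-score each formant by this talker's
## overall mean and SD, ignoring category
normed <- scale(tokens)

## posterior probability of /a/ under Gaussian models fit to each category
posterior_a <- function(x, a, i) {
  lik_a <- dmvnorm(x, colMeans(a), cov(a))
  lik_i <- dmvnorm(x, colMeans(i), cov(i))
  lik_a / (lik_a + lik_i)
}

post_raw    <- posterior_a(tokens, vowel_a, vowel_i)
post_normed <- posterior_a(normed, normed[1:50, ], normed[51:100, ])
all.equal(post_raw, post_normed)  ## TRUE, up to numerical precision
```

The same cancellation is why the marginal (all-talker) distributions, which pool over talkers with different means and scales, _do_ change under normalization while the talker-specific ones do not.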
While the reduction in additional utility for talker-specific distributions is predictable based on the lower informativity (study 1), the _extent_ of this reduction is surprising: using talker-specific distributions of raw F1×F2 Hz provides additional information gain of `r format_advantage(hz_marg_talker[2,], p=FALSE, paren=TRUE)`, which drops to `r format_advantage(lob_marg_talker, p=FALSE, paren=TRUE)` after Lobanov normalization. This is comparable to the additional utility of talker-specific VOT distributions (`r format_advantage(vot_marg_talker, p=FALSE, paren=FALSE)`). That is, after normalization to remove overall shifts in F1×F2, the consequences of talker variability in vowel and stop voicing distributions for _speech recognition_ may actually be more comparable than suggested by the informativity measured in study 1. As with informativity, Lobanov normalization also reveals additional _structure_ in that talker variability. For raw F1×F2, dialect provides only weakly reliable additional utility over marginal distributions (log-odds of `r format_advantage(hz_marg_dialect, p=FALSE, paren=FALSE)`). For Lobanov-normalized F1×F2, the additional utility of dialect is both larger and more reliable (log-odds of `r format_advantage(lob_marg_dialect, p=FALSE, paren=FALSE)`). ### Dialect ```{r dialect-diffs, cache=TRUE, dependson=c("grouping-and-colors"), fig.width=7.2, fig.height=5.2, fig.cap="The advantage of knowing a talker's dialect varies by dialect. Knowing that a talker comes from the North dialect region provides a consistent benefit, regardless of cues (Hz or Lobanov-normalized) or baseline (marginal or gender). Otherwise, dialect does not provide consistent information gain except when using Lobanov-normalized cue values, and even then it varies by dialect.
Each point shows one talker, the error bars bootstrapped 95% CIs by talker, and the stars bootstrapped $p$-values adjusted for false discovery rate [@Benjamini1995]."} dialect_vs <- tribble( ~from, ~to, "Marginal", "Dialect", "Gender", "Dialect+Gender" ) %>% prettier_grouping(from) %>% prettier_grouping(to) talker_dialects <- nsp_vows %>% group_by(Talker, Dialect) %>% summarise() dialect_advantage_diffs_boot <- accuracy_by_talker_pairwise %>% filter(contrast == "Vowels (NSP)") %>% inner_join(dialect_vs) %>% left_join(talker_dialects) %>% group_by(contrast, cues, from, to, Dialect) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) %>% mutate(boot_p_fdr = p.adjust(boot_p, "fdr"), boot_p_stars = daver::p_val_to_stars(boot_p_fdr), from_to = paste(to, "vs.", from)) accuracy_by_talker_pairwise %>% inner_join(dialect_vs) %>% inner_join(talker_dialects) %>% mutate(from_to = paste(to, "vs.", from)) %>% ggplot(aes(x=Dialect)) + geom_hline(yintercept=0, color="gray50") + geom_quasirandom(aes(y=logodds(accuracy_to)-logodds(accuracy_from), color=to), alpha=0.2) + geom_pointrange(data=dialect_advantage_diffs_boot, aes(y=observed, ymin=ci_lo, ymax=ci_high, color=to)) + geom_text(data = dialect_advantage_diffs_boot, aes(y=ci_high, label=boot_p_stars), nudge_y=0.05) + facet_grid(contrast+cues~from_to) + labs(x="", y="Additional information gain\n(log-odds ratio)", color="Advantage\nover") + ggtitle("Additional information gain from knowing dialect", subtitle = "By dialect") + rotate_x_axis_labs() + theme(plot.title=element_text(hjust=0)) + scale_color_grouping(labels=dialect_vs$from) ``` ```{r vowel-dialect-diffs, cache=TRUE, dependson=c("grouping-and-colors"), fig.width=7.2, fig.height=4.1, fig.cap="The information gained from knowing a talker's dialect also varies by the particular vowel. Vowels undergoing active sound change in multiple dialects of American English (like /æ/, /ɛ/, /ɑ/, and /u/) tend to benefit more from knowing dialect. (Single-talker estimates of information gain are not shown because the small sample size $n\\leq5$ for individual talkers makes them numerically unstable, while the overall log-odds ratios calculated from the mean accuracies are more stable.) CIs are 95% bootstrapped CIs for the mean over talkers.
All $p>0.01$ (corrected for false discovery rate), and whether an individual $p$ value is less than or greater than $p=0.05$ is sensitive to the bootstrap and subsampling randomization, so stars are not shown."} set.seed(1100) accuracy_by_talker_vowel <- accuracy %>% group_by(contrast, cues, grouping, group, Vowel, Talker) %>% summarise(accuracy = mean(accuracy), n = n()) accuracy_by_vowel_talker_pairwise <- accuracy_by_talker_vowel %>% filter(contrast == "Vowels (NSP)") %>% group_by(cues, grouping) %>% summarise() %>% nest() %>% unnest( # generate pairs of groupings data %>% map(pull, grouping) %>% map( ~ cross_df(list(to=., from=.))) %>% map(filter, as.numeric(from) < as.numeric(to))) %>% inner_join(dialect_vs) %>% # join in data for from and to inner_join(accuracy_by_talker_vowel, by=c(from="grouping", "cues")) %>% inner_join(accuracy_by_talker_vowel, by=c("contrast", "cues", "Talker", "Vowel", to="grouping"), suffix=c("_from", "_to")) dialect_advantage_by_vowel_diffs_boot <- accuracy_by_vowel_talker_pairwise %>% inner_join(talker_dialects) %>% group_by(cues, to, from, Vowel) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) %>% ungroup() %>% mutate(boot_p_fdr = p.adjust(boot_p, "fdr"), boot_p_stars = daver::p_val_to_stars(boot_p_fdr, cutoffs = c(1, .1, .05, .01), stars = c("", "·", "*", "**")), Vowel=factor(Vowel, levels=levels(nsp_vows$Vowel))) dialect_advantage_by_vowel_diffs_boot %>% mutate(from_to = paste(to, "vs.", from)) %>% ggplot(aes(x=Vowel, color=to)) + geom_hline(yintercept=0, color="gray50") + ## very noisy (small n): ## geom_quasirandom(data = accuracy_by_vowel_talker_pairwise %>% ## mutate(from_to = paste(to, "vs.", from)), ## aes(y=logodds(accuracy_to)-logodds(accuracy_from)), alpha=0.1) + geom_pointrange(aes(y=observed, ymin=ci_lo, ymax=ci_high)) + geom_label(aes(y=observed, label=Vowel), label.padding=unit(0.1, "lines"), show.legend=FALSE) + facet_grid(cues~from_to) + ## geom_text(aes(y=ci_high, label=boot_p_stars), nudge_y=0.1, color="black") + labs(x="", y="Additional information gain\n(log-odds ratio)", color="Advantage\nover") + ggtitle("Additional information gain from knowing dialect", subtitle = "By vowel") + theme(plot.title=element_text(hjust=0)) + scale_color_grouping(labels=dialect_vs$from) ``` ```{r dialect_and_vowel, eval=FALSE} ## The goal is to make a figure somewhat like the dialect contour plots for KL ## above.
## but it's kind of a lot to take in, and not particularly informative
accuracy_by_vowel_talker_pairwise %>% inner_join(talker_dialects) %>% group_by(Vowel, Dialect, cues, contrast, to, from) %>% do( daver::boot_ci(., function(d,i) logodds(mean(d$accuracy_to[i])) - logodds(mean(d$accuracy_from[i])), h0=0) ) %>% group_by(cues, contrast) %>% mutate(boot_p_fdr = p.adjust(boot_p, "fdr")) %>% mutate(Vowel=factor(Vowel, levels=levels(nsp_vows$Vowel))) %>% { ggplot(., aes(x=Dialect, y=observed, color=Vowel, group=Vowel)) + geom_line() + geom_point(data=subset(., boot_p_fdr<0.05)) + facet_grid(from ~cues+contrast ) + rotate_x_axis_labs() } ``` ```{r kl-vs-acc, eval=FALSE} accuracy_by_vowel_talker_pairwise %>% inner_join(talker_dialects) %>% group_by(cues, to, from, contrast, Vowel, Dialect) %>% summarise(accuracy_logodds = mean(logodds(accuracy_to) - logodds(accuracy_from))) %>% left_join(vowel_kl, by=c("contrast", "cues", Dialect="group", "Vowel")) %>% ggplot(aes(x=KL, y=accuracy_logodds)) + geom_text(aes(color=Dialect, label=Vowel)) + stat_smooth(method="lm") + stat_smooth(method="lm", aes(color=Dialect), alpha=0) + facet_grid(from ~ cues, scales="free_y") ``` Study 1 found that the informativity of dialect about formant distributions depended on both the dialect and the specific vowel. Similarly, the _utility_ of using dialect-specific cue distributions (relative to marginal or gender-specific) varies by dialect (Figure @fig:dialect-diffs) and vowel (Figure @fig:vowel-dialect-diffs). Talkers from the North dialect region have a consistent additional information gain from using dialect- or dialect+gender-specific cue distributions, regardless of normalization. This likely reflects the fact that under the Northern Cities Shift the /æ/ vowel is raised, making it highly overlapping with the /ɛ/ from talkers of other dialects and leading to reduced accuracy. With un-normalized F1×F2, no other dialects show a consistent benefit from dialect-specific cue distributions (either alone or with dialect+gender). However, with Lobanov-normalized F1×F2, using dialect-specific distributions _does_ lead to better vowel recognition (on the order of log odds of 0.4) for many---but not all---dialects, especially when additionally considering gender. Somewhat surprisingly, even with normalized F1×F2, there is no consistent information gain from using dialect-specific cue distributions for Southern speakers. @Clopper2005 found that these same speakers demonstrated many of the vowel shifts that are characteristic of this dialect region [@Labov2006], and the results of study 1 (Figure @fig:vowel-kl-by-dialect, left) show that on average, Southern speakers' distributions do diverge from the marginal. But study 1 _also_ found that no _individual_ Southern vowel distributions diverged enough from the marginal to be significantly more informative than a random grouping of talkers (Figure @fig:vowel-kl-by-dialect, right), at least after correcting for multiple comparisons. As with individual dialects, individual vowels vary in the extent to which conditioning on dialect provides additional information. Figure @fig:vowel-dialect-diffs shows that for most vowels, there is little evidence that conditioning on dialect consistently provides additional information gain across dialects.
There is weak evidence that a few vowels may get a reliable boost with normalized formants, like /æ/, /ɛ/, and /ɑ/, all of which are undergoing sound change in at least one dialect, and also show high informativity across dialects (Figure @fig:vowel-kl-by-dialect).[^weak-pvals] `r label("r1-pvals")` [^weak-pvals]: I do not report on significance of individual vowel effects here because they are estimated using a randomized procedure---both at the level of subsampling talkers to estimate the accuracy, and at the level of bootstrapping to estimate statistical significance---and all $p>0.01$ after correcting for false discovery rate. I found that, even with a reasonably large number of subsampling and bootstrap iterations (100 and 1000, respectively), individual effects that are weakly significant in one run ($0.05 > p > 0.01$) are often only "marginally significant" ($0.1 > p > 0.05$) in another. Properly assessing the reliability of these effects is best left to future experiments designed to detect them. ## Discussion `r label("discussion2")` ```{r acc-changes} hz_err_perc <- accuracy_summary_perc %>% filter(str_detect(cues, "Hz"), str_detect(contrast, "NSP")) %>% { set_names(pull(., accuracy), pull(., grouping)) } %>% map_dbl(~ round(100*(1-.x))) %>% map_chr(~ sprintf("%d%%", .x)) lob_err_perc <- accuracy_summary_perc %>% filter(str_detect(cues, "Lobanov")) %>% { set_names(pull(., accuracy), pull(., grouping)) } %>% map_dbl(~ round(100*(1-.x))) %>% map_chr(~ sprintf("%d%%", .x)) vot_err_perc <- accuracy_summary_perc %>% filter(str_detect(cues, "VOT")) %>% { set_names(pull(., accuracy), pull(., grouping)) } %>% map_dbl(~ round(100*(1-.x))) %>% map_chr(~ sprintf("%d%%", .x)) ``` Despite dramatic differences between vowels and stop voicing in the informativity of talker- and group-conditioned distributions (study 1), the results of this study show that the _utility_ of conditioning phonetic category judgements on talker or group is more comparable, especially for normalized formants. Using talker-specific cue distributions improves correct recognition of stop voicing and vowels by about log-odds of 0.5, except for un-normalized formants, where the improvement is more like 1.5 log-odds. This seems like a relatively small information gain, especially since marginal distributions themselves provide more than 4-6 times that much information gain over random guessing. However, when converted back to error percentage, the information gain from talker-specific distributions corresponds to avoiding about one out of every five errors: a change in error rate from `r lob_err_perc["Marginal"]` to `r lob_err_perc["Talker"]` for (normalized) NSP vowels, and from `r vot_err_perc["Marginal"]` to `r vot_err_perc["Talker"]` for stop voicing. These errors would not always lead to high-level misunderstanding, but avoiding them nevertheless reduces the burden on the listener to reconcile conflicting lexical, contextual, or phonetic information. While helpful, these differences in error rates show that using talker- or group-specific distributions is not a make-or-break factor in recognizing vowels or stop voicing. Rather, they make comprehension more robust and efficient. One major caveat is that this is only true for _normalized_ vowel formants. For raw Hz, using talker-specific distributions eliminates nearly two out of every three errors (`r hz_err_perc["Marginal"]` vs. `r hz_err_perc["Talker"]`). Using gender-specific distributions is only moderately helpful (error rate of `r hz_err_perc["Gender"]`).
This means that listeners can benefit greatly from extracting _some_ talker-specific factor. Whether that factor is the separate means and variances of each category, or the _overall_ mean and variance of each _cue_ (as is used in Lobanov normalization), is a question that remains to be addressed in future work. As I discuss further below, either of these is compatible with Bayesian models like the ideal adapter that learn from experience. # General Discussion Recent theories of speech recognition propose that listeners deal with talker variability by taking advantage of statistical contingencies between socio-indexical variables (talker identity, gender, dialect, etc.) and acoustic-phonetic cue distributions [@Kleinschmidt2015; @McMurray2011a; @Sumner2014]. A major question that these theories raise is _which_ contingencies listeners should learn and use. Listeners cannot learn and use every possible contingency, since they are limited by finite cognitive resources. Moreover, as I discuss below, listeners _should_ not draw on every possible contingency given their finite experience. As a first step towards answering this question, I used computational methods from ideal observer/adapter models to quantify the _degree_ and _structure_ of talker variability. I measured the extent to which a range of socio-indexical variables are 1) _informative_ about category-specific cue distributions and 2) _useful_ for recognizing phonetic categories, in two phonetic domains: vowels and word-initial stop voicing. Overall, I found that there is less talker variability for VOT than for vowel formants, and talker variability for VOT is less structured, at least according to the socio-indexical grouping variables investigated here. Variability in vowel formant distributions is _structured_, and a talker's dialect, gender, or the combination thereof are each informative about vowel-specific cue distributions. Moreover, tracking group- or talker-specific cue distributions also improves vowel recognition, although the biggest gains by far come from tracking the overall mean and variance of a talker's formants (disregarding category)---that is, the information required to normalize for overall shifts in formants. In the remainder of this paper, I discuss the implications of these results. First, the ideal adapter generally predicts that listeners should track conditional distributions for groups that are informative and useful for speech recognition. By directly quantifying the utility and informativity of a number of grouping variables, these results are a step towards making more specific predictions about what group-level representations listeners should maintain if, as assumed by the ideal adapter, they are taking advantage of the structure that is actually present in cross-talker variability. Second, I argue that my results shed light on why studies on perceptual learning have obtained seemingly conflicting results for different phonetic contrasts. Third and finally, I discuss how these measures of the informativity/utility of socio-indexical variables like gender, age, and dialect correspond to a _starting point_ for talker-specific learning. ## What to track? Even without taking into account processing limitations, an ideal adapter should not track _everything_. Rather, listeners should only track the joint distributions of variables that are informative/useful.
At the level of phonetic categories themselves, this means that (for instance) there is no reason for listeners to track vowel-specific distributions of temperature or barometric pressure. Likewise for socio-indexical grouping variables: listeners get no benefit from tracking separate distributions for different groups of talkers for a cue that does not systematically vary between those groups. In fact, it can actually _hurt_ a listener to track cue distributions at a level that is not informative. The reason for this is related to one of the most central challenges to learning, the bias-variance trade-off [@James2013, Section 2.2.2]. In general, the bias-variance trade-off says that accuracy is a function of two things: the _bias_ of the model (e.g., from being too simple or having the wrong structure) and the _variance_ of the model's parameter estimates (e.g., from not having enough data). For the present purpose, this means that tracking multiple distributions, each of which must be estimated from only a fraction of the available observations, will result in noisier, less accurate estimates than lumping all the observations together in a single distribution. This price may be worth paying for a listener when there are large enough differences between groups that treating all observations as coming from the same distribution _biases_ the estimates of the underlying distribution (and hence the inferences that listeners make based on those distributions) far enough away from the true structure of the data. To take a concrete example, modeling each vowel as a single distribution of (un-normalized) formants across all talkers results in broad, overlapping distributions which have low recognition accuracy. But modeling them as two distributions---one for males, and one for females---provides more specific estimates and higher classification accuracy, as shown by Figures @fig:vowel-vot-kl-plot and @fig:overall-accuracy-group-known [and in @Hillenbrand1995; @Feldman2013a]. Thus, the ideal adapter predicts that listeners should learn separate cue distributions for levels of a socio-indexical grouping variable when that variable has high _informativity_ about some categories' cue distributions and/or high _utility_ for speech recognition. To be precise, this is the prediction if the goal of speech perception is the robust inference of linguistic categories (such as phonetic or phonological categories, words, or phrases). Listeners also extract, for example, social and emotional information from the speech signal. Sociolinguistic research has recognized that, in many cases, the communication of social information is just as important as---if not more important than---the communication of linguistic information [@Clopper2006; @Clopper2007; @Cohen2012; @Eckert2012; @Labov1972; @Remez1997; @Thomas2002]. Groupings that are _socially meaningful_ can thus be informative and useful to track with respect to the overall communicative goal, which might include the robust transmission of social identity, emotional states, and more. This means that knowledge of the joint distribution of acoustic-phonetic cues and a socio-indexical grouping can have high utility, even if ignoring that grouping has a negligible effect on speech recognition, as long as the corresponding cue distributions carry some information about relevant social variables. @KleinschmidtInPress2017 discuss this further and extend the ideal adapter to social inferences.
That work---based on the same datasets I analyze here---found two examples where a socio-indexical variable can be inferred based on cue distributions, but which I found here to provide little if any additional utility for speech recognition. The first is dialect (based on vowel F1×F2) and the second is age (older than 40/younger than 30, based on VOT distributions). An additional consideration is that listeners are not simply told which variables are informative and which are not. They must _learn_ which distributions are actually worth tracking. Moreover, every listener's experience with talker variability will be different, and so a variable that is informative in one listener's experience may be irrelevant in another's. For example, the predictions I have derived here about the relative utility of different grouping variables for speech recognition would hold for listeners whose language experience is similar to that represented in the databases I employed. This has two main consequences for the predictions that the ideal adapter makes. First, this means that listeners' response to talker variability should depend on their own particular experience with talker variability. @Clopper2006 shows some evidence that this is indeed the case. Second, in order to derive predictions for a specific listener, we would need to know more details of their own personal history with talker variability. This is a difficult task, but the ideal adapter also provides tools to probe listeners' prior beliefs _directly_ [for first steps, see @Kleinschmidt2016]. Finally, I note that listeners' associations between linguistic and socio-indexical variables do not always seem to be based on the _objective_ informativity of those variables. Rather, some variants can become disproportionately _salient_ or _enregistered_ [@Eckert2012; @Podesva2001; @Podesva2007; @Foulkes2015a; @Levon2014; @Jaeger2016]. These deviations between objective informativity and subjective salience remain to be explained and specified in more detail, as does the connection---if any---between listeners' _explicit_ social perceptions and their ability to adapt to socially-indexed linguistic variation. The methods proposed here provide a set of tools for assessing objective informativity/utility, a critical first step in understanding this relationship. ## Consequences for adapting to unfamiliar talkers The results of this study also speak to how listeners might adapt to an unfamiliar talker. The ideal adapter links informativity and utility to adaptation, and the results here allow us to make more specific predictions based on the ideal adapter, in several ways. First, the informativity of talker identity is a measure of the variability across talkers. When talker identity is highly informative, there is more variability across talkers, and the ideal adapter predicts that prior experience with other talkers will be less relevant, resulting in faster and more complete adaptation to an unfamiliar talker. I found here that talker identity is less informative about VOT distributions than it is about vowel formant distributions. Hence, the ideal adapter predicts that listeners will adapt to talker-specific VOT distributions more slowly, and be more constrained by prior experience with other talkers, compared to talker-specific formant distributions. While I am not aware of a direct quantitative test of this specific prediction, existing evidence provides indirect support for it.
@Kraljic2007 found much smaller recalibration effects for a VOT contrast (on the order of 5% changes in classification) compared to a fricative contrast (around 30%) with the same amount of exposure to each. Studies on recalibration of a word-medial /b/-/d/ contrast---which is partially cued by formant frequencies, like vowel identity---show recalibration effects of similar magnitudes to fricatives [@Kleinschmidt2015; @Vroomen2007]. This prediction is also borne out indirectly by studies that have inferred the strength of listeners' prior expectations based on their adaptation behavior [@Kleinschmidt2015; @Kleinschmidt2016]. That work finds that listeners' prior expectations are stronger---as measured by an "effective prior sample size"---when adapting to a voicing contrast (like /b/-/p/) than a stop consonant place of articulation contrast (like /b/-/d/). Second, the informativity of socio-indexical grouping variables is linked to _generalization_ across talkers: if two talkers are from groups that tend to differ, listeners should be more inclined _a priori_ to treat them separately and not generalize from experience with one talker to the other. Likewise, if two talkers are from the same group, listeners _should_ generalize. I found that talker gender is informative about vowel formant distributions, but not about VOT, which means that listeners _should_ (absent other information) generalize from a male to a female talker (and vice-versa) for a voicing contrast, but _not_ for a vowel contrast. Listeners do, in fact, tend to generalize voicing recalibration across talkers of different genders [@Kraljic2006; @Kraljic2007]. While there is to my knowledge no data on cross-talker generalization for vowel recalibration, listeners tend not to generalize across talkers for recalibration of fricatives [@Eisner2005; @Kraljic2007], which (like vowels) are cued by spectral cues that vary across talkers and by gender [@Newman2001; @Jongman2000; @McMurray2011a]. Third, and conversely, listeners should be _more_ likely to generalize between two talkers who are both members of the same informative group. In the absence of evidence that two talkers from the same group (e.g., two males) produce a contrast differently, experience with one provides an informative starting point for comprehending (and adapting to) the other. There is evidence along these lines as well: @VanderZande2014 found that listeners generalize from experience with one male talker's pronunciation of a /b/-/d/ contrast to another, unfamiliar male. Note that such generalization should depend on how informative (and variable) a grouping variable like gender is _across contexts_, since generalization from experience with one other male talker in an experimental context is very different from generalization from _all_ other male talkers across all contexts. `r label("r1-context-generalization")` Finally, these predictions are best thought of as _prior biases_ that might be overcome with enough of the right kind of evidence [@Kleinschmidt2015]. For instance, listeners can overcome their bias to generalize experience with VOT and learn talker-specific VOT distributions, but it requires hundreds of observations from talkers who produce very different VOT distributions [@Munson2011]. Likewise, listeners will generalize recalibration of a fricative contrast from a female to a male talker when test stimuli are selected to increase perceptual similarity between the two test continua [@Reinisch2014].
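To give a concrete sense of what a prior bias amounts to in the ideal adapter, the toy sketch below (not one of the analyses reported here; all numbers are invented) shows how an effective prior sample size trades off against new evidence in a simple conjugate updating scheme for a single category mean: the stronger the prior, the more exposure it takes before beliefs move toward an atypical talker.

```{r prior-bias-sketch, eval=FALSE, echo=TRUE}
## Toy illustration of prior biases in belief updating: the posterior estimate
## of a talker's /p/ VOT mean is a weighted average of the prior mean and the
## observed sample mean, weighted by the effective prior sample size (kappa)
## and the number of tokens heard so far.
update_mean <- function(prior_mean, kappa, observations) {
  n <- length(observations)
  (kappa * prior_mean + n * mean(observations)) / (kappa + n)
}

prior_mean <- 60             ## hypothetical prior expectation for /p/ VOT (ms)
new_talker <- rep(90, 20)    ## 20 tokens from a talker with unusually long VOTs

update_mean(prior_mean, kappa = 10,  new_talker)  ## weak prior:   80 ms
update_mean(prior_mean, kappa = 200, new_talker)  ## strong prior: ~63 ms
```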
## A lower bound These results constitute a _lower bound_ on the informativity or utility of different levels of socio-indexical grouping. This is the case above and beyond the limitations imposed by the database that I discussed above (which required subsampling talkers in order to meaningfully compare accuracy across grouping variables). Here, cue distributions for a particular group are modeled as a _single_ normal distribution over observed cue values. In reality, a hierarchical model is more appropriate, since different levels of grouping can nest within each other, or combine orthogonally. For instance, each dialect group is likely better modeled as a _mixture_ of talker-specific distributions, which each exhibit dialect features to a varying degree. This is especially important for _adaptation_ to an unfamiliar talker, since a group-level distribution conflates _within_ and _between_ talker variation, both of which have separate roles to play in belief updating. The approach to group-level modeling that I take here is roughly equivalent to the _posterior predictive_ distribution of a fully hierarchical model, which integrates over lower levels of grouping to provide a single distribution of cues given the group (and phonetic category). This corresponds to the best guess a listener would have _before_ hearing anything from an unfamiliar talker, if the only information they had about that talker was their group membership. As the listener hears more cue values from the talker, the hierarchical nature of grouping structure becomes more important and can provide (in principle) a significant advantage over what I measured here. But modeling this process is quite a bit more complicated and is left for future work. Nevertheless, modeling each category as a single, "flat" distribution per group may well prove a useful approximation, or even a boundedly-rational model of how listeners take advantage of different levels of grouping structure [and similar approaches have been used in, e.g., motor control; @Kording2007]. ## Consequences for perspectives on normalization `r label("normalization-gen-dis")` Normalizing vowel formants with respect to each talker's overall mean and variance substantially reduces the amount of talker variability, and also changes the _structure_ of that variability: gender matters much less, while the effects of dialect become more apparent. Much of the work on vowel normalization treats normalization as a low-level auditory adaptation or habituation process that eliminates the need for active inferences on the listener's part [e.g., @Holt2006c; @Huang2012; @Laing2012; @Nearey1989; for a review see @WeatherholtzInPress]. But low-level sensory adaptation is increasingly recognized as a sort of distributional learning, much like the ideal adapter proposes for speech recognition [for a review of these parallels, see @Kleinschmidt2015b]. I used normalization as a methodological tool, but it would be possible to treat the normalization parameters as another aspect of a talker's particular language model that must be inferred, just like the means and (co-)variances of various individual vowel distributions. That is, it is possible that an ideal adapter would do better by learning talker-/group-specific distributions in a normalized space, and additionally inferring the normalization factors (shift, scaling, etc.) for each talker they encounter. 
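As a purely speculative sketch of what such inference could look like (all values below are hypothetical, and this is not an analysis from the present study): if a listener knows roughly where each vowel falls in normalized space, then a couple of raw-Hz tokens whose categories are known (e.g., from lexical context) are enough to solve for a talker's formant means and SDs directly, rather than waiting to observe the talker's whole vowel space.

```{r normalization-inference-sketch, eval=FALSE, echo=TRUE}
## Speculative sketch: recovering a talker's Lobanov normalization factors
## (per-formant mean and SD) from two category-labeled tokens, given assumed
## category locations in normalized (z-score) space.  Each token satisfies
## x = mu + sd * z, so two labeled tokens determine mu and sd per formant.
z_means <- rbind(i = c(F1 = -1.2, F2 =  1.5),   ## assumed /i/ location (z-scores)
                 a = c(F1 =  1.4, F2 = -0.3))   ## assumed /a/ location (z-scores)

## two category-labeled tokens from an unfamiliar talker (raw Hz, hypothetical)
tokens <- rbind(i = c(F1 = 310, F2 = 2600),
                a = c(F1 = 780, F2 = 1350))

talker_sd   <- (tokens["i", ] - tokens["a", ]) / (z_means["i", ] - z_means["a", ])
talker_mean <- tokens["i", ] - talker_sd * z_means["i", ]

talker_mean  ## estimated overall formant means (Hz)
talker_sd    ## estimated overall formant SDs (Hz)
```

With only a single labeled token the same logic goes through given prior expectations about typical formant SDs, which is part of what makes group-level knowledge of normalized-space distributions potentially useful even before much of a talker's speech has been heard.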
If this parallel is appropriate, then it suggests a more complex interaction between normalization and adaptation/perceptual learning as strategies for coping with talker variability, and makes a number of predictions. For instance, instead of just taking a running average of recent spectral content [@Huang2012] or using extreme vowels as "anchors" [as in many normalization methods; @Flynn2011], normalization could be accomplished much more efficiently by leveraging category-level information (which is often provided by, e.g. lexical context) and knowledge of cue distributions in normalized space: a single token of any vowel (with the category known) can provide enough information to get a reasonably good guess about the talker's normalization factors. This in turn predicts sensitivity to _both_ the un-normalized formant frequencies _and_ the normalized ones. In this case, group-level expectations that are only informative about distributions in normalized space (e.g., dialect for vowels) could nevertheless help with adaptation, even before a talker's entire cue space is known. Furthermore, @Chodroff2017 found that talker variation in VOT could also be largely characterized in terms of overall shifts/scaling of VOT distributions (as evidenced by large, positive correlations across talkers between the means and variances of different categories). This suggests that tracking talker-specific normalization factors may be a generally useful strategy across different phonetic contrasts (or even features). That is, listeners may benefit from factoring talker variation into components that are shared across categories and components that are shared across talkers (as I've examined here). But this parallel remains to be investigated in future work. # Conclusion I have demonstrated methods to quantify the amount and structure of talker variability in phonetic category-specific cue distributions. These methods are derived directly from the ideal adapter framework [@Kleinschmidt2015] which treats speech perception as a process of inference under uncertainty and variability. The results I present here for word-initial stop voicing (cued by VOT) and vowel identity (cued by F1×F2) are a first step towards making quantitative predictions with the ideal adapter about how listeners cope with different aspects of talker variability. They also provide a way of formalizing the salience or relevance of socio-indexical information that exemplar/episodic theories propose is stored alongside acoustic traces [e.g., @Sumner2014]. Finally, together with similar work showing that socio-indexical judgements can be modeled as the same kind of inference under uncertainty [@KleinschmidtInPress2017], this work suggests a framework for unifying psycholinguistic and sociolinguistic perspectives on talker variability. ```{r session-info, results="markup", message=TRUE} if (opts_knit$get("rmarkdown.pandoc.to") != 'latex') { options(width=100) devtools::session_info() } ``` ```{r refs, results="asis", echo=FALSE} if (opts_knit$get("rmarkdown.pandoc.to") != 'latex') { cat("# References") } ```