Neuron. Author manuscript; available in PMC: 2018 Dec 6.
Published in final edited form as: Neuron. 2017 Nov 16;96(5):1168–1177.e5. doi: 10.1016/j.neuron.2017.10.019

Discrete circuits support generalized versus context-specific vocal learning in the songbird

Lucas Y Tian 1,2,3,5,*, Michael S Brainard 1,2,3,4
PMCID: PMC5731642  NIHMSID: NIHMS913384  PMID: 29154128

SUMMARY

Motor skills depend on the reuse of individual gestures in multiple sequential contexts (e.g., a single phoneme in different words). Yet optimal performance requires that a given gesture be modified appropriately depending on the sequence in which it occurs. To investigate the neural architecture underlying such context-dependent modifications, we studied Bengalese finch song, a skill that, like speech, consists of variable sequences of “syllables.” We found that when birds are instructed to modify a syllable in one sequential context, learning generalizes across contexts; however, if unique instruction is provided in different contexts, learning is specific for each context. Using localized inactivation of a cortical-basal ganglia circuit specialized for song, we show this balance between generalization and specificity reflects a hierarchical organization of neural substrates. Primary motor circuitry encodes a “core” syllable representation that contributes to generalization, while top-down input from cortical-basal ganglia circuitry biases this representation to enable context-specific learning.

Keywords: Birdsong, basal ganglia, motor skill learning, vocal learning, reinforcement learning, context-dependent learning, motor sequences, motor adaptation

eTOC blurb

Tian and Brainard investigate context-dependent vocal learning in birdsong. They find that learned syllable modifications that differ across sequential contexts reflect sequence-specific biasing from cortical-basal ganglia circuitry, while modifications that generalize reflect changes to a core syllable representation in motor circuitry.

INTRODUCTION

The efficient learning and execution of motor skills, such as speech and musicianship, depends on the ability to flexibly reorder a discrete set of distinct motor gestures (e.g., phonemes in speech, or finger movements in piano playing) into a larger set of appropriate sequences (Diedrichsen and Kornysheva, 2015). Reuse of a given gesture in multiple sequential contexts supports efficient learning because it permits a generally applicable adaptive modification to that gesture (for instance, during initial learning of a skill or in response to weakening of muscles) to be expressed not only in the sequence in which it was learned, but also in other sequences that incorporate the gesture. However, optimal performance of motor sequences depends not only on the ability to generalize gesture modifications across sequential contexts, but also on the ability to modify a given gesture differentially for the distinct contexts in which it is performed. This capacity is prominent in speech, in which the execution of a given phoneme can be systematically varied depending on the word in which it is embedded. Such natural context-dependent modification of gestures (“coarticulation”) is thought to enable the smooth and rapid performance of speech (Bouchard and Chang, 2014) and skills as diverse as piano playing (Engel et al., 1997), sign language (Jerde et al., 2003), and reaching and grasping (Ansuini et al., 2008; Shah et al., 2013; Sosnik et al., 2004).

The idea that a flexible balance of generalization and specificity underlies the reuse of individual motor gestures is strongly supported by human motor adaptation studies. For instance, if consistent external perturbation of speech or reaching movements is imposed in only one sequential context, subjects exhibit corrective adaptations of the movement that tend to generalize to other contexts (Houde and Jordan, 1998; Howard and Franklin, 2015; Rochet-Capellan et al., 2012). However, such generalization is typically only partial, indicating some natural capacity to limit adaptation specifically to the trained context. Moreover, if different directions of perturbation are imposed in distinct sequential contexts, then subjects can learn multiple sequence-specific modifications to a given gesture, allowing it to be executed appropriately in each context (Howard et al., 2012; Rochet-Capellan and Ostry, 2011; Wainscott et al., 2004). Collectively, these behavioral observations raise the question of what neural architectures might support the efficient reuse of individual gestures across contexts, while also enabling the modulation of a given gesture to optimize its performance depending on context.

Here we investigate the neural mechanisms underlying the balance between generalization and specificity of learning in adult Bengalese finch song. Bengalese finch song, like human speech, consists of learned sequences formed by reordering a discrete set of vocal gestures, termed syllables, so that a given syllable can be expressed in different sequential contexts [Figure 1A; (Doupe and Kuhl, 1999)]. Moreover, experimentally induced sensory errors during the production of a syllable in one sequential context drive adaptation that exhibits partial generalization to the production of the same syllable in other contexts (Hoffmann and Sober, 2014). In our study, we first show that, as for human speech, Bengalese finches can learn to modify individual syllables differentially depending on context. We then use inactivation of the anterior forebrain pathway (AFP), a cortical-basal ganglia circuit dedicated to song, to reveal a hierarchical organization of neural substrates, in which the AFP enables such context-specific learning by biasing a more context-independent syllable representation in downstream motor circuitry. Finally, when birds are instructed to modify syllables in a general manner across contexts, learning gradually becomes encoded in primary motor circuitry, but when instruction is context-specific, learning remains dependent on biasing signals from the AFP.

Figure 1. Learning driven in a single target context partially generalizes to non-target contexts.

(A) Spectrogram of an example song with syllables labeled and transition diagram representing three contexts for the syllable B. Scale bars, 250 ms (horizontal) and 2 kHz (vertical).

(B) Schematic of training in a single context. White noise feedback (“hit”) was provided to renditions of the target syllable B in the target context JAB (grey) when fundamental frequency (FF) of B was below a threshold (red fill in histogram). Feedback was not provided (“escape”) when B was sung in non-target contexts (blue), or when a different syllable (e.g., G) was sung in any context (brown).

(C) Learning over two days of baseline (“WN off”) and four days of training [“WN on” for the target context; arrow direction represents the direction of FF shift that escapes WN feedback] for the experiment depicted in (B). Each datapoint represents a single rendition of the target syllable in the target context (JAB, grey), the target syllable in a non-target context (AAB, blue), or a different syllable (BDG, brown). Renditions within the red shading were below the FF threshold and were thus “hit” with white noise (WN). Mean ± SD FF for each day is overlaid.

(D) Summary across experiments of learning for target syllables in the target context (n=36), the target syllable in non-target contexts (n=48), and different syllables in any context (n=235). Bars represent mean ± SEM learning (n=36 experiments, in 12 birds, targeting a syllable in a single context), defined as mean FF on days three and four of training minus mean FF on the last two baseline days. ***, p < 0.0005, n.s., p > 0.05, signed-rank test; ###, p < 0.0005, rank-sum test.

(E) Left: Generalization as a function of similarity between target and non-target contexts. Contextual similarity was defined by the number of syllables immediately preceding the target syllable that were shared between the target context (“XYZB”) and non-target contexts. Variation in contextual similarity from low (“nnnB”, no syllables shared, n=28), to medium (“nnZB”, 1 syllable shared, n=7), to high (“nYZB”, 2 syllables shared, n=3) accounted for significant variation in the magnitude of generalization (simple linear regression, p < 5 × 10⁻⁵, r² = 0.40, slope = −0.33). Bars represent mean ± SEM. *, ***, p < 0.05, 0.0005, corrected for multiple comparisons using the Tukey-Kramer method on results from ANOVA. Right: histogram of generalization for all cases of the target syllable in non-target contexts (mean, 23.2 ± 5.4%).

See also Figures S1 and S2.

RESULTS

Learning driven in a single target context partially generalizes to non-target contexts

We first evaluated whether birds trained to modify the fundamental frequency (FF), or pitch, of a given syllable in one context would spontaneously apply the learned changes to the same syllable in other contexts. We used a negative reinforcement paradigm that requires birds to gradually shift the FF of a “target” syllable in order to escape white noise (WN) delivered whenever the FF of a rendition of that syllable falls on one side of a set threshold (above or below it, depending on the instructed direction of learning; Andalman and Fee, 2009; Charlesworth et al., 2011, 2012; Tumer and Brainard, 2007; Warren et al., 2011). This instructive WN reinforcement was provided only when the target syllable was sung in a single sequential context (Figure 1B, “target context”); reinforcement was withheld when the target syllable was sung in any other sequence (“non-target contexts”) and for all other types of syllables (“different syllables”; see STAR Methods).
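
The contingency just described can be expressed as a per-rendition rule. The sketch below is illustrative only and is not the software used in these experiments: the function `wn_feedback` and its arguments are hypothetical, songs are assumed to be encoded as strings of syllable labels, and the target context is assumed to be identified by the syllables immediately preceding the target (as for B in JAB, Figure 1B).

```python
def wn_feedback(sequence, index, ff_hz, target_syllable, target_context,
                threshold_hz, escape_direction):
    """Return True if the rendition at `index` is "hit" with white noise.

    Hypothetical helper (an assumption, not the study's code).
    sequence         -- song as a string of syllable labels, e.g. "JABAAB"
    index            -- position in `sequence` of the rendition to evaluate
    ff_hz            -- measured fundamental frequency of that rendition
    target_context   -- preceding syllables plus the target, e.g. "JAB"
    escape_direction -- "up" if renditions at or above threshold escape,
                        "down" if renditions at or below threshold escape
    """
    # Different syllables are never reinforced.
    if sequence[index] != target_syllable:
        return False
    # The target syllable always escapes in non-target contexts.
    start = index - len(target_context) + 1
    if start < 0 or sequence[start:index + 1] != target_context:
        return False
    # In the target context, hit renditions on the wrong side of threshold.
    if escape_direction == "up":
        return ff_hz < threshold_hz
    return ff_hz > threshold_hz
```

With hypothetical values, `wn_feedback("JABAAB", 2, 2000.0, "B", "JAB", 2100.0, "up")` returns True (B in the target context JAB, below threshold), whereas the same FF for B at index 5 (context AAB) returns False, because reinforcement is withheld in non-target contexts.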

Context-dependent reinforcement, delivered in a single target context, drove changes in the FF of the target syllable that generalized to non-target contexts (Figure 1C, example experiment; Figure 1D, summary; signed-rank test of FF change in target context: p < 5 × 10⁻⁷; signed-rank test of FF change in non-target context: p < 0.0005). However, the change in FF in non-target contexts averaged only 23% of the change in the corresponding target contexts, indicating that there was some natural tendency for context specificity in learning (Figure 1C, D, n = 36 experiments, rank-sum test of FF change in target vs. non-target context: p < 5 × 10⁻⁹; Figure 1E right, histogram of percent generalization). In contrast to the partial generalization observed for the target syllable, we did not detect any learning for syllables that were categorically different from the target syllable (Figure 1C, D, signed-rank test: p = 0.34; Figure S1A, Kolmogorov-Smirnov test comparing distributions of learning vs. expected drift of FF: p = 0.39). Hence, consistent with previous observations in both human and songbird studies (Hoffmann and Sober, 2014; Houde and Jordan, 1998; Rochet-Capellan et al., 2012), we found that learning driven in a single context partially generalizes to other contexts.
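
The learning and generalization measures used in Figure 1D and 1E can be sketched as two short functions. The names are hypothetical, and two details are assumptions: the inputs are per-day mean FF values (pooling individual renditions across days would be an equally plausible reading of the definition), and percent generalization is taken as the ratio of non-target to target learning, following the description above.

```python
from statistics import mean


def learning_hz(baseline_daily_ff, training_daily_ff):
    """Learning as defined for Figure 1D (sketch).

    baseline_daily_ff -- per-day mean FF (Hz) on baseline days, in order
    training_daily_ff -- per-day mean FF (Hz) on training days, in order
    Returns mean FF on training days 3-4 minus mean FF on the last two
    baseline days.
    """
    if len(baseline_daily_ff) < 2 or len(training_daily_ff) < 4:
        raise ValueError("need >= 2 baseline days and >= 4 training days")
    return mean(training_daily_ff[2:4]) - mean(baseline_daily_ff[-2:])


def percent_generalization(target_learning_hz, nontarget_learning_hz):
    """Learning in a non-target context as a percentage of target learning."""
    return 100.0 * nontarget_learning_hz / target_learning_hz
```

For example, with invented values in which the target context shifts by 100 Hz and a non-target context by 23 Hz, `percent_generalization(100.0, 23.0)` gives 23.0, comparable to the mean generalization reported above.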

We next investigated factors that could account for differences in the magnitude of generalization across experiments (Figure 1E). For each target syllable, we examined a variety of measures of similarity between the target and non-target contexts that have previously been studied for their potential explanatory value with respect to magnitude of generalization (Caudrelier et al., 2016; Hoffmann and Sober, 2014; Howard and Franklin, 2015; Rochet-Capellan et al., 2012; Shadmehr and Mussa-Ivaldi, 1994). We found that the magnitude of generalization for a given non-target context could be explained, to a large extent, by the similarity between the identities of the syllables in the sequences that made up the target and non-target contexts (“contextual similarity”, Figure 1E). Greater contextual similarity corresponded with greater generalization: only 13% generalization in cases with low contextual similarity, but 40% and 84% generalization for cases with intermediate and high levels of contextual similarity, respectively (Figure 1E, simple linear regression: p < 5 × 10⁻⁵, r² = 0.40). Further regression analyses confirmed that contextual similarity had strong explanatory power, while other measures we examined provided no significant additional power, in accounting for variation in the magnitude of generalization across experiments (Figure S2 reports tests of explanatory value for acoustic distance, rendition-by-rendition correlation, and proximity). This finding parallels observations for human speech and reach adaptation that generalization tends to be greater when gestures are produced in sequential contexts that are more similar to the context in which learning is driven (Caudrelier et al., 2016; Howard and Franklin, 2015).
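
The contextual similarity measure defined in the Figure 1E legend can be computed directly from syllable labels. In this sketch the function name is hypothetical, and we assume that shared syllables are counted contiguously backward from the target, stopping at the first mismatch, which reproduces the “nnnB”, “nnZB”, and “nYZB” categories. The default depth of three preceding syllables is also an assumption.

```python
def contextual_similarity(target_context, nontarget_context, max_depth=3):
    """Number of syllables immediately preceding the target syllable that are
    shared between two contexts (sketch of the Figure 1E measure).

    Contexts are strings ending in the target syllable, e.g. "XYZB".
    Counting starts at the syllable just before the target and stops at the
    first mismatch, so "XYZB" vs "nYZB" gives 2.
    """
    if target_context[-1] != nontarget_context[-1]:
        raise ValueError("contexts must end in the same target syllable")
    shared = 0
    for depth in range(1, max_depth + 1):
        # Stop if either context has no syllable at this depth.
        if depth >= len(target_context) or depth >= len(nontarget_context):
            break
        if target_context[-1 - depth] != nontarget_context[-1 - depth]:
            break
        shared += 1
    return shared
```

Applied to the example contexts of Figure 2A, JAB and AAB share one immediately preceding syllable (A), so they would fall in the medium-similarity category under this reading.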

Independent context-specific learning for the same syllable in two contexts

To determine whether partial generalization to non-target contexts reflects an inherently limited ability to express separate learning in different contexts, we asked whether we could override the natural pattern of generalization by instructing opposing modifications of a syllable in two contexts. For each learning trajectory, we first drove learning in only one target context (“single context phase”), which, as described above, resulted in partial generalization of learning to other contexts (Figure 2A, example experiment; Figure 2B, summary). We then initiated reinforcement in a second context, with the FF contingency opposite to that in the first context, while maintaining the contingency in the first context (Figure 2A, B, “dual context phase”). During the dual context phase, FF in the second context changed in the direction opposing initial learning by an average of 109.8 ± 19.1 Hz (Figure 2C, n = 13 experiments, signed-rank test: p < 0.0005). By the end of the dual context phase, FF in the second context had shifted downward past its original baseline (Figure 2B, signed-rank test: p < 0.05), and this shift was even more pronounced in the subset of experiments for which training in the dual context phase was extended past five days (Figure S3A). In contrast, learning that had occurred in the first context was maintained with no significant change (Figure 2C, n = 13, signed-rank test: p = 0.31; we also did not detect any significant changes to FF of different type syllables, Figure S3B). Correspondingly, the separation between FF of the target syllable in the two contexts increased from 114.4 ± 18.8 Hz at the end of the single context phase to 211.0 ± 30.0 Hz at the end of the dual context phase (p < 0.0005, n = 13, signed-rank test). 
These results demonstrate that Bengalese finches have a capacity for independent, context-specific modifications of a given syllable, mirroring findings for human speech and reach adaptation (Howard et al., 2012; Rochet-Capellan and Ostry, 2011).
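
The dual-context measures reported above reduce to differences of mean FF. In the sketch below, the function names are hypothetical, the inputs are assumed to be per-day mean FF values for one context, and the example numbers are invented for illustration.

```python
from statistics import mean


def dual_phase_learning_hz(single_phase_daily_ff, dual_phase_daily_ff):
    """Change in FF on days 4-5 of the dual context phase relative to the
    mean of the last two days of the single context phase (as in Figure 2C).

    Both arguments are per-day mean FF values (Hz) for one context, in order.
    """
    return mean(dual_phase_daily_ff[3:5]) - mean(single_phase_daily_ff[-2:])


def context_separation_hz(ff_context1_hz, ff_context2_hz):
    """Absolute FF difference between the target syllable in two contexts."""
    return abs(ff_context1_hz - ff_context2_hz)
```

For instance, `dual_phase_learning_hz([2000.0, 2040.0, 2030.0], [2020.0, 2000.0, 1980.0, 1950.0, 1930.0])` returns -95.0, and the separation from a first context held at 2150 Hz grows from 115.0 to 210.0 Hz. Applied to the group means above, the separation increased by 211.0 − 114.4 = 96.6 Hz.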

Figure 2. Independent learning for the same syllable in two contexts.

(A) Example experiment. In the single context phase, the FF of B was driven up in the first context, JAB. In the dual context phase, the FF of the same syllable, B, was driven down in the second context, AAB, while the reinforcement contingency in the first context was maintained. Dots indicate the FF of single renditions, with overlaid thick lines representing mean ± SD.

(B) Across-experiment mean ± SEM learning in the first (top) and second (bottom) contexts. Experiments were aligned to the transition from the single context phase to the dual context phase (n = 13 experiments, each including a single and dual context phase, 9 birds; *, **, ***, p < 0.05, 0.005, 0.0005, signed-rank test vs. the last single context day; #, p < 0.05 signed-rank test vs. 0 Hz).

(C) Learning during the dual context phase for the first and second contexts. Learning was measured as the change in FF, on days 4–5 of the dual context phase, relative to FF on the last 2 days of the single context phase (***, p < 0.0005, signed-rank test; ###, p < 0.0005, rank-sum test).

See also Figure S3.

A cortical-basal ganglia circuit, the anterior forebrain pathway, adaptively biases motor output in a context-specific manner

We next investigated the neural mechanisms underlying generalization and specificity in context-dependent learning. To do so, we took advantage of previous work that has elucidated circuitry for production and plasticity of song. The song motor pathway (Figure 3A) is required for the moment-by-moment production of learned song (Leonardo and Fee, 2005; Nottebohm et al., 1976; Simpson and Vicario, 1990; Vu et al., 1994). In contrast, the anterior forebrain pathway (AFP, Figure 3A), a basal ganglia-thalamo-cortical circuit specialized for song, is not required for the normal production of adult song, but is required both for developmental song learning and for modifications to adult song (Andalman and Fee, 2009; Bottjer et al., 1984; Brainard and Doupe, 2000; Warren et al., 2011). Using a similar WN reinforcement paradigm, previous work (Andalman and Fee, 2009; Warren et al., 2011) has shown that during initial stages of learning, inactivation of the AFP causes a reversion of FF towards baseline values (Figure 3B, “Early”), but that over a period of maintained learning, the effects of inactivating the AFP gradually diminish (Figure 3B, “Late”). These findings support a model in which WN-driven changes to the FF of targeted syllables are initially directed by biasing signals from the AFP acting upon the downstream motor pathway (Figure 3B, “AFP biasing”; thick green arrow from AFP to RA) but that this learning is gradually transferred to the motor pathway in a process of “systems consolidation” (Figure 3B, “Consolidated to MP”, filled green circle in RA). If these same mechanisms contribute to all adaptive modifications of song, then we would expect in our experiments that the early expression of learning in both the target and non-target contexts would rely on biasing signals from the AFP.

Figure 3. Neural circuits that contribute to song production and learning.

(A) Top: song system nuclei schematized according to anatomical organization. Blue, green and red subdivisions refer to “cortical” (pallial), basal ganglia, and thalamic subdivisions, respectively. Bottom: the motor pathway (red) consists of the cortical nuclei HVC (used as a proper name) and RA (robust nucleus of the arcopallium). The anterior forebrain pathway (AFP, tan) consists of the striatopallidal nucleus Area X (used as a proper name), the thalamic nucleus DLM (medial dorsolateral nucleus of thalamus), and the frontal cortical nucleus LMAN (lateral magnocellular nucleus of the anterior nidopallium).

(B) Schematic based on previous work of the contributions of the AFP and motor pathway (MP) to the expression of WN-driven learning for a syllable sung in stereotyped sequences (i.e., only ever sung in one context) (Andalman and Fee, 2009; Warren et al., 2011). FF is driven from baseline and then maintained at a fixed value, while LMAN is periodically inactivated by muscimol infusion. The amount of total learning (black lines and bars, “PBS”) that persists during LMAN inactivation (red lines and bars, “MUSC”) is construed as the motor pathway (MP) contribution to the expression of learning (red arrow), while the difference between total learning and the MP contribution is construed as the AFP contribution to the expression of learning (gold arrow). During “baseline”, LMAN inactivation has no consistent effect on FF, indicating that well-learned song structure is largely encoded in the downstream motor pathway. During “early” learning, LMAN inactivation results in a reversion of learning back towards baseline, indicating that the expression of recent learning depends on biasing signals from the AFP acting on the downstream motor pathway (“AFP biasing”; thick green arrow from AFP to RA). During a “late” period of maintained learning, LMAN inactivation no longer causes a reversion of learning, indicating that learning has been transferred to the motor pathway (“Consolidated to MP”, filled green circle in RA).

To assess the extent to which AFP bias contributes to the expression of learning in target and non-target contexts, we used the previously established approach of AFP inactivation. We first drove learning in a single target context and then transiently blocked AFP output by infusing the GABAA receptor agonist muscimol into LMAN. As previously observed for learning in a single context (Andalman and Fee, 2009; Warren et al., 2011), we found that blocking AFP output caused a strong and consistent reduction in the magnitude of learning expressed in the target context (Figure 4A top, example experiment; Figure 4B top, summary, n = 13 experiments, 48% reversion from a mean of 138.8 ± 11.3 Hz during PBS to 72.2 ± 8.1 Hz following LMAN inactivation, signed-rank test: p < 0.0005). This reversion in the expression of learning indicates that the AFP was providing a bias of ~67 Hz in the target context, in the adaptive direction (i.e., the direction that escapes WN). In striking contrast, although there was significant generalization of learning to non-target contexts, the expression of that generalized learning did not depend on the AFP (Figure 4A bottom, example experiment; Figure 4B bottom, summary, n = 13 experiments, 14% shift, from 42.7 ± 10.3 Hz during PBS to 36.7 ± 12.2 Hz following LMAN inactivation, signed-rank test: p = 0.50). A direct comparison of reversion in target and non-target contexts in the same experiments confirmed that AFP bias was highly specific to the target context (Figure 4D, signed-rank test: p < 0.0005). Moreover, this specificity did not simply reflect less learning in non-target contexts, as the differential effect of LMAN inactivation on expression of learning in target vs. non-target contexts persisted both in analysis of experiments in which there was a large amount of generalization (as in the example experiment of Figure 4A and summary data in Figure S4C) and in analysis of the ratio of effects of inactivation on expression of learning in target and non-target contexts (Figure S4D). LMAN inactivation also did not have a significant effect on FF for different-type syllables (mean learning expressed during PBS infusion, 5.5 ± 2.9 Hz, and during muscimol infusion, 3.88 ± 2.6 Hz, signed-rank test: p = 0.77). Thus, the AFP contributes to the expression of learning by providing a motor bias that is highly specific for the target versus non-target context.
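
The logic of Figure 3B (learning that survives LMAN inactivation is attributed to the motor pathway, and the remainder to AFP bias) amounts to a subtraction. The helper below is a hypothetical sketch of that bookkeeping; applying it to the group means reported above, rather than to per-experiment values, is a simplification.

```python
def split_learning(learning_pbs_hz, learning_musc_hz):
    """Partition expressed learning into motor-pathway (MP) and AFP parts.

    learning_pbs_hz  -- learning expressed during vehicle (PBS) infusion
    learning_musc_hz -- learning still expressed during LMAN inactivation
    Both are signed so that the adaptive direction is positive.
    Returns (mp_contribution_hz, afp_bias_hz, percent_reversion).
    """
    mp_contribution = learning_musc_hz
    afp_bias = learning_pbs_hz - learning_musc_hz
    percent_reversion = 100.0 * afp_bias / learning_pbs_hz
    return mp_contribution, afp_bias, percent_reversion
```

Using the group means above, `split_learning(138.8, 72.2)` gives an AFP bias of about 66.6 Hz (about 48% reversion) in the target context, and `split_learning(42.7, 36.7)` gives about 6.0 Hz (about 14%) in non-target contexts.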

Figure 4. The AFP adaptively biases motor output in a context-specific manner.

(A) Example experiment in which FF of B was driven up in the target context (BCCB, top, grey), while reinforcement was withheld in the non-target context (DCCB, bottom, blue). Black lines represent daily mean ± SEM FF during infusion of vehicle (PBS) into LMAN. Red squares represent daily mean ± SEM FF during infusion of muscimol. Inset: FF of individual renditions for a single inactivation day; lines represent mean FF, and blue scale bar represents 2 h.

(B) Muscimol infusion caused significant reversion of learning in the target context (top, ***, p < 0.0005, signed-rank test), but not in non-target contexts (bottom, p = 0.50, signed-rank test) on days 4–10 of training (n = 13 experiments in 7 birds). Experimental data (bars) are overlaid on a schematic of the learning trajectory (dashed lines).

(C) A model for how the AFP and motor pathway contribute to learning in target and non-target contexts. In this model, the motor pathway has a “core” representation of the target syllable that is largely overlapping between contexts (schematized by overlapping circles in RA), while the AFP has context-specific representations of the appropriate modifications of the syllable for each context (schematized by non-overlapping circles in the AFP). In the target context (top, “xB”) the AFP provides a strong biasing signal to the target syllable B (thick green arrow from AFP to RA), and over time this bias begins to drive a consolidation of changes in the motor pathway representation of the target syllable (light green circle in RA reflecting partial consolidation of changes to the MP representation of the target syllable). In the non-target context (bottom, “yB”), there is no AFP bias. However, because the motor pathway representation of the target syllable overlaps substantially between contexts, the gradual modification of the MP representation in the target context contributes to the generalization of learning in the non-target context. As a result, the expression of learning in the target context depends on contributions from both the MP and AFP (red and gold bar, top), but learning that generalizes to the non-target context depends only on contributions from the MP (red bar, bottom).

(D) Mean ± SEM contribution of the AFP to expression of learning (AFP bias; n=13 experiments, identical to (B)). AFP bias in the target context (p < 0.0005) but not in the non-target context (p = 0.50) was significantly different from 0 (signed-rank test). ***, p < 0.0005, signed-rank test comparing target and non-target contexts.

See also Figure S4.

These results raise the question of how generalization of learning to non-target contexts arises. We hypothesized that while the AFP provides biasing signals that are context specific, the motor pathway contains a more overlapping representation of the target syllable that is shared across contexts. According to this model (Figure 4C), generalization arises because AFP biasing signals specific to the target context drive a gradual modification of the overlapping motor pathway representation through the process of consolidation. Our results thus suggest a hierarchical organization, in which the AFP provides context-specific biasing signals that modulate and gradually modify a more context-independent, “core” syllable representation in downstream motor circuitry.
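
To make the proposed hierarchy concrete, the toy simulation below implements one possible reading of the Figure 4C model. It is an illustrative assumption, not a model fitted or proposed in this form by the study. We assume (i) that AFP bias in an instructed context supplies whatever part of the instructed FF shift the motor pathway does not yet encode, (ii) that uninstructed contexts receive no AFP bias, and (iii) that each day a fraction `rate` of the AFP bias is consolidated into the motor pathway, reaching other contexts in proportion to a hypothetical `overlap` parameter that stands in for the shared “core” syllable representation.

```python
def simulate_single_context(goal_hz, overlap, rate, n_days):
    """Toy two-context model: context 0 is the target, context 1 is non-target.

    Returns one record per day with the learning expressed in each context,
    the AFP bias in each context, and the motor-pathway (MP) contribution.
    """
    mp = [0.0, 0.0]  # consolidated MP change expressed in each context (Hz)
    days = []
    for _ in range(n_days):
        # AFP bias exists only where instruction is given (the target context)
        # and supplies whatever the MP does not yet encode.
        afp = [goal_hz - mp[0], 0.0]
        expressed = [mp[0] + afp[0], mp[1] + afp[1]]
        days.append({"expressed": expressed, "afp_bias": afp, "mp": list(mp)})
        # Consolidation: target-context bias modifies the shared core
        # representation, which the non-target context sees via `overlap`.
        drive = rate * afp[0]
        mp[0] += drive
        mp[1] += overlap * drive
    return days
```

With `simulate_single_context(100.0, 0.25, 0.3, 30)`, learning in the non-target context grows toward 25% of target learning and never depends on AFP bias, while AFP bias in the target context shrinks as learning is consolidated. This qualitatively mirrors Figures 4B and 4D; the parameter values are arbitrary.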

Conflicting AFP bias interferes with consolidation for context-specific learning

Our model makes a prediction about the nature of adaptive modifications that are transferred to, or consolidated in, the motor pathway during learning: for context-specific learning there should be reduced consolidation, because conflicting, context-specific biasing signals would exert interfering influences on the overlapping syllable representation in the motor pathway. Under an alternative model in which the motor pathway contains separate, non-overlapping representations of a given syllable in each context, consolidation of learning should instead proceed equally for context-independent and context-specific learning. To test these predictions, we carried out experiments in which birds were instructed either to shift FF in the same direction in all contexts (“Congruent training”) or to shift FF in opposite directions in different contexts (“Incongruent training”). We predicted that for Congruent training, the AFP would generate similarly directed biasing signals in each of the two contexts during the early phase of learning (Figure 5Ai, “Early”) that would act coherently to drive a strong transfer of context-independent changes to the downstream motor pathway (Figure 5Ai, “Late”). In contrast, for Incongruent training, the AFP would generate oppositely directed biasing signals in the two contexts (Figure 5Aii, “Early”) that would antagonize each other in converging onto a shared downstream motor pathway representation of the syllable, and thereby interfere with transfer of learning (Figure 5Aii, “Late”).
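
The prediction can be stated compactly for a deliberately simple linear model, which is our illustrative assumption rather than a model fitted in this study. For two contexts c = 1, 2, let g_c be the instructed FF shift, m_c the change already encoded in the motor pathway, and b_c the AFP bias, assumed to supply the remainder. Each day, the motor pathway is assumed to consolidate a fraction η of the bias in its own context plus a fraction ηα of the bias in the other context, where α (0 ≤ α ≤ 1) is a hypothetical overlap between the two contexts' motor representations and η(1 + α) < 1.

```latex
\begin{aligned}
b_c &= g_c - m_c, \qquad c \in \{1, 2\},\\
\Delta m_1 &= \eta\,(b_1 + \alpha\, b_2), \qquad
\Delta m_2 = \eta\,(b_2 + \alpha\, b_1),\\
\Delta b_c &= -\Delta m_c \quad \text{(instruction } g_c \text{ held fixed)},\\
s &= b_1 + b_2: \qquad \Delta s = -\eta\,(1 + \alpha)\, s,\\
d &= b_1 - b_2: \qquad \Delta d = -\eta\,(1 - \alpha)\, d.
\end{aligned}
```

Starting from m_c = 0, Congruent training (g_2 = g_1) has d = 0, so only the sum s is present, and it decays by a factor 1 − η(1 + α) per day: AFP bias is removed quickly and learning is consolidated. Incongruent training (g_2 = −g_1) has s = 0, so only d is present, and it decays by a factor 1 − η(1 − α) per day; as the overlap α approaches 1, this factor approaches 1, consolidation stalls, and expression of learning remains dependent on AFP bias. With non-overlapping representations (α = 0), both patterns consolidate at the same rate, as in the alternative model considered above.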

Figure 5. Conflicting AFP bias interferes with consolidation for context-specific learning.

(Ai, Aii) Model predictions for Congruent and Incongruent training. For Congruent training (Ai), we predicted that during early learning there would be similarly directed AFP bias in both context 1 (xB) and context 2 (yB) for the target syllable B (“Early”, thick green arrows from both contexts 1 and 2). These biasing signals would act synergistically to drive strong consolidation in the overlapping downstream motor pathway representation of the syllable (“Late”, dark green circles in RA), so that expression of learning would become independent of the AFP. For Incongruent training (Aii), we predicted that during early learning there would be oppositely directed AFP bias across contexts (“Early”, thick green arrow in context 1 biasing FF upwards, and thick purple arrow in context 2 biasing FF downwards). These biasing signals would drive opposing modifications to the overlapping motor pathway representation and impair consolidation (“Late”, light circles in RA), so that expression of learning in both contexts would remain dependent on context-specific AFP biasing signals.

(Bi, Bii) Summary data for Congruent and Incongruent training (n = 5 Congruent experiments and 7 Incongruent experiments in 6 birds). Bar plots showing mean ± SEM effects of LMAN inactivation at early and late time points of maintained learning are overlaid on lines schematizing trajectories of learning for Congruent (Bi) and Incongruent (Bii) experiments (see Methods). Early and late periods are defined relative to a maintained learning period (see Methods).

(C) AFP bias in the early period (days 1–4) of maintained learning was highly context-specific and appropriate for each type of training (sample sizes as in (B)). Bars represent mean (± SEM) AFP bias, measured as the amount by which learning reverted towards baseline while LMAN was inactivated. *, p < 0.05, signed-rank test; ##, p < 0.005, rank-sum test. AFP bias measured in a separate set of experiments driving learning in only a single target context is reproduced from Figure 4D and plotted here for comparison (n = 13 experiments in 7 birds, ***, p < 0.0005, signed-rank test).

(D) Consolidation in the late period (days 5–6) of maintained learning was strong for Congruent training, but reduced for Incongruent training. Bars represent the mean (± SEM) percentage of learning that was still expressed when AFP output was blocked (and was thus dependent on the motor pathway and not on the AFP). Dashed line represents magnitude of consolidation from a previous study driving learning for syllables that were only sung in stereotyped sequences (Warren et al., 2011). Data are shown for learning in the first context, because the magnitude and trajectory of learning in the first context were matched between training types (see Methods); however, the significance of these results was unaffected if we used both contexts (Figure S5A). *, p < 0.05, signed-rank test vs. 100%; ##, p < 0.005, rank-sum test.

See also Figure S5.

We first assessed whether Congruent versus Incongruent training would indeed generate coherent versus antagonistic AFP biasing signals. We measured AFP bias during the first four days of maintained learning (“Early” in Figures 5A, B), when previous work has shown that the expression of learning depends substantially on AFP bias (Warren et al., 2011). Targeted LMAN inactivation during this early period revealed that during Congruent training, AFP bias was in the same direction in each context, while during Incongruent training, AFP bias was in opposite directions in different contexts (Figure 5C). These results further demonstrate that the presence and direction of AFP bias accurately reflect the presence and direction of context-specific instruction, even in an extreme case in which learning is oppositely directed in distinct contexts. They additionally establish an experimental framework for determining whether conflicting AFP bias during Incongruent training interferes with the transfer of learning to the motor pathway.

We assessed consolidation for both Congruent and Incongruent training during the late period of maintained learning (days 5–6), when previous work indicates that the expression of learning in a single context becomes largely independent of the AFP (Warren et al., 2011). For Congruent training, consolidation was not significantly different from 100% (Figure 5D, 84.9 ± 5.9%, mean ± SEM, signed-rank test: p = 0.12), and was indistinguishable from consolidation previously reported for syllables that are sung in only a single context [Figure 5D, 84.9 ± 7.1% in (Warren et al., 2011)]. In contrast, for Incongruent training, consolidation was both significantly less than 100% and significantly reduced relative to that for Congruent training (Figure 5D, 44.1 ± 12.0%, p < 0.05, signed-rank test vs. 100%; p < 0.005, rank-sum test vs. Congruent). These data indicate that under conditions in which generalization is appropriate, learning rapidly becomes transferred to the motor pathway. In contrast, under conditions in which context-specific modifications to a gesture are required, transfer to the motor pathway is impaired, and the expression of learning continues to require biasing signals from the AFP.
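
Consolidation, as plotted in Figure 5D, is the percentage of learning that survives LMAN inactivation, summarized across experiments as mean ± SEM. The helpers below are a hypothetical sketch with illustrative names. The per-experiment values in the example are invented and are not data from this study.

```python
from math import sqrt
from statistics import mean, stdev


def percent_consolidation(learning_pbs_hz, learning_musc_hz):
    """Percentage of expressed learning that persists during LMAN inactivation."""
    return 100.0 * learning_musc_hz / learning_pbs_hz


def mean_sem(values):
    """Mean and standard error of the mean (sample SD divided by sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))
```

For three hypothetical experiments with 80%, 90%, and 85% consolidation, `mean_sem` returns a mean of 85.0% and an SEM of about 2.89%. Group comparisons such as the signed-rank test against 100% would then be run on the per-experiment percentages, not on the summary.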

DISCUSSION

The reuse of individual gestures in multiple motor sequences allows efficient generalization of adaptive modifications across contexts (Diedrichsen and Kornysheva, 2015). At the same time, optimal performance requires that a given gesture be differentially modified depending on the specific context in which it is produced. Using Bengalese finch song as a model system, we demonstrate that the balance between generalization and specificity in the deployment of motor gestures arises from a hierarchical organization within the nervous system; pharmacological inactivation of the anterior forebrain pathway (AFP) revealed that biasing signals from the AFP that are highly specific and appropriate for each context modulate a more context-independent representation of syllable structure in the downstream primary motor pathway (Figures 4C, 5A). When similar modifications to a syllable were instructed across contexts, generalized learning was gradually transferred to the motor pathway, but when distinct modifications were instructed across contexts, this transfer of learning was impaired and the context-specific expression of learning remained highly dependent on the AFP (Figure 5D). These findings indicate that the primary motor pathway encodes a relatively context-independent or “core” representation of a given syllable, while frontal cortical-basal ganglia circuitry provides top-down biasing signals that enable appropriate, context-specific modulation and updating of this core representation.

Our finding that the AFP injects a context-specific biasing signal into the motor pathway indicates a role for the AFP in integrating contextual signals (reflecting the current syllable and sequence) with instructive signals (reflecting the appropriate FF for each context) to enable context-dependent vocal learning (Figures 4C, 5A). Signals encoding sequential context may be conveyed from neurons in the cortical nucleus HVC that send an efference copy of premotor commands to the basal ganglia nucleus Area X (HVCX neurons) (Fee and Goldberg, 2011; Fujimoto et al., 2011; Mooney, 2014); the firing patterns of these neurons reflect not only the identity of the syllable currently being produced, but also that of preceding syllables (Fujimoto et al., 2011). Signals encoding rendition-by-rendition variation in the FF of targeted syllables are potentially generated within Area X (Woolley et al., 2014) or relayed to Area X by inputs from the motor pathway (Charlesworth et al., 2012) or LMAN (Fee and Goldberg, 2011; Kao et al., 2005). Signals encoding outcomes (whether or not a given rendition escapes WN) plausibly derive from rich neuromodulatory inputs to the AFP, including from midbrain dopaminergic neurons (Gadagkar et al., 2016; Hoffmann et al., 2016). The association between contextual signals (HVCX activity) and appropriate motor-biasing AFP activity could then be mediated by plasticity at cortical-striatal (HVCX-X) synapses (Fee and Goldberg, 2011), as has been implicated for decision-making tasks in mammals (Xiong et al., 2015).

Our finding that generalization of learning persists following pharmacological inactivation of the AFP (Figure 4) indicates that this generalization largely depends on the modification of a core syllable representation in the downstream motor pathway. The presence of such a core representation is consistent with recordings in the motor pathway nucleus RA showing that similar populations of neurons are active during the production of a given syllable regardless of the sequence in which it is sung (Leonardo and Fee, 2005; Wohlgemuth et al., 2010). We hypothesize that this motor pathway representation is gradually modified in response to biasing signals from the AFP in a process of systems consolidation (Andalman and Fee, 2009; Fee and Goldberg, 2011; Warren et al., 2011). To the extent that the overlapping neural elements (such as synapses from HVC afferents onto RA neurons) are active during the production of a syllable in multiple contexts, modification of those shared elements, driven by AFP bias in one context, would naturally contribute to circuit changes that generalize to the production of the syllable in other contexts.

Consistent with this model, we found that the degree of transfer of learning to the motor pathway depends on the extent to which biasing signals from the AFP are coherent across contexts (Figure 5). Our results indicate that when it is optimal to generalize modifications across contexts - for example, during initial learning or in response to weakening of musculature or other perturbations that affect control of a syllable regardless of context - consistent biasing signals from the AFP will promote an updating of the core MP representation. In contrast, when context-specificity is appropriate - for example, to modify central commands in a manner that accounts for context-dependent dynamics of the musculoskeletal system (Bouchard and Chang, 2014; Ostry et al., 1996; Schmidt and Wild, 2014; Wohlgemuth et al., 2010) - conflicting biasing signals will interfere with consolidation, and learning will continue to rely on moment-by-moment modulation by the AFP. Such a dependence of consolidation on the coherence of AFP bias may therefore be a natural way for the nervous system to transfer modifications that are generally appropriate to primary motor circuitry, while reserving frontal, “executive” circuitry for dynamically adjusting performance in response to context-specific requirements (Duan et al., 2015; Hilario et al., 2012; Kim and Hikosaka, 2013; Miller and Cohen, 2001; Narayanan and Laubach, 2006).

More broadly, a similar balance between generalization and specificity of learning in human motor skill adaptation (Houde and Jordan, 1998; Howard and Franklin, 2015; Howard et al., 2012; Rochet-Capellan and Ostry, 2011; Rochet-Capellan et al., 2012) may also reflect separate contributions of primary motor representations and flexible top-down bias from frontal cortical-basal ganglia circuits. Indeed, neural signals indicating sequential context are present in mammalian cortical-basal ganglia circuitry (Dudman and Krakauer, 2016; Mello et al., 2015; Mushiake and Strick, 1995; Tanji and Shima, 1994; Turner and Desmurget, 2010), and the contributions of basal ganglia circuitry to motor production may include a role in flexible fine time-scale modulation of movement kinematics (Dudman and Krakauer, 2016; Rueda-Orozco and Robbe, 2015; Turner and Desmurget, 2010). Hence, the critical contributions of frontal cortical-basal ganglia circuits to sequence-dependent vocal learning in the songbird may reflect a general role of these circuits in integrating contextual cues to enable adaptive, context-dependent learning and execution of motor skills.

STAR Methods

CONTACT FOR REAGENT AND RESOURCE SHARING

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Lucas Tian (lucas.tian@ucsf.edu).

EXPERIMENTAL MODEL AND SUBJECT DETAILS

Animal models

We used 12 adult (range: 141 to 671 days old at start of experiment) male Bengalese finches (Lonchura striata domestica) that were bred in our colony and housed with their parents until at least 60 days of age. During experiments, birds were housed individually in sound-attenuating chambers (Acoustic Systems) on a 14h/10h light/dark cycle with food and water provided ad libitum. All experiments were performed on undirected song (i.e., with no female present). All procedures were in accordance with protocols approved by the University of California, San Francisco Institutional Animal Care and Use Committee.

METHOD DETAILS

Song recording and computerized training paradigm

We used a custom-written Labview program (National Instruments) to record song and deliver white noise feedback during training (Charlesworth et al., 2011, 2012; Tumer and Brainard, 2007; Warren et al., 2011). Briefly, song was recorded with an omnidirectional lavalier microphone (Countryman), bandpass filtered between 75 Hz and 10 kHz, and digitized at 32 kHz. To detect a specific segment of a specific syllable for targeted reinforcement, the spectrum of each successive 8 ms segment of ongoing song was tested for a match to a preconstructed spectral template (based on the Euclidean distance between those spectra). Upon a match, the fundamental frequency (FF) of that segment was compared to a preset FF threshold. To drive upward shifts in FF, feedback was delivered with <1 ms latency if FF was below threshold; to drive downward shifts, feedback was delivered only if FF was above threshold. Feedback was a 40–60 ms burst of white noise (WN) at 90–95 dB(A). To provide context-dependent reinforcement, we modified the training paradigm so that delivery of WN was contingent not only on the FF of the target syllable, but also on the identity of the syllables preceding the target syllable (the “sequential context,” described further below).
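The context-dependent contingency amounts to a per-rendition decision rule. The sketch below is a minimal Python illustration of that rule, not the authors' LabVIEW implementation: it assumes spectral template matching has already identified the target syllable and that its FF has been measured, and the names `ContextRule` and `should_deliver_wn` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ContextRule:
    """One reinforcement contingency (hypothetical structure, for illustration)."""
    context: Tuple[str, ...]  # syllables that must directly precede the target
    threshold_hz: float       # FF threshold applied in this context
    drive_up: bool            # True: hit renditions below threshold, pushing FF up


def should_deliver_wn(preceding: Sequence[str], ff_hz: float,
                      rules: Sequence[ContextRule]) -> bool:
    """Decide whether a template-matched target rendition receives white noise.

    `preceding` lists the syllable labels sung just before the target rendition.
    Renditions in contexts without a rule are never reinforced.
    """
    for rule in rules:
        n = len(rule.context)
        if n == 0 or tuple(preceding[-n:]) != rule.context:
            continue  # not the sequential context this rule targets
        if rule.drive_up:
            return ff_hz < rule.threshold_hz  # low FF is hit; high FF escapes
        return ff_hz > rule.threshold_hz      # high FF is hit; low FF escapes
    return False


# Dual-context example: FF pushed up after "AB" and down after "GB".
rules = [ContextRule(("A", "B"), 2000.0, True),
         ContextRule(("G", "B"), 2000.0, False)]
print(should_deliver_wn(["A", "B"], 1950.0, rules))  # True
```

Keeping one rule per context means the same function covers single context training (one rule) and the opposite-direction dual context phase (two rules with different `drive_up` values).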

Determining sequential context for each rendition of a given syllable

Syllables were classified manually by visual inspection of spectrograms. Similar to a previous study in Bengalese finches (Wohlgemuth et al., 2010), for a given bird we detected cases in which the same syllable type was sung across different sequential contexts using a method based on the Acoustic Distance (a measure of difference in acoustic structure) between syllables in a pair (see “Multiple regression analysis of generalization” for calculation of Acoustic Distance). The distribution of Acoustic Distances across syllable pairs was bimodal. The Acoustic Distance at which the distribution had a local minimum between these two modes was used as a classification threshold; hence, any syllable pairs with Acoustic Distance within the first mode of the distribution were classified as same-type syllables. The result of this method is similar to that from subjective hand labeling (Wohlgemuth et al., 2010). We then defined “motifs” as stereotyped sequences of syllables that were reused across song bouts and were preceded, and sometimes followed, by introductory notes or song termination. The sequential context for each rendition of a given syllable was then defined by the directly preceding syllables in that rendition’s motif (including introductory notes preceding the motif). For example, if a bird had a repertoire consisting of two motifs, AABHCD and AHCGDC, then across all song bouts, that bird could sing C in three potential contexts (i.e., following either BH, AH, or GD). In cases of “repeated” syllables (i.e., a syllable repeated successively >3 times in the same motif, such as B in ACBBBB; n = 6), we only included the first rendition of the repeat to avoid over-representing the syllable.
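Using the example repertoire above, context assignment can be sketched as a lookup of the syllables directly preceding each rendition. This hypothetical helper (`contexts_for_syllable`) assumes motifs are already labeled as strings with one character per syllable and uses two preceding syllables (matching the 0 to 2 scale of contextual similarity used later); for simplicity it omits introductory notes and the repeat rule.

```python
def contexts_for_syllable(motifs, syllable, n_preceding=2):
    """Return the distinct sequential contexts in which `syllable` is sung.

    motifs: labeled motifs as strings, one character per syllable (hypothetical format).
    A context is the run of `n_preceding` syllables directly before each rendition;
    renditions too close to the motif start (where introductory notes would
    supply the context) are skipped in this simplified sketch.
    """
    contexts = set()
    for motif in motifs:
        for i, label in enumerate(motif):
            if label == syllable and i >= n_preceding:
                contexts.add(motif[i - n_preceding:i])
    return contexts


# The repertoire from the text: C occurs after BH, AH, and GD.
print(sorted(contexts_for_syllable(["AABHCD", "AHCGDC"], "C")))  # ['AH', 'BH', 'GD']
```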

Single context training

Birds were trained to shift the FF of a syllable in one sequential context, and no reinforcement was provided in any other context. We performed a total of 36 single context experiments in 12 birds. In 29/36 cases we targeted a unique syllable/context combination (in 11 cases we targeted a syllable that had previously been targeted, but in a different context [Figure S1B]; in six cases we targeted a previously targeted syllable/context combination, but drove learning in the opposite direction). In 22 experiments we drove FF up; in the other 14 experiments we drove FF down. For presentation of results, the direction of “learning” is defined as the direction that escaped WN.

At the onset of training, the FF threshold for reinforcement was set at the 70th percentile of FF determined from the last baseline day, so that ~70% of renditions were “hits” and ~30% were “escapes.” WN training began when lights turned on on the morning of the first training day. Learning was quantified as the mean FF (see “FF calculation” below) across renditions on days 3 and 4 of training minus the mean FF across renditions on the last two baseline days. Because successful learning results in a reduced hit rate, the FF threshold was adjusted once or twice a day over 2–4 days to maintain a hit rate of ~70% (in 6 cases until day 2; in 9 cases until day 3; in 21 cases until day 4). In a small number of experiments, the bird’s singing rate dropped dramatically for 2–4 days when training was initiated (n = 5 experiments, <5 catch bouts/day); in these cases the first day with substantial singing was treated as the first day of training.
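The initial threshold and the learning measure can be written out as follows. This is a sketch under stated assumptions: FF values are grouped per day in a dictionary (a hypothetical layout), the percentile uses linear interpolation between order statistics (a common convention, not necessarily the one in the authors' MATLAB code), and using the 30th percentile for downward training is our assumption, chosen so that ~70% of renditions are still hits.

```python
from statistics import mean


def percentile(values, pct):
    """Linear-interpolation percentile of a list of numbers."""
    xs = sorted(values)
    pos = (len(xs) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def initial_threshold(last_baseline_ff, drive_up=True):
    """Threshold giving a ~70% hit rate on the last baseline day.

    Drive up: renditions below the 70th percentile (~70%) are hit.
    Drive down (assumption): renditions above the 30th percentile (~70%) are hit.
    """
    return percentile(last_baseline_ff, 70 if drive_up else 30)


def learning_magnitude(ff_by_day, baseline_days, training_days=(3, 4)):
    """Mean FF pooled over training days 3-4 minus mean FF on the last two baseline days.

    ff_by_day maps a day label to the FF values (Hz) of that day's renditions.
    """
    base = [ff for d in baseline_days[-2:] for ff in ff_by_day[d]]
    train = [ff for d in training_days for ff in ff_by_day[d]]
    return mean(train) - mean(base)


ff_by_day = {"B1": [100, 102], "B2": [98, 100], "B3": [101, 99],
             3: [110, 112], 4: [108, 110]}
print(learning_magnitude(ff_by_day, ["B1", "B2", "B3"]))  # 10.5
```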

To estimate the amount of change in FF that could occur due to “drift” under control conditions we collected at least six days of continuous baseline singing data in 24 experiments directly preceding the start of WN. For these experiments, we measured the amount of change in FF that occurred in the absence of WN over the same duration used in the analysis of learning; this “baseline drift” was computed as the difference between mean FF on days 5–6 and the mean FF on days 1–2 of baseline recordings.

Generalization was defined for non-target syllables/contexts as the change in FF calculated as a percent of the change in FF for the target syllable in the target context in the same experiment. For analyses of generalization, we only used experiments with significant learning in the target context, because generalization is not well-defined in the absence of learning in the target context. The criterion for significant learning was that the shift in FF exceeded the 97.5th percentile of baseline drift pooled across syllables and birds (n = 30/36 experiments met that criterion). For all other analyses, we included all 36 experiments.
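The drift control, the significance criterion, and percent generalization combine into a few lines of arithmetic. In this hypothetical sketch, baseline days are keyed 1 to 6, `learning` is assumed to be signed so that positive values point in the direction that escaped WN, and the drift values are compared without taking absolute values; the source does not specify these details.

```python
from statistics import mean


def percentile(values, pct):
    """Linear-interpolation percentile of a list of numbers."""
    xs = sorted(values)
    pos = (len(xs) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def baseline_drift(ff_by_day):
    """Mean FF on baseline days 5-6 minus mean FF on baseline days 1-2 (no WN)."""
    early = [ff for d in (1, 2) for ff in ff_by_day[d]]
    late = [ff for d in (5, 6) for ff in ff_by_day[d]]
    return mean(late) - mean(early)


def significant_learning(learning, pooled_drifts, pct=97.5):
    """True if learning exceeds the 97.5th percentile of pooled baseline drift."""
    return learning > percentile(pooled_drifts, pct)


def generalization_percent(nontarget_shift, target_shift):
    """FF change in a non-target context as a percent of the target-context change.

    Signs cancel, so shifts in the same direction as learning give positive values.
    """
    return 100.0 * nontarget_shift / target_shift
```

Because generalization is a ratio, it becomes unstable as the target shift approaches zero, which is consistent with restricting this analysis to experiments with significant learning.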

To determine whether “off-target” delivery of WN could have influenced measured values of generalization, we measured the frequency with which WN was delivered to targeted syllables in non-target contexts within the set of songs that were used to quantify learning over days 1–4 of WN training. In 36 out of 48 cases the frequency of off-target hits was 0% (Figure S1Cii, blue histogram). In the remaining 12 non-target context cases the median frequency of mis-targeting was 1.2%, with a range of 0.2% to 2.7%. Similarly, for different-type syllables, for 227/235 cases the frequency of off-target hits was 0% (Figure S1Cii, brown histogram). In the remaining 18 cases for different-type syllables, the median hit frequency was 1.0%, with a range of 0.2% to 5.6%. Moreover, regression analyses confirmed that the rare off-target hits do not explain the patterns of generalization that we report in our manuscript (Figure S1Ciii, iv).

To test which features of a syllable sung in different contexts best predict the magnitude of generalization across contexts, we fit a multiple linear regression model to examine the extent to which a linear combination of three variables (contextual similarity, acoustic distance, and FF correlation) predicted the response variable (generalization). For details, see “Multiple regression analysis of generalization” below.

Dual context training

In a subset (n = 13 experiments, 9 birds) of the single context experiments described above, we extended the duration of single context training (mean ± SD = 12.4 ± 6.4 total days, with 4.1 ± 2.1 days of incremental adjustment of FF threshold at the start of training). This “single context phase” was followed immediately by a “dual context phase”, during which a contingency was introduced to shift FF of the target syllable in a second context in the direction opposite that in the first context. Over ~3 – 5 days of the dual context phase we incrementally adjusted the FF threshold in the second context 1–2x a day to maintain a ~70% hit rate. Throughout that period we did not change the FF threshold in the first context, except in cases where FF in the first context shifted towards baseline to a point where >70% of renditions were being hit. In that case, in order to maintain an instructive reinforcing signal in the first context, we adjusted the FF threshold to maintain the hit rate at ~70%.

LMAN inactivation

We used microdialysis to infuse the GABA-A receptor agonist muscimol (Tocris, Catalog #: 0289) into LMAN to transiently silence neural activity during learning (Lindefors et al., 1989; Warren et al., 2011). Bilateral guide cannulas (CMA 7, Harvard Apparatus) were first stereotaxically implanted over LMAN. During implantation, the bird was positioned so that the ventral surface of the upper beak was 40° below horizontal. Cannulas were centered at 5.45–5.65 mm rostral and 1.5 mm lateral to the caudal point of the intersection of the midsagittal and transverse sinuses (i.e., “Y0”), and lowered to a depth such that the tip of the probe that would subsequently be inserted into the cannula would be 2.4 mm deep relative to the surface of the brain. Our goal was to position the tip of the probe at the center of LMAN in the rostral-lateral plane, and ~200 μm below the dorsal surface of LMAN. After birds recovered from surgery and were singing (~2 days), we inserted microdialysis probes (CMA 7, 1 mm membrane length, diameter 0.24 mm, 6 kDa cutoff) into the cannulas. The output of one probe was used as the input to the other probe, so that the two probes were perfused in series. Probes were connected to pumps via flexible tubing and PBS was continuously infused, except during LMAN inactivation when muscimol was infused (see below). Solutes diffuse through the membrane while maintaining zero net volume transfer. In some cases, the tubing was interfaced with a dual channel liquid commutator (Instech Labs 2-Channel Microdialysis Swivel). In all cases birds could comfortably move and sing during infusion. The pump was outside the sound-attenuating chamber, allowing us to switch solutions without disturbing the bird. Flow rate was maintained at 0.3–0.5 μl/min and increased to 0.8–1.0 μl/min during muscimol infusion.
The concentration of muscimol (dissolved in PBS) ranged from 100 μM to 700 μM across experiments, and was calibrated before each experiment to elicit a reduction in FF variability (a marker of successful LMAN inactivation) before training began (see below and Figures S4A, B).

LMAN inactivation was performed in a similar time window on each inactivation day for a given experiment (~12:30 pm to ~4:00 pm). We analyzed songs starting after a lag from the switch to muscimol, which accounts for flow of drug through tubing and diffusion within tissue. The duration of that lag was separately determined for each experiment based on the amount of time it took from the start of infusion to observe an ~30% reduction in FF coefficient of variation (CV, standard deviation divided by the mean) - that effect is a consistent indicator of lesion (Hampton et al., 2009) or inactivation (Warren et al., 2011) of LMAN. The duration, based solely on baseline days, from the start of infusion until FF CV was reduced to a stable value was used as the lag duration for the entire experiment (Figure S4A; mean lag, 94.2 min; SD, 31.2 min). Muscimol infusion successfully reduced FF CV across syllable types and contexts, both during baseline and training (Figure S4B). In the 12/14 experiments in which we restricted analyses to catch bouts (see “FF calculation”), starting from the time when muscimol data were collected (i.e., the end of the lag period), we transiently increased the catch rate (on average increased to 0.8 from 0.15) to allow us to collect a sample of catch song bouts of comparable size to the sample collected pre-inactivation. This was necessary because the duration of singing during inactivation was lower than before inactivation. The catch rate was decreased back to its normal value at the end of muscimol infusion. FF during PBS was quantified in a time window starting at ~8:30 am (lights were turned on at 7:00 am) and ending at the PBS-to-muscimol switch time.
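The lag criterion can be sketched as a scan over time bins after the PBS-to-muscimol switch. This is an illustration only: the 15-minute bin width, the minimum of 5 renditions per bin, and the function name `estimate_lag` are hypothetical choices, and the sketch uses the ~30% CV reduction from the text as a fixed cutoff rather than a judgment of when CV reached a stable value.

```python
from statistics import mean, stdev


def coefficient_of_variation(ff_values):
    """CV = standard deviation / mean, as defined in the text."""
    return stdev(ff_values) / mean(ff_values)


def estimate_lag(renditions, pbs_cv, bin_min=15.0, reduction=0.30, min_n=5):
    """Minutes from the switch to muscimol until FF CV drops by ~`reduction`.

    renditions: (minutes_since_switch, ff_hz) pairs from one baseline inactivation day.
    pbs_cv: FF CV of the same syllable measured during PBS infusion.
    Returns the start time of the first qualifying bin, or None if none qualifies.
    """
    bins = {}
    for t, ff in renditions:
        bins.setdefault(int(t // bin_min), []).append(ff)
    for b in sorted(bins):
        ffs = bins[b]
        if len(ffs) >= min_n and coefficient_of_variation(ffs) <= (1.0 - reduction) * pbs_cv:
            return b * bin_min
    return None
```

Binning by time keeps each CV estimate based on several renditions, which matters because a CV computed from only two or three renditions is very noisy.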

FF shifts during PBS and muscimol infusion were normalized relative to their respective baselines (number of days directly preceding start of training, range: 3 – 7 days; muscimol inactivation baseline data were collected in a subset of those days, range: 2–4 days). PBS shift was defined as FF during PBS infusion minus baseline FF during PBS infusion, while muscimol shift was defined as FF during muscimol infusion minus baseline FF during muscimol infusion. The differences between baseline FF during muscimol and PBS infusion were small and not in a consistent direction; therefore all of our results held if we instead normalized muscimol FF to baseline FF during PBS infusion.
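The normalization above, together with the two quantities reported in the Results, can be sketched as follows. The shift definitions follow the text. Defining `afp_bias` as PBS shift minus muscimol shift and `consolidation` as muscimol shift as a percent of PBS shift is our assumption, based on the convention of Warren et al. (2011), because neither formula is stated in this section.

```python
from statistics import mean


def lman_inactivation_summary(pbs_ff, mus_ff, pbs_baseline_ff, mus_baseline_ff):
    """Summarize one context's LMAN-inactivation effect on learned FF shifts.

    Each argument is a list of FF values (Hz) from the matching infusion condition.
    Each shift is normalized to the baseline collected under the same infusion.
    afp_bias and consolidation use an assumed convention; pbs_shift must be nonzero.
    """
    pbs_shift = mean(pbs_ff) - mean(pbs_baseline_ff)  # learning expressed with AFP intact
    mus_shift = mean(mus_ff) - mean(mus_baseline_ff)  # learning expressed with LMAN silenced
    afp_bias = pbs_shift - mus_shift                  # portion that depends on LMAN activity
    consolidation = 100.0 * mus_shift / pbs_shift     # percent retained without LMAN
    return pbs_shift, mus_shift, afp_bias, consolidation
```

Applied separately to each context, `afp_bias` values with the same sign would correspond to the coherent bias seen during Congruent training, and values with opposite signs would correspond to the antagonistic bias seen during Incongruent training.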

LMAN inactivation during single context training

Based on a previous LMAN inactivation study (Warren et al., 2011), we defined a maintained learning period as a period of at least five days during which i) the FF threshold for WN was no longer being adjusted and ii) each day’s mean FF was within a window defined by the mean FF across all days of the period ± 0.75 times the mean of within-day standard deviations of FF. On average, the maintained learning period for single context experiments started on day 5.2 (SD = 2.8) relative to the start of training. The period in which LMAN inactivation data were obtained started on day 4 of training, when a large change in FF in the target context had been reached, and ended on day 10 of training or day 4 of the maintained learning period, whichever was earlier. We defined this as an “early” period in the learning trajectory, during which the AFP has been shown to contribute significantly to the expression of learning (Warren et al., 2011). Our main results held when we used other windows (first and last day modified by ± 1 or 2 days). LMAN inactivation days were usually separated by at least one day, and data from multiple inactivation days were averaged (separately for baseline and learning days). For comparison of effects of LMAN inactivation for target vs. non-target contexts, effects for multiple non-target contexts were averaged to get a single mean value for each experiment.
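The two criteria for a maintained learning period can be checked over sliding five-day windows. In this sketch (hypothetical names and inputs), each day is summarized by its mean FF, its within-day SD of FF, and a flag indicating that the WN threshold was no longer adjusted. The function returns the first day that opens a qualifying window; extending the period beyond five days is not shown.

```python
from statistics import mean


def meets_stability_criterion(day_means, day_sds, k=0.75):
    """Criterion ii: each daily mean FF lies within mean(day_means) ± k * mean(day_sds)."""
    center = mean(day_means)
    half_width = k * mean(day_sds)
    return all(abs(m - center) <= half_width for m in day_means)


def maintained_period_start(day_means, day_sds, threshold_fixed, min_days=5, k=0.75):
    """Index of the first day that opens a `min_days` window meeting criteria i and ii.

    threshold_fixed[d] is True once the WN threshold is no longer adjusted on day d.
    Returns None if no window qualifies.
    """
    for start in range(len(day_means) - min_days + 1):
        days = range(start, start + min_days)
        if not all(threshold_fixed[d] for d in days):
            continue  # criterion i fails somewhere in the window
        if meets_stability_criterion([day_means[d] for d in days],
                                     [day_sds[d] for d in days], k):
            return start
    return None
```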

LMAN inactivation during Congruent and Incongruent training

We measured the contribution of the AFP to expression of learning for Congruent and Incongruent training experiments, in both early and late periods in the learning trajectory. For 12 experiments, learning was driven and maintained in context 1, following which learning was then driven in the second context in either Congruent (i.e., same direction as in context 1, n = 5), or Incongruent (i.e., opposite direction from context 1, n = 7) directions, while maintaining the reinforcement in the first context (n = 6 birds, with 5 birds that contributed Congruent experiments also contributing 6 Incongruent experiments). Effects of LMAN inactivation were assessed relative to the onset of a period of maintained learning, defined as the intersection of maintained learning periods separately determined for each context (as described above for single context experiments). AFP bias and consolidation, inferred from effects of LMAN inactivation on learning, were grouped and averaged over an early period (days 1–4) and late period (days 5–6) of maintained learning. For 3 out of 12 experiments, feedback in context 2 was provided in the opposite direction prior to the onset of Congruent (n = 2) or Incongruent (n = 1) training. The exclusion of those 3 experiments did not alter the significance of the effects of LMAN inactivation (Figure S5B).

Localization of probes

We performed post-mortem histology on sectioned (40 μm thick, coronal) tissue to confirm placement of probes within or directly adjacent to LMAN. Tissue damage, revealed by Nissl or DAPI stain, indicated the location of the probe. LMAN was visualized by immunostaining for calcitonin gene related peptide (Sigma, RRID: AB_259091, 1:5000 to 1:10000) (Bottjer et al., 1997).

QUANTIFICATION AND STATISTICAL ANALYSIS

Overview

Unless noted otherwise, to compare two samples we used the nonparametric two-sided Wilcoxon rank-sum test and for paired samples we used the nonparametric two-sided Wilcoxon signed-rank test. Within-group variances were similar for groups being compared. All regression analyses were performed using the ordinary least squares method. Tests were deemed statistically significant if p < 0.05. Statistical details for all experiments are included in their corresponding figure legends. For experiments corresponding to Figures 1–4, no randomization was required in allocating animals to experimental groups because each animal contributed to both experimental groups (dimension 1: target vs. nontarget context; dimension 2: PBS vs. muscimol infusion). For the experiment in Figure 5, randomization in allocation to experimental types (Congruent vs. Incongruent) was not required because almost all (5/6) animals contributed data to both experimental types. Syllable labeling was performed blind to magnitude of learning and LMAN inactivation effects. For LMAN inactivation experiments, experimenters were not blinded to whether data were from PBS or muscimol infusion periods, as muscimol infusion causes changes to pitch CV that are conspicuous even during visual inspection of spectrograms. No datasets were excluded unless appropriate as described elsewhere [i.e., in calculation of percent generalization (see “Single context training” above) or in control analyses which were restricted, by design, to a subset of experiments (Figures S3A, S4C, S5B)]. Sample sizes were not predetermined but were comparable to previous related studies (Andalman and Fee, 2009; Charlesworth et al., 2011, 2012; Tumer and Brainard, 2007; Warren et al., 2011). All analyses were performed using custom-written MATLAB (Mathworks) software.

FF calculation

All analyses were performed on FF values that were calculated offline. In 33/36 experiments we analyzed only “catch” bouts, which were a randomly interleaved 7–16 % of song bouts in which reinforcement was withheld. In the other three experiments, we analyzed both catch bouts and a subset of bouts in which reinforcement occurred normally (“training bouts”). In experiments in which we analyzed training bouts, we excluded from analysis the two syllables directly following the target syllable, to avoid potential acute effects of WN on the FF of those syllables (Sakata and Brainard, 2006). For each rendition, we calculated a spectrogram using a Gaussian-windowed (σ = 1 ms) short-time Fourier transform (window size = 1024 samples; overlap = 1020 samples; sampling rate = 32 kHz). Within each time bin, FF was defined as the frequency corresponding to peak power of the first harmonic, estimated using parabolic interpolation. FF for the rendition was then calculated as the mean FF across time bins for a fixed window defined relative to syllable onset (mean window size = 14.4 ms). All syllables consisting of largely broadband noise (e.g. introductory note J in Figure 1A) were excluded from learning analyses.
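With a 1024-sample transform at 32 kHz, spectrogram bins are spaced 32000 / 1024 = 31.25 Hz apart, which is coarse relative to learned FF shifts. Parabolic interpolation refines the peak to sub-bin precision. The sketch below starts from precomputed power spectra (one list per time bin) and assumes a hypothetical search band [fmin_hz, fmax_hz] that contains the first harmonic; it is not the authors' MATLAB code.

```python
def parabolic_peak(power, k):
    """Sub-bin peak location from a parabola through bins k-1, k, k+1."""
    a, b, c = power[k - 1], power[k], power[k + 1]
    denom = a - 2.0 * b + c
    if denom == 0:
        return float(k)  # flat neighborhood: keep the integer bin
    return k + 0.5 * (a - c) / denom


def ff_in_time_bin(power, bin_hz, fmin_hz, fmax_hz):
    """FF for one spectrogram column: interpolated power peak inside [fmin_hz, fmax_hz]."""
    lo = max(1, int(fmin_hz / bin_hz))
    hi = min(len(power) - 2, int(fmax_hz / bin_hz))
    k = max(range(lo, hi + 1), key=lambda i: power[i])  # coarse peak bin
    return parabolic_peak(power, k) * bin_hz


def rendition_ff(spectrogram, bin_hz, fmin_hz, fmax_hz, window):
    """Mean FF over time bins [start, stop) of a fixed window after syllable onset."""
    start, stop = window
    values = [ff_in_time_bin(col, bin_hz, fmin_hz, fmax_hz)
              for col in spectrogram[start:stop]]
    return sum(values) / len(values)
```

For a spectral peak that is exactly parabolic, the interpolation recovers the true peak location. For real spectra, it is an approximation that is often applied to log power.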

Multiple regression analysis of generalization

We fit a multiple linear regression model to examine the extent to which a linear combination of three variables (contextual similarity, acoustic distance, and FF correlation) predicted the response variable (generalization) in experiments driving learning in only a single context (see “Single context training” above).

Contextual similarity was coded as a discrete variable with values 0, 1 or 2 corresponding to the number of syllables, directly preceding the target syllable, that were shared in the target and non-target contexts (see main text and Figure 1E for details).

Acoustic distance was measured between the target syllable when sung in the target and non-target contexts as the mean Euclidean distance in an 8-dimensional feature vector space. The acoustic features used were FF, duration, spectral entropy, temporal entropy, spectro-temporal entropy, amplitude slope, frequency slope and time to half-peak amplitude. All features were calculated as in (Wohlgemuth et al., 2010), with slight differences for FF (described in “FF calculation”) and frequency slope (as in Sakata and Brainard, 2006). For each syllable in each context, we calculated a mean feature vector across renditions from baseline recordings. The feature vectors for each syllable were normalized (via z-score relative to a global reference distribution of feature vectors from 110 randomly sampled baseline renditions from each syllable in each context), and acoustic distance between any two syllables was calculated as the distance between the mean z-scored feature vectors for those syllables.
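The z-scoring and distance steps can be sketched for feature vectors of any length (8 in the paper). Feature extraction itself is not shown. The function names are hypothetical, and the reference set stands in for the pooled baseline renditions described above. Because z-scoring is affine, z-scoring each rendition and then averaging gives the same vectors as averaging and then z-scoring.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_params(reference_vectors):
    """Per-feature mean and sample SD from a global reference set of feature vectors."""
    cols = list(zip(*reference_vectors))
    return [mean(c) for c in cols], [stdev(c) for c in cols]


def mean_vector(vectors):
    """Feature-wise mean across renditions."""
    return [mean(c) for c in zip(*vectors)]


def acoustic_distance(renditions_a, renditions_b, mu, sd):
    """Euclidean distance between the z-scored mean feature vectors of two syllables."""
    za = [(x - m) / s for x, m, s in zip(mean_vector(renditions_a), mu, sd)]
    zb = [(x - m) / s for x, m, s in zip(mean_vector(renditions_b), mu, sd)]
    return sqrt(sum((p - q) ** 2 for p, q in zip(za, zb)))
```

Z-scoring against a common reference keeps features with large numeric ranges, such as FF in Hz, from dominating features such as entropy.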

FF correlation was measured as the Pearson’s correlation of FF for a syllable in two different contexts across song bouts. If a syllable in a specific context was sung more than once in a given song bout, we first took the average across those renditions to obtain one value of FF for each context for that song bout. Therefore, each pairwise correlation was calculated between two vectors, one for each context in the pair, each with length equal to the number of song bouts in the dataset.
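The bout-level averaging and correlation can be sketched as follows. Each bout is represented as a dictionary from a context label to the FF values sung in that bout, a hypothetical layout. Skipping bouts that lack one of the two contexts is our assumption, because the text implies every bout contributes to both vectors.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ff_correlation(bouts, ctx_a, ctx_b):
    """Correlate per-bout mean FF of one syllable between two contexts."""
    xa, xb = [], []
    for bout in bouts:
        if bout.get(ctx_a) and bout.get(ctx_b):  # assumption: skip incomplete bouts
            xa.append(sum(bout[ctx_a]) / len(bout[ctx_a]))  # average repeats within a bout
            xb.append(sum(bout[ctx_b]) / len(bout[ctx_b]))
    return pearson(xa, xb)
```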

The parameters in the model were fit using the ordinary least squares method. The continuous predictor variables (acoustic distance and FF correlation) were first scaled such that a unit change in the scaled variable corresponded to a change of 1.59 times the sample standard deviation of that variable. This was performed to facilitate comparison with the regression coefficient for contextual similarity, since a unit change in contextual similarity corresponded to a change of 1.59 times its sample standard deviation.
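The rescaling and the fit can be sketched without external libraries. A one-unit change in contextual similarity equals 1.59 of its SDs, which implies that its sample SD is about 1 / 1.59 ≈ 0.63; scaling the continuous predictors the same way puts all three coefficients on a comparable footing. The helper names are hypothetical, and the normal-equation solver is a plain illustration of ordinary least squares, not the authors' MATLAB routine.

```python
from statistics import stdev


def scale_predictor(values, factor=1.59):
    """Rescale so that one unit equals `factor` sample SDs of the variable."""
    s = stdev(values)
    return [v / (factor * s) for v in values]


def ols(X, y):
    """Ordinary least squares with an intercept, solving (X'X) b = X'y.

    X: rows of predictor values; returns [intercept, beta_1, ..., beta_k].
    """
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, p))) / A[i][i]
    return coef
```

In use, each row of `X` would hold contextual similarity plus the scaled acoustic distance and FF correlation for one non-target context, and `y` would hold its percent generalization.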

DATA AND SOFTWARE AVAILABILITY

Data and custom-written software are available upon request.

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Antibodies
Anti-Calcitonin Gene Related Peptide | Sigma | RRID: AB_259091
Chemicals, Peptides, and Recombinant Proteins
Muscimol | Tocris | Cat#0289
Experimental Models: Organisms/Strains
Bengalese finch (Lonchura striata domestica) | This lab | N/A
Software and Algorithms
MATLAB | Mathworks | www.mathworks.com
Custom-written MATLAB code for data analysis | This paper | Request from Lead Contact
Labview | National Instruments | http://www.ni.com/en-us/shop/labview.html
Custom-written Labview software for song acquisition and WN-driven training | Tumer and Brainard, 2007; This paper | Request from Lead Contact
Other
Guide cannula (CMA 7) | Harvard Apparatus | Cat#8010684
Microdialysis probe (CMA 7, 1 mm) | Harvard Apparatus | Cat#P000082


Highlights.

  • Birdsong exhibits context-dependent vocal learning similar to human speech

  • Syllable modifications can be specific to or generalize across distinct sequences

  • Specificity reflects context-specific biasing from cortical-basal ganglia circuitry

  • Generalization reflects changes to a core syllable representation in motor circuitry

Acknowledgments

We thank K. Ganguly, J. Houde, A. Karpova, A. Nelson, P. Sabes, and members of the Brainard lab for discussions and comments on the manuscript. This work was supported by the Howard Hughes Medical Institute and NIH grants R01DC006636 and R01MH055987.

Footnotes

AUTHOR CONTRIBUTIONS

Conceptualization, L.Y.T. and M.S.B.; Methodology, L.Y.T. and M.S.B.; Software, L.Y.T.; Formal Analysis, L.Y.T.; Investigation, L.Y.T.; Resources, M.S.B.; Writing - Original Draft, L.Y.T.; Writing - Review & Editing, L.Y.T. and M.S.B.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  1. Andalman AS, Fee MS. A basal ganglia-forebrain circuit in the songbird biases motor output to avoid vocal errors. Proc Natl Acad Sci. 2009;106:12518–12523. doi: 10.1073/pnas.0903214106.
  2. Ansuini C, Giosa L, Turella L, Altoè G, Castiello U. An object for an action, the same object for other actions: effects on hand shaping. Exp Brain Res. 2008;185:111–119. doi: 10.1007/s00221-007-1136-4.
  3. Bottjer SW, Miesner EA, Arnold AP. Forebrain Lesions Disrupt Development but not Maintenance of Song in Passerine Birds. Science. 1984;224:901–903. doi: 10.1126/science.6719123.
  4. Bottjer SW, Roselinsky H, Tran NB. Sex Differences in Neuropeptide Staining of Song-Control Nuclei in Zebra Finch Brains. Brain Behav Evol. 1997;50:284–303. doi: 10.1159/000113342.
  5. Bouchard KE, Chang EF. Control of Spoken Vowel Acoustics and the Influence of Phonetic Context in Human Speech Sensorimotor Cortex. J Neurosci. 2014;34:12662–12677. doi: 10.1523/JNEUROSCI.1219-14.2014.
  6. Brainard MS, Doupe AJ. Interruption of a basal ganglia–forebrain circuit prevents plasticity of learned vocalizations. Nature. 2000;404:762–766. doi: 10.1038/35008083.
  7. Caudrelier T, Perrier P, Schwartz J-L, Rochet-Capellan A. Does Auditory-Motor Learning of Speech Transfer from the CV Syllable to the CVCV Word? Interspeech 2016 Proc. 2016:2095–2099.
  8. Charlesworth JD, Tumer EC, Warren TL, Brainard MS. Learning the microstructure of successful behavior. Nat Neurosci. 2011;14:373–380. doi: 10.1038/nn.2748.
  9. Charlesworth JD, Warren TL, Brainard MS. Covert skill learning in a cortical-basal ganglia circuit. Nature. 2012;486:251–255. doi: 10.1038/nature11078.
  10. Diedrichsen J, Kornysheva K. Motor skill learning between selection and execution. Trends Cogn Sci. 2015;19:227–233. doi: 10.1016/j.tics.2015.02.003.
  11. Doupe AJ, Kuhl PK. Birdsong and human speech: common themes and mechanisms. Annu Rev Neurosci. 1999;22:567–631. doi: 10.1146/annurev.neuro.22.1.567.
  12. Duan CA, Erlich JC, Brody CD. Requirement of Prefrontal and Midbrain Regions for Rapid Executive Control of Behavior in the Rat. Neuron. 2015;86:1491–1503. doi: 10.1016/j.neuron.2015.05.042.
  13. Dudman JT, Krakauer JW. The basal ganglia: from motor commands to the control of vigor. Curr Opin Neurobiol. 2016;37:158–166. doi: 10.1016/j.conb.2016.02.005.
  14. Engel KC, Flanders M, Soechting JF. Anticipatory and sequential motor control in piano playing. Exp Brain Res. 1997;113:189–199. doi: 10.1007/BF02450317.
  15. Fee MS, Goldberg JH. A hypothesis for basal ganglia-dependent reinforcement learning in the songbird. Neuroscience. 2011;198:152–170. doi: 10.1016/j.neuroscience.2011.09.069.
  16. Fujimoto H, Hasegawa T, Watanabe D. Neural Coding of Syntactic Structure in Learned Vocalizations in the Songbird. J Neurosci. 2011;31:10023–10033. doi: 10.1523/JNEUROSCI.1606-11.2011.
  17. Gadagkar V, Puzerey PA, Chen R, Baird-Daniel E, Farhang AR, Goldberg JH. Dopamine neurons encode performance error in singing birds. Science. 2016;354:1278–1282. doi: 10.1126/science.aah6837.
  18. Hampton CM, Sakata JT, Brainard MS. An Avian Basal Ganglia-Forebrain Circuit Contributes Differentially to Syllable Versus Sequence Variability of Adult Bengalese Finch Song. J Neurophysiol. 2009;101:3235–3245. doi: 10.1152/jn.91089.2008.
  19. Hilario M, Holloway T, Jin X, Costa RM. Different dorsal striatum circuits mediate action discrimination and action generalization. Eur J Neurosci. 2012;35:1105–1114. doi: 10.1111/j.1460-9568.2012.08073.x.
  20. Hoffmann LA, Sober SJ. Vocal Generalization Depends on Gesture Identity and Sequence. J Neurosci. 2014;34:5564–5574. doi: 10.1523/JNEUROSCI.5169-13.2014.
  21. Hoffmann LA, Saravanan V, Wood AN, He L, Sober SJ. Dopaminergic Contributions to Vocal Learning. J Neurosci. 2016;36:2176–2189. doi: 10.1523/JNEUROSCI.3883-15.2016.
  22. Houde JF, Jordan MI. Sensorimotor Adaptation in Speech Production. Science. 1998;279:1213–1216. doi: 10.1126/science.279.5354.1213.
  23. Howard IS, Franklin DW. Neural Tuning Functions Underlie Both Generalization and Interference. PLoS ONE. 2015;10:e0131268. doi: 10.1371/journal.pone.0131268.
  24. Howard IS, Ingram JN, Franklin DW, Wolpert DM. Gone in 0.6 Seconds: The Encoding of Motor Memories Depends on Recent Sensorimotor States. J Neurosci. 2012;32:12756–12768. doi: 10.1523/JNEUROSCI.5909-11.2012.
  25. Izawa J, Shadmehr R. Learning from Sensory and Reward Prediction Errors during Motor Adaptation. PLoS Comput Biol. 2011;7:e1002012. doi: 10.1371/journal.pcbi.1002012.
  26. Jerde TE, Soechting JF, Flanders M. Coarticulation in fluent fingerspelling. J Neurosci. 2003;23:2383–2393. doi: 10.1523/JNEUROSCI.23-06-02383.2003.
  27. Kao MH, Doupe AJ, Brainard MS. Contributions of an avian basal ganglia–forebrain circuit to real-time modulation of song. Nature. 2005;433:638–643. doi: 10.1038/nature03127.
  28. Kim HF, Hikosaka O. Distinct Basal Ganglia Circuits Controlling Behaviors Guided by Flexible and Stable Values. Neuron. 2013;79:1001–1010. doi: 10.1016/j.neuron.2013.06.044.
  29. Leonardo A, Fee MS. Ensemble Coding of Vocal Control in Birdsong. J Neurosci. 2005;25:652–661. doi: 10.1523/JNEUROSCI.3036-04.2005.
  30. Lindefors N, Amberg G, Ungerstedt U. Intracerebral Microdialysis: I. Experimental Studies of Diffusion Kinetics. J Pharmacol Methods. 1989;22:141–156. doi: 10.1016/0160-5402(89)90011-9.
  31. Mello GBM, Soares S, Paton JJ. A Scalable Population Code for Time in the Striatum. Curr Biol. 2015;25:1113–1122. doi: 10.1016/j.cub.2015.02.036.
  32. Miller EK, Cohen JD. An integrative theory of prefrontal cortex function. Annu Rev Neurosci. 2001;24:167–202. doi: 10.1146/annurev.neuro.24.1.167.
  33. Mooney R. Auditory-vocal mirroring in songbirds. Philos Trans R Soc B Biol Sci. 2014;369:20130179. doi: 10.1098/rstb.2013.0179.
  34. Mushiake H, Strick PL. Pallidal neuron activity during sequential arm movements. J Neurophysiol. 1995;74:2754–2758. doi: 10.1152/jn.1995.74.6.2754.
  35. Narayanan NS, Laubach M. Top-Down Control of Motor Cortex Ensembles by Dorsomedial Prefrontal Cortex. Neuron. 2006;52:921–931. doi: 10.1016/j.neuron.2006.10.021.
  36. Nottebohm F, Stokes TM, Leonard CM. Central control of song in the canary, Serinus canarius. J Comp Neurol. 1976;165:457–486. doi: 10.1002/cne.901650405.
  37. Ostry DJ, Gribble PL, Gracco VL. Coarticulation of jaw movements in speech production: is context sensitivity in speech kinematics centrally planned? J Neurosci. 1996;16:1570–1579. doi: 10.1523/JNEUROSCI.16-04-01570.1996.
  38. Rochet-Capellan A, Ostry DJ. Simultaneous Acquisition of Multiple Auditory–Motor Transformations in Speech. J Neurosci. 2011;31:2657–2662. doi: 10.1523/JNEUROSCI.6020-10.2011.
  39. Rochet-Capellan A, Richer L, Ostry DJ. Nonhomogeneous transfer reveals specificity in speech motor learning. J Neurophysiol. 2012;107:1711–1717. doi: 10.1152/jn.00773.2011.
  40. Rueda-Orozco PE, Robbe D. The striatum multiplexes contextual and kinematic information to constrain motor habits execution. Nat Neurosci. 2015;18:453–460. doi: 10.1038/nn.3924.
  41. Sakata JT, Brainard MS. Real-Time Contributions of Auditory Feedback to Avian Vocal Motor Control. J Neurosci. 2006;26:9619–9628. doi: 10.1523/JNEUROSCI.2027-06.2006.
  42. Schmidt MF, Wild JM. The respiratory-vocal system of songbirds: Anatomy, physiology, and neural control. Prog Brain Res. 2014;212:297–335. doi: 10.1016/B978-0-444-63488-7.00015-X.
  43. Shadmehr R, Mussa-Ivaldi FA. Adaptive representation of dynamics during learning of a motor task. J Neurosci. 1994;14:3208–3224. doi: 10.1523/JNEUROSCI.14-05-03208.1994.
  44. Shah A, Barto AG, Fagg AH. A dual process account of coarticulation in motor skill acquisition. J Mot Behav. 2013;45:531–549. doi: 10.1080/00222895.2013.837423.
  45. Simpson HB, Vicario DS. Brain pathways for learned and unlearned vocalizations differ in zebra finches. J Neurosci. 1990;10:1541–1556. doi: 10.1523/JNEUROSCI.10-05-01541.1990.
  46. Sosnik R, Hauptmann B, Karni A, Flash T. When practice leads to co-articulation: the evolution of geometrically defined movement primitives. Exp Brain Res. 2004;156:422–438. doi: 10.1007/s00221-003-1799-4.
  47. Tanji J, Shima K. Role for supplementary motor area cells in planning several movements ahead. Nature. 1994;371:413–416. doi: 10.1038/371413a0.
  48. Tumer EC, Brainard MS. Performance variability enables adaptive plasticity of “crystallized” adult birdsong. Nature. 2007;450:1240–1244. doi: 10.1038/nature06390.
  49. Turner RS, Desmurget M. Basal ganglia contributions to motor control: a vigorous tutor. Curr Opin Neurobiol. 2010;20:704–716. doi: 10.1016/j.conb.2010.08.022.
  50. Vu ET, Mazurek ME, Kuo YC. Identification of a forebrain motor programming network for the learned song of zebra finches. J Neurosci. 1994;14:6924–6934. doi: 10.1523/JNEUROSCI.14-11-06924.1994.
  51. Wainscott SK, Donchin O, Shadmehr R. Internal Models and Contextual Cues: Encoding Serial Order and Direction of Movement. J Neurophysiol. 2004;93:786–800. doi: 10.1152/jn.00240.2004.
  52. Warren TL, Tumer EC, Charlesworth JD, Brainard MS. Mechanisms and time course of vocal learning and consolidation in the adult songbird. J Neurophysiol. 2011;106:1806–1821. doi: 10.1152/jn.00311.2011.
  53. Wohlgemuth MJ, Sober SJ, Brainard MS. Linked Control of Syllable Sequence and Phonology in Birdsong. J Neurosci. 2010;30:12936–12949. doi: 10.1523/JNEUROSCI.2690-10.2010.
  54. Woolley SC, Rajan R, Joshua M, Doupe AJ. Emergence of Context-Dependent Variability across a Basal Ganglia Network. Neuron. 2014;82:208–223. doi: 10.1016/j.neuron.2014.01.039.
  55. Xiong Q, Znamenskiy P, Zador AM. Selective corticostriatal plasticity during acquisition of an auditory discrimination task. Nature. 2015;521:348–351. doi: 10.1038/nature14225.


RESOURCES