eLife. 2019 Dec 10;8:e46015. doi: 10.7554/eLife.46015

Neural ensemble dynamics in dorsal motor cortex during speech in people with paralysis

Sergey D Stavisky 1,2,, Francis R Willett 1,2, Guy H Wilson 3, Brian A Murphy 4,5, Paymon Rezaii 1, Donald T Avansino 1, William D Memberg 4,5, Jonathan P Miller 5,6, Robert F Kirsch 4,5, Leigh R Hochberg 7,8,9, A Bolu Ajiboye 4,5, Shaul Druckmann 10, Krishna V Shenoy 2,10,11,12,13,14,, Jaimie M Henderson 1,13,14,
Editors: Tamar R Makin15, Barbara G Shinn-Cunningham16
PMCID: PMC6954053  PMID: 31820736

Abstract

Speaking is a sensorimotor behavior whose neural basis is difficult to study with single neuron resolution due to the scarcity of human intracortical measurements. We used electrode arrays to record from the motor cortex ‘hand knob’ in two people with tetraplegia, an area not previously implicated in speech. Neurons modulated during speaking and during non-speaking movements of the tongue, lips, and jaw. This challenges whether the conventional model of a ‘motor homunculus’ division by major body regions extends to the single-neuron scale. Spoken words and syllables could be decoded from single trials, demonstrating the potential of intracortical recordings for brain-computer interfaces to restore speech. Two neural population dynamics features previously reported for arm movements were also present during speaking: a component that was mostly invariant across initiating different words, followed by rotatory dynamics during speaking. This suggests that common neural dynamical motifs may underlie movement of arm and speech articulators.

Research organism: Human

eLife digest

Speaking involves some of the most precise and coordinated movements humans make. Learning how the brain produces speech could lead to better treatments for speech disorders. But it can be challenging to study. Human speech is unique, limiting what can be learned from animal studies. There also are few opportunities where it would be safe or ethical to take measurements from inside a person’s brain while they talk. Most previous studies have recorded brain activity during speech in patients who have had electrodes placed in the brain for epilepsy or Parkinson’s disease treatment.

Now, Stavisky et al. show that brain cells that control hand and arm movements are also active during speech. Two patients who had lost the use of their arms and legs but were able to speak participated in the study. The two individuals were already enrolled in a pilot clinical trial of a brain-computer interface to help them control prosthetic devices. As part of this trial, the volunteer participants had two 100-electrode arrays surgically placed in the part of the brain that controls the movement of the arms and hands.

This study made the unexpected discovery that these brain cells multitask: they not only control arm and hand movements, but also carry information about movements of the lips, tongue and mouth necessary for speech. Stavisky et al. also found similarities in the patterns of brain activity during hand and arm movements and speech.

By analyzing the activity in these brain cells as the two individuals recited words and syllables, Stavisky et al. were also able to train computers to identify which sound the person spoke from the brain activity alone. This is a first step towards developing a technology that could synthesize speech from a person’s brain activity as they try to speak. Much more work is needed to synthesize continuous speech. But the study provides initial evidence that it might be possible to use recordings from inside the brain to one day restore speech to individuals who have lost it.

Introduction

Speaking requires coordinating numerous articulator muscles with exquisite timing and precision. Understanding how the sensorimotor system accomplishes this behavioral feat requires studying its neural underpinnings, which are critical for identifying (Tankus and Fried, 2018) and treating the causes of speech disorders and for building brain-computer interfaces (BCIs) to restore lost speech (Guenther et al., 2009; Herff and Schultz, 2016). Speaking is also a uniquely human behavior, which presents a high barrier to electrophysiological investigations. Previous direct neural recordings during speaking have come from electrocorticography (ECoG) (Bouchard and Chang, 2014; Cheung et al., 2016; Mugler et al., 2014) or single-unit (SUA) recordings from penetrating electrodes during the course of clinical treatment for epilepsy (Chan et al., 2014; Creutzfeldt et al., 1989; Tankus et al., 2012) or deep brain stimulation for Parkinson’s disease (Lipski et al., 2018; Tankus and Fried, 2018). Such studies have begun to characterize motor cortical population dynamics underlying speech (Bouchard et al., 2013; Chartier et al., 2018; Pei et al., 2011), but not at the finer spatial scale (compared to ECoG) or across larger neural ensembles (compared to single electrodes) afforded by the high-density intracortical recordings widely used in animal studies (Allen et al., 2019; Cohen and Maunsell, 2009; Kiani et al., 2014; Smith and Kohn, 2008), including those examining arm reaching (Carmena et al., 2003; Churchland et al., 2012; Kaufman et al., 2016; Maynard et al., 1999).

We studied speech production at this resolution by recording from multielectrode arrays previously placed in human motor cortex as part of the BrainGate2 BCI clinical trial (Hochberg et al., 2006). This research context dictated two important elements of the present study’s design. First, both participants had tetraplegia due to spinal-cord injury but were able to speak; this enabled observing motor cortical spiking activity during overt speaking, in contrast to earlier studies of attempted speech by participants unable to speak (Brumberg et al., 2011; Guenther et al., 2009). However, these participants’ long-term paralysis means that their neurophysiology may differ from that of able-bodied people; we return to the need for caution in interpreting these results in the Discussion.

Second, the electrode arrays were in the dorsal ‘hand knob’ area of motor cortex, which we previously found to strongly modulate to these participants’ attempted movement of their arm and hand (Ajiboye et al., 2017; Brandman et al., 2018; Pandarinath et al., 2017). Speech-related activity has not previously been reported in this cortical area, but there are several hints in the literature that dorsal motor cortex may carry such activity. Although imaging experiments consistently identify ventral cortical activation during speaking tasks, a meta-analysis of such studies (Guenther, 2016) indicates that responses are occasionally seen (though not, to our knowledge, explicitly called out) in dorsal motor cortex. Additionally, behavioral studies (Gentilucci and Campione, 2011; Vainio et al., 2013), transcranial magnetic stimulation studies (Devlin and Watkins, 2007; Meister et al., 2003), and electrical stimulation mapping studies (Breshears et al., 2018) have reported interactions (and interference) between motor control of the hand and mouth. This close linkage between hand and speech networks has been hypothesized to be due to a need for hand-mouth coordination and an evolutionary relationship between manual and articulatory gestures (Gentilucci and Stefani, 2012; Rizzolatti and Arbib, 1998). Here, we explicitly set out to test whether neuronal firing rates in this dorsal motor cortical area modulated when participants produced speech and orofacial movements.

Results

Speech-related activity in dorsal motor cortex

We recorded neural activity during speaking from participants ‘T5’ and ‘T8’, who previously had two arrays each consisting of 96 electrodes placed in the ‘hand knob’ area of motor cortex (Figure 1A,B). The participants performed a task in which on each trial they heard one of 10 different syllables or one of 10 short words, and then spoke the prompted sound after hearing a go cue (Figure 1—figure supplement 1 shows audio spectrograms and reaction times for these tasks). We analyzed both sortable SUA that could be attributed to an individual neuron’s action potentials (Figure 1C,D), and ‘threshold crossing’ spikes (TCs) that might come from one or several neurons (Figure 1—figure supplement 2). Firing rates showed robust changes during speaking of syllables (Figure 1, Figure 1—figure supplement 2, Video 1) and words (Figure 1—figure supplement 3). Significant modulation was found during speaking of at least one syllable (p<0.05 compared to silence) in 73/104 T5 electrodes’ TCs (13/22 SUA) and 47/101 T8 electrodes’ TCs (12/25 SUA). Active neurons were distributed throughout the area sampled by the arrays, and most modulated to speaking multiple syllables (Figure 1B and Figure 1—figure supplement 4), suggesting a broadly distributed coding scheme. This is consistent with previous single neuron recordings in the temporal lobe (Creutzfeldt et al., 1989; Tankus et al., 2012).
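
As a concrete illustration of the per-electrode significance test described above, the following is a minimal Python sketch (not the study’s MATLAB analysis code) of comparing each electrode’s spike counts during speaking of each syllable against the silent condition with a rank-sum test; the dictionary of trial spike counts and the 0.05 threshold are illustrative assumptions.

    import numpy as np
    from scipy import stats

    def modulated_syllables(spike_counts, silent_key='silence', alpha=0.05):
        """Return syllables whose spike counts differ significantly from silence.

        spike_counts: dict mapping condition label -> (n_trials,) array of spike
        counts in a fixed analysis window (hypothetical variable name)."""
        silent = np.asarray(spike_counts[silent_key])
        significant = []
        for label, counts in spike_counts.items():
            if label == silent_key:
                continue
            # Two-sided Wilcoxon rank-sum test across trials.
            _, p = stats.ranksums(np.asarray(counts), silent)
            if p < alpha:
                significant.append(label)
        return significant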

Figure 1. Speech-related neuronal spiking activity in dorsal motor cortex.

(A) Participants heard a syllable or word prompt played from a computer speaker and were instructed to speak it back after hearing a go cue. Motor cortical signals and audio were simultaneously recorded during the task. The timeline shows example audio data recorded during one trial. (B) Participants’ MRI-derived brain anatomy. Blue squares mark the locations of the two chronic 96-electrode arrays. Insets show electrode locations, with shading indicating the number of different syllables for which that electrode recorded significantly modulated firing rates (darker shading = more syllables). Non-functioning electrodes are shown as smaller dots. CS is central sulcus. (C) Raster plot showing spike times of an example neuron across multiple trials of participant T5 speaking nine different syllables, or silence. Data are aligned to the prompt, the go cue, and acoustic onset (AO). (D) Trial-averaged firing rates (mean ± s.e.) for the same neuron and two others. Insets show these neurons’ action potential waveforms (mean ± s.d.). The electrodes where these neurons were recorded are circled in the panel B insets using colors corresponding to these waveforms. (E) Time course of overall neural modulation for each syllable after hearing the prompt (left alignment) and when speaking (right alignment). Population neural distances between the spoken and silent conditions were calculated from TCs using an unbiased measurement of firing rate vector differences (see Methods). This metric yields signed values near zero when population firing rates are essentially the same between conditions. Firing rate changes were significantly greater (p < 0.01, sign-rank test) during speech production (comparison epoch shown by the black window after Go) compared to after hearing the prompt (gray window after Prompt). Each syllable’s mean modulation across the comparison epoch is shown with the corresponding color’s horizontal tick to the right of the plot. The vertical scale is the same across participants, revealing the larger speech-related modulation in T5’s recordings.


Figure 1—figure supplement 1. Prompted speaking tasks behavior.


(A) Acoustic spectrograms for the participants’ spoken syllables. Power was averaged over all analyzed trials. Note that da is missing for T5 because he usually misheard this cue as ga or ba. (B) Same as panel A but for the words datasets. (C) Reaction time distributions for each dataset.
Figure 1—figure supplement 2. Example threshold crossing spike rates.


The left column shows −4.5 × root mean square (RMS) voltage threshold crossing firing rates during the syllables task, recorded on the electrodes from which the single neuron spikes in Figure 1D were spike sorted. The right column shows three additional example electrodes’ firing rates. Insets show the unsorted threshold crossing spike waveforms.
Figure 1—figure supplement 3. Neural activity while speaking short words.


(A) Firing rates during speaking of short words for three example neurons (blue spike waveform insets) and three example electrodes’ −4.5 × RMS threshold crossing spikes (gray waveform insets). Data are presented similarly to Figure 1D and are from the T5-words and T8-words datasets. (B) Firing rate differences compared to the silent condition across the population of threshold crossings, presented as in Figure 1E. The ensemble modulation was significantly greater when speaking words compared to when hearing the prompts (p<0.01, sign-rank test).
Figure 1—figure supplement 4. Neural correlates of spoken syllables are not spatially segregated in dorsal motor cortex.


(A) Electrode array maps similar to Figure 1B insets are shown for each syllable separately to reveal where modulation was observed during production of that sound. Electrodes where the TCs firing rate changed significantly during speech, as compared to the silent condition, are shown as colored circles. Non-modulating electrodes are shown as larger gray circles, and non-functioning electrodes are shown as smaller dots. Adding up how many different syllables each electrode’s activity modulates to yields the summary insets shown in Figure 1B. These plots reveal that electrodes were not segregated into distinct cortical areas based on which syllables they modulated to. (B) Histograms showing the distribution of how many different syllables evoke a significant firing rate change for electrode TCs (each participant’s left plot) and sorted single neurons (right plot). The first bar in each plot, which corresponds to electrodes or neurons whose activity only changed when speaking one syllable, is further divided based on which syllable this modulation was specific to (same color scheme as in panel A). This reveals two things. First, single neurons or TCs (which may capture small numbers of nearby neurons) were typically not narrowly tuned to one sound. Second, there was not one specific syllable whose neural correlates were consistently observed on separate electrodes/neurons from the rest of the syllables.
Figure 1—figure supplement 5. Neural activity shows phonetic structure.


(A) The T5-phonemes dataset consists of the participant speaking 420 unique words which together sampled 41 American English phonemes. We constructed firing rate vectors for each phoneme using a 150 ms window centered on the phoneme start (one element per electrode), averaged across every instance of that phoneme. This dissimilarity matrix shows the difference between each pair of phonemes’ firing rate vectors, calculated using the same neural distance method as in Figure 1E. The matrix is symmetrical across the diagonal. Diagonal elements (i.e. within-phoneme distances) were constructed by comparing split halves of each phoneme’s instances. The phonemes are ordered by place of articulation grouping (each group is outlined with a box of different color). (B) Violin plots showing all neural distances from panel A divided based on whether the two compared phonemes are in the same place of articulation group (‘Within group’, red) or whether the two phonemes are from different place of articulation groups (‘Between groups’, black). Center circles show each distribution’s median, vertical bars show 25th to 75th percentiles, and horizontal bars shows distribution means. The mean neural distance across all Within group pairs was 30.6 Hz, while the mean across all Between group pairs was 42.8 Hz (difference = 12.2 Hz). (C) The difference in between-group versus within-group neural distances from panel B, marked with the blue line, far exceeds the distribution of shuffled distances (brown) in which the same summary statistic was computed 10,000 times after randomly permuting the neural distance matrix rows and columns. These shuffles provide a null control in which the relationship between phoneme pairs’ neural activity differences and these phonemes’ place of articulation groupings are scrambled. (D) A hierarchical clustering dendrogram based on phonemes’ neural population distances from panel A. At the bottom level, each phoneme is placed next to the (other) phoneme with the most similar neural population activity. Successive levels combine nearest phoneme clusters. By grouping phonemes based solely on their neural similarities (rather than one specific trait like place of articulation, indicated here with the same colors as in the panel A groupings), this dendrogram provides a complementary view that highlights that many neural nearest neighbors are phonetically similar (e.g. /d/ and /g/ stop-plosives, /θ/ and /v/ fricatives, /ŋ/ and /n/ nasals) and that related phonemes form larger clusters, such as the left-most major branch of mostly vowels or the sibilant cluster /s/, /ʃ/, and /dʒ/. At the same time, there are some phonemes that appear out of place, such as plosive consonant /b/ appearing between vowels /ɑ/ and /ɔ/ (we speculate this could reflect neural correlates of co-articulation from the vowels that frequently followed the brief /b/ sound).
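
To make the clustering step in panel D concrete, here is a minimal Python sketch (assuming a 41 × 41 matrix `dist` of the pairwise neural distances from panel A and a matching list `phoneme_labels`; both variable names are hypothetical, and the published analysis may differ in linkage details):

    import numpy as np
    from scipy.cluster import hierarchy
    from scipy.spatial.distance import squareform

    def phoneme_dendrogram(dist, phoneme_labels):
        # Hierarchical clustering expects a symmetric dissimilarity matrix with a
        # zero diagonal, so the split-half diagonal entries are zeroed and any
        # small negative (noise-floor) distance estimates are clipped to zero.
        d = np.asarray(dist, dtype=float)
        d = (d + d.T) / 2.0
        np.fill_diagonal(d, 0.0)
        d = np.clip(d, 0.0, None)
        linkage = hierarchy.linkage(squareform(d, checks=False), method='average')
        return hierarchy.dendrogram(linkage, labels=phoneme_labels, no_plot=True)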

Video 1. Example audio and neural data from eleven contiguous trials of the prompted syllables speaking task.


The audio track was recorded during the task and digitized alongside the neural data; it starts with the two beeps indicating trial start, after which the syllable prompt was played from computer speakers, followed by the go cue clicks, and finally the participant speaking the syllable. The video shows the concurrent −4.5 × RMS threshold crossing spike rate on each electrode. Each circle corresponds to one electrode, with their spatial layout corresponding to electrodes’ locations in motor cortex as in the Figure 1B inset. Each electrode’s moment-by-moment color and size represent its firing rate (soft-normalized with a 10 Hz offset, smoothed with a 50 ms s.d. Gaussian kernel). The color map goes from pink (minimum rate across electrodes) to yellow (maximum rate), while size varies from small (minimum rate) to large (maximum rate). Non-functioning electrodes are shown as small gray dots. To assist the viewer in perceiving the gestalt of the population activity, a larger central disk shows the mean firing rate across all functioning electrodes, without soft-normalization. Data are from the T5-syllables dataset, trial set #23.
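
The soft-normalization used for this visualization can be sketched as follows in Python (a hedged example, not the study’s rendering code; the 1 ms bin width is an assumption for illustration):

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def soft_normalize(rates_hz, bin_ms=1.0, kernel_sd_ms=50.0, offset_hz=10.0):
        """Smooth each electrode's rate with a 50 ms s.d. Gaussian kernel and map
        it toward [0, 1) using a 10 Hz soft-normalization offset, so weakly
        modulated electrodes are not stretched to full scale.
        rates_hz: (time_bins, electrodes) array of firing rates in Hz."""
        smoothed = gaussian_filter1d(rates_hz, sigma=kernel_sd_ms / bin_ms, axis=0)
        rate_range = smoothed.max(axis=0) - smoothed.min(axis=0)
        return (smoothed - smoothed.min(axis=0)) / (rate_range + offset_hz)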

Three observations lead us to believe that this neural activity is related to motor cortical control of the speech articulators (Chartier et al., 2018; Conant et al., 2018; Mugler et al., 2018) rather than perception or language. First, modulation was significantly stronger when speaking compared to after hearing the auditory prompts: the neural population firing rate change compared to the silent condition was 4.03× higher after the go cue compared to after the audio prompt for the T5-syllables dataset, 2.90× for the T8-syllables dataset (Figure 1E), 6.71× for the T5-words dataset, and 2.12× for T8-words (Figure 1—figure supplement 3). Modulation following the audio prompt, although small, was significant when compared to a 1 s epoch just prior to the prompt (p<0.01, sign-rank test, all four datasets). In this study, we are unable to disambiguate whether this prompt-related response reflects auditory perception, movement preparation, or small overt movements preceding vocalization. We will primarily focus on the larger, later neural modulation putatively related to speech production.
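
The neural distance underlying these ratios (Figure 1E and the Methods) is an unbiased, signed comparison of population firing rate vectors. One standard cross-validated estimator with those properties is sketched below in Python; it illustrates the general approach rather than the paper’s exact formula. `rates_a` and `rates_b` are hypothetical (trials × electrodes) firing rate matrices for the two conditions being compared.

    import numpy as np

    def cv_population_distance(rates_a, rates_b, n_splits=100, seed=0):
        """Signed, approximately unbiased distance (in units of Hz) between two
        conditions' population firing rate vectors. A plain Euclidean distance is
        inflated by trial-to-trial noise; dotting difference vectors estimated
        from independent halves of the trials removes that bias, so the expected
        value is near zero when the conditions do not actually differ."""
        rng = np.random.default_rng(seed)
        estimates = []
        for _ in range(n_splits):
            ia, ib = rng.permutation(len(rates_a)), rng.permutation(len(rates_b))
            d1 = rates_a[ia[::2]].mean(0) - rates_b[ib[::2]].mean(0)
            d2 = rates_a[ia[1::2]].mean(0) - rates_b[ib[1::2]].mean(0)
            dot = float(np.dot(d1, d2))
            estimates.append(np.sign(dot) * np.sqrt(abs(dot)))
        return float(np.mean(estimates))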

Second, analysis of an additional dataset in which participant T5 spoke 41 different phonemes revealed that neural population activity showed phonemic structure (Figure 1—figure supplement 5): for example, when phonemes were grouped by place of articulation (Bouchard et al., 2013; Lotte et al., 2015; Moses et al., 2019), population firing rate vectors were significantly more similar between phonemes within the same group than between phonemes in different groups (p<0.001, shuffle test). Third, in both participants, 99 of 120 electrodes that were active during speaking syllables (24 out of 25 sorted neurons) also were active during production of at least one of seven non-speech orofacial movements (Figure 2 and Figure 2—figure supplement 1). We also observed weak but significant firing rate correlations with breathing (Figure 2—figure supplement 2). Modulation for speaking was stronger than for unattended breathing (~4.7×) and instructed breathing (~2.6×), and modulation for attempted arm movements was stronger than for speaking and orofacial movements (~2.8×, Figure 2—figure supplement 3).

Figure 2. The same motor cortical population is also active during non-speaking orofacial movements.

(A) Both participants performed an orofacial movement task during the same research session as their syllables speaking task. Examples of single neuron firing rates during seven different orofacial movements are plotted in colors corresponding to the movements in the illustrated legend above. The ‘stay still’ condition is plotted in black. The same three example neurons from Figure 1D are included here. The other three neurons were chosen to illustrate a variety of observed response patterns. (B) Electrode array maps indicating the number of different orofacial movements for which a given electrode’s −4.5 × RMS threshold crossing rates differed significantly from the stay still condition. Data are presented similarly to the Figure 1B insets. Firing rates on most functioning electrodes modulated for multiple orofacial movements. See Figure 2—figure supplement 1 for individual movements’ electrode response maps. (C) Breakdown of how many neurons’ (top) and electrodes’ TCs (bottom) exhibited firing rate modulation during speaking syllables only (red), non-speaking orofacial movements only (blue), both behaviors (purple), or neither behavior (gray). A unit or electrode was deemed to modulate during a behavior if its firing rate differed significantly from silence/staying still for at least one syllable/movement.


Figure 2—figure supplement 1. Neural correlates of orofacial movements are not spatially segregated in dorsal motor cortex.


The orofacial movements data were analyzed and are presented like the speaking data in Figure 1—figure supplement 4.
Figure 2—figure supplement 2. Dorsal motor cortex correlates of breathing.


(A) We recorded neural data and a breathing proxy (the stretch of a belt wrapped around participant T5’s abdomen) while he performed a BCI cursor task or sat idly (‘unattended’ breathing condition, left) or intentionally breathed in and out based on on-screen instructions (‘instructed’, right). Example continuous breath belt measurements from this T5-breathing dataset are shown. Violet dots mark the identified breath peak times. (B) Gray traces show example respiratory belt measurements for 200 unattended (left) and instructed (right) breaths, aligned to the breath peak. The means of all 727 unattended and 275 instructed breaths are shown in black. (C) Breath-aligned firing rates for three example electrodes’ TCs (mean ± s.e.m) during unattended breathing. Breath-related modulation depth was calculated as the peak-to-trough firing rate difference. Horizontal dashed lines show the p=0.01 modulation depths for a shuffle control in which breath peak times were chosen at random. Significantly greater breath-related modulation in either the unattended or instructed condition was observed on 71 out of 121 electrodes. (D) Breath-aligned firing rates were also calculated for sortable SUA, whose spike waveform snippets are shown as in Figure 1D. Firing rates of three example neurons with large waveforms are shown during unattended breathing. Overall, 17 out of 19 sorted units had significant modulation in either the unattended or instructed condition. This argues against the breath-related modulation being an artifact of breath-related electrode array micromotion causing a change in TCs firing rates by bringing additional units in or out of recording range. (E) Unattended (top) and instructed (bottom) breath-related modulation depth for each functioning electrode’s TCs, presented as in Figure 1B. Two extrema values exceeded the 30 Hz color range maximum. (F) Histograms of modulation depths during unattended breathing (gray), instructed breathing (blue), and speaking-related modulation depths (red) for all functioning electrodes in the T5-breathing and T5-syllables datasets, respectively. Two outlier datums are grouped in a > 60 Hz bin. Dorsal motor cortical modulation was greater for speaking than breathing (all three distributions significantly differed from one another, p<0.001, rank-sum tests). (G) Mean population firing rates (averaged across all functioning electrodes) are shown during unattended (gray) and instructed (blue) breathing, aligned to breath peak, and during speaking (red), aligned to acoustic onset. The vertical offsets of the breathing traces have been bottom-aligned to the speaking trace to facilitate comparison. Note that the mean population rate can obscure counter-acting firing rate increases and decreases across electrodes; panel F provides a complementary description capturing both rate increases and decreases.
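
A hedged Python sketch of the breath-alignment and shuffle control described in panels A–C is given below; `belt_signal`, `firing_rate`, and the sampling rate `fs` are hypothetical variables, and the window lengths are illustrative rather than the study’s exact parameters.

    import numpy as np
    from scipy.signal import find_peaks

    def breath_modulation_depth(firing_rate, peak_samples, fs, window_s=2.0):
        """Peak-to-trough depth of the breath-peak-aligned mean firing rate."""
        half = int(window_s * fs)
        snippets = [firing_rate[p - half:p + half] for p in peak_samples
                    if half <= p <= len(firing_rate) - half]
        aligned_mean = np.mean(snippets, axis=0)
        return float(aligned_mean.max() - aligned_mean.min())

    def shuffle_threshold(firing_rate, n_breaths, fs, n_shuffles=1000, q=0.99):
        """Null (p = 0.01) modulation depth from randomly placed 'breath' times."""
        rng = np.random.default_rng(0)
        half = int(2.0 * fs)
        null = [breath_modulation_depth(
                    firing_rate,
                    rng.integers(half, len(firing_rate) - half, size=n_breaths),
                    fs)
                for _ in range(n_shuffles)]
        return float(np.quantile(null, q))

    # Breath peaks from the belt trace; a minimum spacing avoids double-counting.
    # peak_samples, _ = find_peaks(belt_signal, distance=int(1.5 * fs))
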
Figure 2—figure supplement 3. Dorsal motor cortex modulates more strongly during attempted arm and hand movements than orofacial movements and speaking.


Participant T5 performed a variety of instructed movements. We compared neural activity when attempting orofacial and speech production movements (red), versus when attempting arm and hand movements (black). Note that T5, who has tetraplegia, was able to actualize all the orofacial and speech movements, but not all the arm and hand movements. Each point corresponds to the mean TCs firing rate modulation for a single instructed movement (movements are listed in the order corresponding to their points’ vertical coordinates) in the T5-comparisons dataset. Modulation strength is calculated by taking the firing rate vector neural distance (like in Figure 1E) between when T5 attempted/performed the instructed movement and otherwise similar trials where the instruction was to ‘do nothing’. Bars show the mean modulation across all the movements in each grouping. A rank-sum test was used to compare the distributions of orofacial and speech and arm and hand modulations.

Speech can be decoded from intracortical activity on individual trials

We next performed a decoding analysis to quantify how much information about the spoken syllable or word was present in the time-varying neural activity. Multi-class support vector machines were used to predict the spoken sound (or silence) from single trial TCs and high-frequency local field potential (HLFP) power (Figure 3). Cross-validated prediction accuracies for syllables were 84.6% for T5 (10 classes, mean chance accuracy was 10.1% across shuffle controls) and 54.7% for T8 (11 classes, chance was 8.6%). Word decoding accuracies were 83.5% for T5 (11 classes, chance was 9.1%) and 61.5% for T8 (11 classes, chance was 9.3%). We also used the same method to decode neural activity from 0 to 500 ms after the speech prompt and found that classification accuracies were only marginally better than chance (overall accuracies between 11.1% and 16.6% across the four datasets, p<0.05 versus shuffle controls in three of the four datasets; Figure 3C gray bars). The much higher neural discriminability of syllables and words during speaking rather than after hearing the audio prompt is consistent with the previously enumerated evidence that modulation in this cortical area is related to speech production.
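
As a concrete (and hedged) illustration of this decoding pipeline, the Python sketch below performs leave-one-trial-out multiclass classification; `X` (trials × features, binned spike counts and HLFP power) and `y` (spoken-sound labels) are assumed inputs, and the linear-kernel SVM settings are illustrative rather than the study’s exact configuration.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_predict
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def decode_accuracy(X, y, shuffle_labels=False, seed=0):
        """Leave-one-trial-out classification accuracy; set shuffle_labels=True
        to estimate chance performance with permuted labels."""
        labels = np.asarray(y)
        if shuffle_labels:
            labels = np.random.default_rng(seed).permutation(labels)
        clf = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
        predictions = cross_val_predict(clf, X, labels, cv=LeaveOneOut())
        return float(np.mean(predictions == labels))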

Figure 3. Speech can be decoded from intracortical activity.


(A) To quantify the speech-related information in the neural population activity, we constructed a feature vector for each trial consisting of each electrode’s spike count and HLFP power in ten 100 ms bins centered on AO. For visualization, two-dimensional t-SNE projections of this feature vector are shown for all trials of the T5-syllables dataset. Each point corresponds to one trial. Even in this two-dimensional view of the underlying high-dimensional neural data, different syllables’ trials are discriminable and phonetically similar sounds’ clusters are closer together. (B) The high-dimensional neural feature vectors were classified using a multiclass SVM. Confusion matrices are shown for each participant’s leave-one-trial-out classification when speaking syllables (top row) and words (bottom row). Each matrix element shows the percentage of trials of the corresponding row’s sound that were classified as the sound of the corresponding column. Diagonal elements show correct classifications. (C) Bar heights show overall classification accuracies for decoding neural activity during speech (black bars, summarizing panel B) and decoding neural activity following the audio prompt (gray bars). Each small point corresponds to the accuracy for one class (silence, syllable, or word). Brown boxes show the range of chance performance: each box’s bottom/center/top correspond to minimum/mean/maximum overall classification accuracy for shuffled labels.
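
A minimal sketch of the panel A visualization: each trial’s high-dimensional feature vector is embedded in two dimensions with t-SNE. The PCA pre-projection and perplexity value below are illustrative assumptions, not necessarily those used for the published figure.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    def tsne_embed(X, perplexity=30.0, seed=0):
        """Project (trials, features) neural feature vectors to 2-D for plotting."""
        # PCA pre-reduction speeds up and stabilizes t-SNE on high-dimensional data.
        X_reduced = PCA(n_components=min(50, min(X.shape))).fit_transform(X)
        return TSNE(n_components=2, perplexity=perplexity,
                    random_state=seed).fit_transform(X_reduced)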

During speaking, decoding accuracies for all individual sounds were above chance (p<0.01, shuffle test). Decoding mistakes (Figure 3B) and low-dimensional representations (Figure 3A) tended to follow phonetic similarities (e.g. ba and ga, a and ae). This observation is consistent with previous ECoG studies (Bouchard et al., 2013; Cheung et al., 2016; Livezey et al., 2019; Moses et al., 2019; Mugler et al., 2014), although the larger neural differences we observed between unvoiced k and p and the beginning of their voiced counterparts at the start of ga and ba suggest strong laryngeal tuning (Dichter et al., 2018). These neural correlate similarities may reflect similarities in the underlying articulator movements (Chartier et al., 2018; Lotte et al., 2015; Mugler et al., 2018).

Neural population dynamics exhibit low-dimensional structure during speech

These multielectrode recordings enabled us to observe motor cortical dynamics during speech at their fundamental spatiotemporal scale: neuron spiking activity. Specifically, we examined whether two known key dynamical features of motor cortex firing rates during arm reaching were also present during speaking. Importantly, both of these features were revealed not by looking at individual neurons’ firing rates, but by examining the time courses of population activity ‘components’ that act as lower dimensional building blocks (or condensed summaries) of the many individual neurons’ activities (Gallego et al., 2017; Pandarinath et al., 2018; Saxena and Cunningham, 2019). The first prominent neural population dynamics feature (‘dynamical motif’) we tested for is inspired by previous nonhuman primate (NHP) experiments showing that the neural state undergoes a rapid change during movement initiation which is dominated by a condition-invariant signal (CIS) (Kaufman et al., 2016). In that study, Kaufman and colleagues provide a comprehensive exposition on why a large neural component that is highly invariant across many different arm reaches is a non-trivial feature of neural population data and could, despite its non-specificity, be important to the overall computations being performed. A similar CIS was recently also reported during NHP grasping (Intveld et al., 2018).

The second dynamical motif we tested for follows studies of NHP arm reaches (Churchland et al., 2012; Kaufman et al., 2016) and human point-to-point hand movements (Pandarinath et al., 2015), which showed that subsequent peri-movement neural ensemble activity is characterized by orderly rotatory dynamics. That is, a substantial portion of moment-by-moment firing rate changes can be explained by a simple rotation of the neural state in a plane that summarizes the correlated activity of groups of neurons. These observations, in concert with neural network modeling (Kaufman et al., 2016), have led to a model of motor control in which, prior to movement, inputs specifying the movement goal create attractor dynamics toward an advantageous initial condition (Shenoy et al., 2013). During movement initiation, a large transient input ‘kicks’ the network into a different state from which activity evolves according to rotatory dynamics such that muscle activity is constructed from an oscillatory basis set (akin to composing an arbitrary signal from a Fourier basis set) (Churchland et al., 2012; Sussillo et al., 2015).

We tested whether motor cortical activity during speaking also exhibits these dynamics by applying the analytical methods of Churchland et al. (2012) and Kaufman et al. (2016). These analyses used two different dimensionality reduction techniques (Cunningham and Yu, 2014) to reveal latent low-dimensional structure in the trial-averaged firing rates for different conditions (here, speaking different words). Both methods sought to find a modest number of linear weightings of different electrodes’ firing rates (forming the aforementioned neural population activity ‘components’) that capture a large fraction of the overall variance. This is akin to principal components analysis (PCA), but unlike PCA, each method also looks for a specific form of neural population structure: jPCA (Churchland et al., 2012) seeks components with rotatory dynamics, whereas dPCA (Kaufman et al., 2016; Kobak et al., 2016) decomposes neural activity into CI and condition-dependent (CD) components. Importantly, these methods do not spuriously find the sought dynamical structure when it is not present in the data (Churchland et al., 2012; Elsayed and Cunningham, 2017; Kaufman et al., 2016; Kobak et al., 2016; Pandarinath et al., 2015).

We found that these two prominent population dynamics motifs were indeed also present during speaking. As in Kaufman et al. (2016), the largest dPCA component summarizing each participant’s neural activity during movement initiation was largely CI: this component was 98.7% CI in participant T5, and 87.3% CI in participant T8 (Figure 4A). Figure 4B shows that in T5, this ‘CIS1’ component, which rapidly increased after the go cue, was essentially identical regardless of which word was spoken. In T8, the CIS1 was not as cleanly condition-invariant, but nonetheless showed a similar increase following the go cue for each word. We also found this condition-invariant neural population activity component in all four additional datasets that we examined: T5’s and T8’s syllables task datasets, as well as two additional replication datasets in which participant T5 spoke just five of the words (Figure 4—figure supplement 1). These results were also robust across different choices of how many dPCs to summarize the neural population activity with (Figure 4—figure supplement 2).
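
The percentages quoted above can be understood through the variance split that dPCA performs. The following is a simplified Python sketch of measuring how condition-invariant a single component is, given its trial-averaged time course for each spoken word; it captures the spirit of the CI/CD decomposition rather than the full dPCA algorithm, and the `component` array name is hypothetical.

    import numpy as np

    def ci_variance_fraction(component):
        """component: (conditions, time) trial-averaged projection onto one dPC.
        Returns the fraction of its variance explained by the across-condition
        mean time course (the condition-invariant part); the remainder is
        condition-dependent."""
        centered = component - component.mean()           # remove grand mean
        ci_part = centered.mean(axis=0, keepdims=True)    # shared time course
        cd_part = centered - ci_part                      # condition-dependent residual
        total = np.sum(centered ** 2)
        ci = np.sum(np.broadcast_to(ci_part, centered.shape) ** 2)
        # ci + np.sum(cd_part ** 2) equals total (the cross term vanishes).
        return float(ci / total)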

Figure 4. A condition-invariant signal during speech initiation.

(A) A large component of neural population activity during speech initiation is a condition-invariant (CI) neural state change. Firing rates from 200 ms before to 400 ms after the go cue (participant T5) and 100 ms to 700 ms after the go cue (T8) were decomposed into dPCA components like in Kaufman et al. (2016). Each bar shows the relative variance captured by each dPCA component, which consists of both CI variance (red) and condition-dependent (CD) variance (blue). These eight dPCs captured 45.1% (T5-words) and 8.4% (T8-words) of the overall neural variance, which includes non-task related variability (‘noise’). (B) Neural population activity during speech initiation was projected onto the first dPC dimension; this ‘CIS1’ is the first component from panel A. Traces show the trial-averaged CIS1 activity when speaking different words, denoted by the same colors as in Figure 3B.


Figure 4—figure supplement 1. Further details of neural population dynamics analyses and additional datasets.


(A) Cumulative trial-averaged firing rate variance explained during each dataset’s speech initiation epoch as a function of the number of PCA dimensions (top set of curves) and dPCA dimensions (bottom set of curves) used to reduce the dimensionality of the data. The dotted lines mark the eight dimensions used for dPCA in the condition-invariant signal analyses for panels C-E and in Figure 4. In addition to the T5-words and T8-words datasets (blue curves) shown in Figure 4, this figure also shows results from the T5-syllables and T8-syllables datasets (orange curves) as well as the T5-5words-A and T5-5words-B replication datasets (gold and purple curves). (B) Cumulative trial-averaged firing rate variance explained during all six datasets’ speech generation epochs. The dotted lines mark the six dimensions used for jPCA in the rotatory dynamics analyses for panels E-H and Figure 5. PCA and jPCA curves are one and the same because jPCA operates within the neural subspace found using PCA. (C) Firing rates during the speech initiation epoch of each dataset were decomposed into dPCA components like in Figure 4A. The additional inset matrices quantify the relationships between dPC components as in Kobak et al. (2016): each element of the upper triangle shows the dot product between the corresponding row’s and column’s demixed principal axes (i.e. the dPCA encoder dimensions), while the bottom triangle shows the correlations between the corresponding row’s and column’s demixed principal components (i.e. the neural data projected onto this row’s and column’s demixed principal decoder axes). More red or blue upper triangle elements denote that this pair of neural dimensions are less orthogonal, while more red or blue lower triangle elements denote that this pair of components (summarized population firing rates) have a similar time course. Stars denote pairs of dimensions that are significantly non-orthogonal at p<0.01 (Kendall correlation). (D) Neural population activity projected onto the first dPC dimension, shown as in Figure 4B for all six datasets. (E) The subspace angle between the CIS1 dimension from panels C,D and the top jPC plane from panel F, for each dataset. Across all six datasets, the CIS1 was not significantly non-orthogonal to any of the six jPC dimensions (p>0.01, Kendall correlation) except for the angle between CIS1 and jPC5 in the T5-words dataset (70.5°, p<0.01). (F) Neural population activity for each speaking condition was projected into the top jPCA plane as in Figure 5A, for all six datasets. (G) The same statistical significance testing from Figure 5B was applied to all datasets to measure how well a rotatory dynamical system fit ensemble neural activity. (H) Rotatory neural population dynamics were not observed when the same jPCA analysis was applied on each dataset’s neural activity from 100 to 350 ms after the audio prompt. Rotatory dynamics goodness of fits are presented as in the previous panel.
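
A hedged sketch of the subspace-angle computation in panel E is shown below (assuming `cis1` is the CIS1 encoding vector over electrodes and `jpc_plane` is a matrix whose columns span the top jPC plane; both names are illustrative):

    import numpy as np
    from scipy.linalg import subspace_angles

    def cis_jpc_angle_deg(cis1, jpc_plane):
        """Principal angle (degrees) between the CIS1 dimension and the jPC
        plane; 90 degrees means the dimension is fully orthogonal to the plane."""
        angles = subspace_angles(cis1.reshape(-1, 1), jpc_plane)
        return float(np.degrees(angles.min()))
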
Figure 4—figure supplement 2. Neural population dynamics when viewed across a range of reduced dimensionalities.


(A) Condition-invariant neural population dynamics during speech initiation are summarized as in Figure 4 for the T5-words (top) and T8-words (bottom) datasets, but now varying the number of dPCA components used for dimensionality reduction. The eight dPCs results shown in Figure 4 and Figure 4—figure supplement 1 are outlined with a dashed gray box. (B) Rotatory neural population dynamics during speaking are summarized as in Figure 5 for the T5-words (top) and T8-words (bottom) datasets, but now varying the number of principal components used for the initial dimensionality reduction (after which the jPCA algorithm looks for rotatory planes). The six PCs results shown in Figure 5 and Figure 4—figure supplement 1 are outlined.

We attribute the difference in how condition-invariant the CIS1 component was between the two participants to the much smaller speech task-related neural modulation recorded in participant T8 compared to in T5, as demonstrated in Figure 1—figure supplement 3B and the lower classification accuracies of Figure 3. The practical consequence of T8’s substantially weaker speech-related modulation is that much more of the neural population activity that dimensionality reduction tries to summarize was not task-relevant (i.e. is ‘noise’ for the purpose of these analyses). This lower signal-to-noise ratio can also be appreciated in how the ‘elbow’ of T8’s cumulative neural variance explained by PCA or dPCA components (Figure 4—figure supplement 1A,B) occurs after fewer components and explains far less overall variance.

Lastly, we looked for rotatory population dynamics around the time of acoustic onset. Figure 5A shows ensemble firing rates projected into the top jPCA plane (i.e. the subspace defined by jPC1 and jPC2). In participant T5, all conditions’ neural state trajectories rotated in the same direction (similarly to Churchland et al., 2012; Pandarinath et al., 2015), and rotatory dynamics could explain substantial variance in how population activity evolved moment-by-moment during speaking. Application of a recent population dynamics hypothesis testing method (Elsayed and Cunningham, 2017) revealed that this rotatory structure was significantly stronger than expected by chance in T5’s speaking data, but not in T8’s speaking data (Figure 5B) or when this analysis was applied to neural activity following the audio prompt (Figure 4—figure supplement 1H). As was the case for the condition-invariant dynamics, these results were also consistent across additional datasets (Figure 4—figure supplement 1E–H) and across the choice of how many PCA dimensions in which to look for rotatory dynamics (Figure 4—figure supplement 2B). We again attribute the observed between-participants difference to T8’s smaller measured neural responses during speech, which likely reflect his older arrays’ lower signal quality. Consistent with this, T8’s BCI computer cursor control performance was also substantially worse than T5’s (Pandarinath et al., 2017). Other factors that could also have contributed to T8’s reduced speech-related neural activity include his tendency to speak quietly and with less clear enunciation (consistent with Jiang et al., 2016), array placement differences, and differences in cortical maps between individuals (Farrell et al., 2007).

Figure 5. Rotatory neural population dynamics during speech.


(A) The top six PCs of the trial-averaged firing rates from 150 ms before to 100 ms after acoustic onset in the T5-words and T8-words datasets were projected onto the first jPCA plane like in Churchland et al. (2012). This plane captures 38% of T5’s overall population firing rates variance, and rotatory dynamics fit the moment-by-moment neural state change with R2 = 0.81 in this plane and 0.61 in the top 6 PCs. In T8, this plane captures 15% of neural variance, with a rotatory dynamics R2 of only 0.32 in this plane and 0.15 in the top six PCs. (B) Statistical significance testing of rotatory neural dynamics during speaking. The blue vertical line shows the goodness of fit of a rotatory dynamical system explaining the moment-to-moment evolution of the neural state in the top six PCs. The brown histograms show the distributions of this same measurement for 1000 neural population control surrogate datasets generated using the tensor maximum entropy method of Elsayed and Cunningham (2017). These shuffled datasets serve as null hypothesis distributions that have the same primary statistical structure (mean and covariance) as the original data across time, electrodes, and word conditions, but not the same higher order statistical structure (e.g. low-dimensional rotatory dynamics).
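
To make the rotatory-dynamics goodness of fit concrete, the Python sketch below fits a skew-symmetric (purely rotational) linear dynamical system to the low-dimensional neural state by least squares and reports an R2, which is the core computation in jPCA. It omits the preprocessing (trial averaging, soft-normalization, PCA to six dimensions) and the tensor-maximum-entropy surrogate test; `X` and `dX` are hypothetical (samples × dims) arrays of neural states and their time derivatives pooled across conditions and time steps.

    import numpy as np

    def skew_symmetric_basis(d):
        """Basis for the space of d x d skew-symmetric matrices."""
        basis = []
        for i in range(d):
            for j in range(i + 1, d):
                B = np.zeros((d, d))
                B[i, j], B[j, i] = 1.0, -1.0
                basis.append(B)
        return basis

    def fit_rotatory_dynamics(X, dX):
        """Least-squares fit of dX ~ X @ M.T with M constrained skew-symmetric."""
        basis = skew_symmetric_basis(X.shape[1])
        # Each basis matrix contributes one column of the design matrix.
        A = np.column_stack([(X @ B.T).ravel() for B in basis])
        coef, *_ = np.linalg.lstsq(A, dX.ravel(), rcond=None)
        M = sum(c * B for c, B in zip(coef, basis))
        residual = dX - X @ M.T
        r2 = 1.0 - np.sum(residual ** 2) / np.sum((dX - dX.mean(axis=0)) ** 2)
        return M, float(r2)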

Videos 2 and 3 show the temporal relationship between these two dynamical motifs – an initial condition-invariant neural state shift, followed by rotatory dynamics. Neural state rotations occurred after the condition-invariant translation; by comparison, in Kaufman et al. (2016) the neural rotations also lagged the CIS shift, but in the monkey arm reaching data these rotations either partially overlapped with, or more immediately followed, the CIS shift. We note that existing models of how a condition-invariant signal ‘kicks’ dynamics into a different state space region where rotatory dynamics unfold (Kaufman et al., 2016; Sussillo et al., 2015) do not require that the CIS and rotatory dynamics must be orthogonal, but in these data we did observe that the CIS1 and jPCA dimensions were largely orthogonal (Figure 4—figure supplement 1E).

Video 2. The progression of neural population activity during the prompted words task is summarized with dimensionality reduction chosen to highlight the condition-invariant ‘kick’ after the go cue, followed by rotatory population dynamics.


T5-words dataset neural state space trajectories are shown from 2.5 s before go cue to 2.0 s after go. Each trajectory corresponds to one word condition’s trial-averaged firing rates, aligned to the go cue. The neural states are projected into a three-dimensional space consisting of the CIS1 dimension (as in Figure 4) and the first two jPC dimensions (similar to Figure 5, except that for this visualization we enforced that the jPC plane be orthogonal to the CIS1; see Materials and methods). The trajectories change color based on the task epoch: gray is before the audio prompt, blue is after the prompt, and then red-to-green is after the go cue, with conditions ordered as in Figure 5.

Video 3. The same neural trajectories as Video 2, but aligned to acoustic on (AO), are shown from 3.5 s before AO to 1.0 s after AO.


Discussion

There are three main findings from this study. First, these data suggest that ‘hand knob’ motor cortex, an area not previously known to be active during speaking (Breshears et al., 2015; Dichter et al., 2018; Leuthardt et al., 2011; Lotte et al., 2015), may in fact participate in, or at least receive correlates of, the neural computations underlying speech production. Speech-related single-neuron modulation might have been missed by previous studies due to the coarser resolution of ECoG (Chan et al., 2014). If this finding holds true in the wider population, this would underscore that the familiar ‘motor homunculus’ (Penfield and Boldrey, 1937) is overly simplistic. It is generally recognized that motor cortex does not rigidly follow a sequential point-to-point somatotopy, and indeed, Penfield and colleagues were aware of this and intended for their diagram to be a simplified summary of results showing partially overlapping motor fields that also varied substantially across individuals (Catani, 2017). However, the patchy mosaicism amongst nearby body parts in the current view of precentral gyrus organization still features a dorsal-to-ventral progression and separation of the major body regions (leg, arm, head) (Farrell et al., 2007; Schieber, 2001).

The presence of neurons responding to mouth and tongue movements in the dorsal ‘arm and hand’ area of motor cortex indicates that sensorimotor maps for different body parts are even more widespread and overlapping than previously thought. Given our previous finding that activity from these same arrays encodes intended arm and hand movements (Pandarinath et al., 2017), these observations are consistent with the hypothesis that the systems for speech and manual gestures are interlocked (Gentilucci and Stefani, 2012; Rizzolatti and Arbib, 1998; Vainio et al., 2013). However, emerging work from our group showing that neurons in this area also modulate during attempted movements of the neck and legs (Willett et al., 2019) suggests that much of the body is represented (to varying strengths) in dorsal motor cortex. Thus, the observed neural overlap between hand and speech articulators may be a consequence of distributed whole-body coding, rather than a privileged speech-manual linkage.

Our data suggest that the observed neural activity reflects movements of the speech articulators (the tongue, lips, jaw, and larynx): modulation was greater during speaking than after hearing the prompt; the same neural population modulated during non-speech orofacial movements; and in T5, the neural correlates of producing different phonemes grouped according to these phonemes’ place of articulation. We also found that firing rates showed modest correlation with T5’s unattended and instructed breathing, which invites the question of how this activity relates to the precise control of breathing necessary for speaking and whether breath-related activity differs depending on behavioral context. A deeper understanding of how motor cortical spiking activity relates to complex speaking behavior will require future work connecting it to continuous articulatory (Chartier et al., 2018; Conant et al., 2018; Mugler et al., 2018) and respiratory kinematics and, ideally, the underlying muscle activations.

An important unanswered question, however, is to what extent these results were potentially influenced by cortical remapping due to tetraplegia. While we cannot rule this out, we believe that remapping of face representation to the hand knob area is unlikely. Despite these participants’ many years of paralysis, the sites we recorded from still strongly modulate during attempted hand and arm movements (Ajiboye et al., 2017; Brandman et al., 2018; Pandarinath et al., 2017). We also verified in participant T5 that modulation during attempted arm movements was stronger than during speech production. Our ongoing work also indicates that this area modulates during attempts to move other body parts (e.g. the leg) which, like the arm, are also paralyzed (Willett et al., 2019). Taken together, these results are inconsistent with this area being ‘taken over’ by functions related to the participants’ remaining capability to make orofacial movements. Furthermore, motor cortical remapping following arm amputation was recently shown to be smaller than previously thought (Wesselink et al., 2019), and in particular much smaller than what would be needed to move lip representations to hand cortex (Makin et al., 2015). On the sensory side, emerging evidence suggests that cortical reorganization following injury in adults is more limited than previously thought (Makin and Bensmaia, 2017), and a recent microstimulation study in the hand somatosensory cortex of a person with tetraplegia did not find functional reorganization (Flesher et al., 2016). While these threads of evidence argue against remapping, definitively resolving this ambiguity would require intracortical recording from this eloquent brain area in able-bodied people.

Assuming that these results are not due to injury-related remapping, we are left with the question of why this speech-related activity is found in dorsal ‘arm and hand’ motor cortex. Speech is spared following lesions in this area (Chen et al., 2006; Tei, 1999), indicating that it is not necessary for speech production. Nonetheless, it is possible that dorsal motor cortex plays some supporting role in speaking, perhaps contributing to more demanding speaking tasks, or that this activity reflects speech efference copy for coordinating orofacial and upper extremity movements. This would be in line with theoretical arguments that high dimensional representations resulting from mixed selectivity – in this case, both within major body regions (a given neuron being tuned for multiple arm movements or for multiple orofacial movements) and across major body regions (neurons being tuned for both arm and face movements) – enable more complex computations (Fusi et al., 2016) such as coordinating movements across the body. We anticipate that it will require substantial future work to understand why speech-related activity co-occurs in the same motor cortical area as arm and hand movement activity, but that this line of inquiry may reveal important principles of how sensorimotor control is distributed across the brain (Musall et al., 2019; Stringer et al., 2019).

Our second main finding is that, based on offline decoding results, intracortical recordings show promise as signal sources for BCIs to restore speech to people with some forms of anarthria. Decoding the neural correlates of attempted speech production (Brumberg et al., 2011) into audible sounds or text may be more desirable than approaches that decode covert internal speech (Leuthardt et al., 2011; Martin et al., 2016) or more abstract elements of language (Chan et al., 2011; Yang et al., 2017) because decoding attempted movements leverages existing neural machinery that separates internal monologue and speech preparation from intentional speaking. The present results compare favorably to previously published decoding accuracies using ECoG (Mugler et al., 2014; Ramsey et al., 2018) despite our dorsal recording locations likely being suboptimal for decoding speech. Multi-electrode arrays placed in ventral motor cortex would be expected to yield even better decoding accuracies. Furthermore, recent order-of-magnitude advances in the number of recording sites on intracortical probes (Jun et al., 2017) point to a path that stretches far forward in terms of scaling the number of distinct sources of information (neurons) for speech BCIs.

That said, these results are only a first step in establishing the feasibility of speech BCIs using intracortical electrode arrays. We decoded amongst a limited set of discrete syllables and words in participants who are able to speak; future studies will be needed to assess how well intracortical signals can be used to discriminate between a wider set of phonemes (Brumberg et al., 2011; Mugler et al., 2014), in the absence of overt speech (Brumberg et al., 2011; Martin et al., 2016), and to synthesize continuous speech (Akbari et al., 2019; Anumanchipalli et al., 2019; Makin et al., 2019). We also observed worse decoding performance in participant T8, highlighting the need for future studies in additional participants to sample the distribution of how much speech-related neural modulation can be expected, and what speech BCI performance these signals can support.

Our third main finding is that two motor cortical population dynamical motifs present during arm movements were also significant features of speech activity. We observed a large condition-invariant change at movement initiation in both participants, and rotatory dynamics during movement generation in the one of the two participants whose arrays recorded substantially more modulation. We speculate that these neural state rotations are well-suited for generating descending muscle commands driving the out-and-back articulator movements that form the kinematic building blocks of speech (Chartier et al., 2018; Mugler et al., 2018). The presence of these dynamics during both reaching and speaking could indicate a conserved computational mechanism that is ubiquitously deployed across multiple behaviors to shift the circuit dynamics from withholding movement to generating the appropriate muscle commands from an oscillatory basis set. Testing and refining this hypothesis calls for examining whether these two dynamical motifs are present across an even wider range of behaviors and body parts. For instance, there is emerging evidence that rotatory dynamics may be absent in movements with a greater role of sensory feedback, such as hand grasping (Suresh et al., 2019).

This interpretation should also be tempered by the major unresolved question of whether these dynamics in dorsal motor cortex play a causal role in speaking and/or echo similar dynamics in other areas, such as ventral motor cortex, which are more directly involved in speech (Bouchard et al., 2013). An alternative interpretation is that if dorsal motor cortex merely receives an efference copy or ‘coordination’ signal about speech articulator movements, its dynamics may resemble those during arm reaching because this is what the inherent properties of the local circuit are set up to generate – even if in the speech case, this activity is not helping construct muscle activities. Testing these hypotheses will require future research involving recording from the speech articulator muscles (analogous to recording from arm muscles in Churchland et al., 2012), causally stimulating the circuit (Dichter et al., 2018), and examining whether these neural ensemble dynamical motifs are present during speech production in ventral (speech) motor cortex.

Materials and methods

Participants

The two participants in this study were enrolled in the BrainGate2 Neural Interface System pilot clinical trial (ClinicalTrials.gov Identifier: NCT00912041). The overall purpose of the study is to obtain preliminary safety information and demonstrate proof of principle that an intracortical brain-computer interface can enable people with tetraplegia to communicate and control external devices. Permission for the study was granted by the U.S. Food and Drug Administration under an Investigational Device Exemption (Caution: Investigational device. Limited by federal law to investigational use). The study was also approved by the Institutional Review Boards of Stanford University Medical Center (protocol #20804), Brown University (#0809992560), University Hospitals of Cleveland Medical Center (#04-12-17), Partners HealthCare and Massachusetts General Hospital (#2011P001036), and the Providence VA Medical Center (#2011–009). Both participants gave informed consent to the study and publications resulting from the research, including consent to publish photographs and audiovisual recordings of them.

Participant ‘T5’ (male, right-handed, 64 years old at the time of the study) was diagnosed with C4 AIS-C spinal cord injury 10 years prior to these research sessions. He retained the ability to weakly flex his left elbow and fingers and some slight and inconsistent residual movement of both the upper and lower extremities. T5 was able to speak normally and converse naturally without hearing assistance, but had some trouble hearing from his left ear.

Participant ‘T8’ (male, right-handed, 56 years old at the time of the study) was diagnosed with C4 AIS-A spinal cord injury 11 years prior to these sessions. He retained restricted and non-functional voluntary shoulder girdle motion on both sides, and non-functional voluntary finger extension on his left side. He had no sensation below the shoulder. T8 was able to speak normally and converse naturally with the assistance of hearing aids in both his ears.

Prompted speaking tasks

Participants performed a syllables task consisting of discrete trials in which they spoke out loud one of 10 different phonemes or consonant-vowel syllables in response to an auditory prompt. These prompts were i (as in ‘beet’); ae (as in ‘bat’); a (as in ‘bot’); u (as in ‘boot’); ba; da; ga; sh (as in the start of ‘shot’), and the unvoiced k and p. All pronunciations were American English. Video 1 provides a continuous audio recording of one set of each type of syllables task trial.

Participants sat comfortably in a chair facing a microphone in a quiet room. They were instructed to refrain from attempting movements or speaking during trials except when prompted to speak by custom experiment control software written in MATLAB (The Mathworks). During trials, they were also asked to fixate on the same object in front of them. Each trial began with two beeps to alert the participant that the trial was starting. Approximately 1 s after the start of the second beep, a pre-recorded syllable prompt was played via computer speakers. Two clicks played ~2 s after the start of the prompt served as the go cue that instructed the participant to speak back the prompted sound. The next trial started 2.8 s after the start of the second click. There was also an eleventh ‘silent’ condition which was identical to the spoken syllables trials, except that instead of playing a syllable prompt, the speakers played a nearly-silent audio file consisting of ambient background noise recorded in the same environment as the syllable prompts. The participants had been previously instructed not to say anything in response to this silent prompt.

The task was performed in blocks consisting of 10 trial sets. Each set contained 11 trials: one trial of each syllable, plus silence, presented in a randomized order. After the task was explained to each participant, he was given time to practice a few sets of the task until he indicated that he was ready to begin data collection. At the end of each set, we paused the task until the participant indicated that he was ready to continue. These inter-set pauses typically lasted less than 10 s. Participants performed three consecutive blocks of the task during a research session, with longer pauses of several minutes between blocks during which we encouraged the participant to rest, adjust his posture for comfort, and take a drink of water.

Both the audio prompts played by the experiment control computer, and the participant’s voice, were recorded by the microphone (Shure SM-58). This audio signal was recorded via the analog input port of the electrophysiology data acquisition system and digitized at 30 ksps together with the raw neural data (see Neural Recording section). Each trial’s acoustic onset time (AO) was manually determined by visual and auditory inspection of the recorded audio data. During this review, we also excluded infrequent trials where the participant spoke at the wrong time or when the trial was interrupted (for example, if a caregiver entered the room). Isolated sounds can be difficult to discriminate, and our participants sometimes misheard a syllable prompt as a phonetically similar prompt. In particular, T5 misheard the majority of da prompts as ga (or occasionally as ba). Both participants made a few other substitutions between similar syllables. In this study, we were interested in the neural correlates of preparing and then generating speech, which should reflect the syllable that the participant perceived. We therefore labeled these misheard trials based on the spoken, rather than prompted, syllable for subsequent analyses. This left an insufficient number of T5 da trials for subsequent neural analyses; thus, there are 11 conditions shown in T8’s Figure 1 firing rate plots and Figure 3 confusion matrices, but only 10 conditions for T5. The number of trials analyzed for each participant, after excluding trials and re-labeling misheard trials as described above, were: silent (30 trials for T5, 30 trials for T8); i (30, 28); u (30, 31); ae (28, 30); a (30, 30); ba (31, 29); ga (50, 34); da (0, 27); k (30, 27); p (30, 33); sh (30, 30). We refer to these datasets as ‘T5-syllables’ and ‘T8-syllables’.

Participants also performed a words task which was identical to the syllables task except that they heard and repeated back one of 10 short words, rather than syllables, in response to the auditory prompt. Each participant performed three blocks of ten repetitions of each word during one research session. We refer to these datasets as ‘T5-words’ and ‘T8-words’. Two consecutive trials were excluded from the T8-words dataset because of a large electrical noise artifact across almost all electrodes. The specific words, and the number of trials analyzed for each participant, were: ‘beet’ (30 T5 trials, 29 T8 trials); ‘bat’ (30, 29); ‘bot’ (30, 28); ‘boot’ (30, 30); ‘dot’ (30, 29); ‘got’ (29, 29); ‘shot’ (29, 28); ‘keep’ (30, 30); ‘seal’ (30, 30); ‘more’ (30, 30). As with the syllables task, there was also a silent condition (30 T5 trials, 30 T8 trials). During two additional research sessions (as part of a follow-up study), participant T5 performed the words task with only five of the 10 words. The conditions and trial counts in these two replication datasets, which we refer to as ‘T5-5words-A’ and ‘T5-5words-B’, were: ‘seal’ (33 trials in T5-5words-A, 34 trials in T5-5words-B); ‘shot’ (34, 34); ‘more’ (33, 34); ‘bat’ (34, 33); ‘beet’ (34, 34); and a silent condition (34, 34).

Silent condition trials were assigned a ‘faux AO’ so that neural data from comparable epochs of silent and spoken trials could be visualized and analyzed (for example, for generating trial-averaged, AO-aligned firing rates in Figure 1 or for decoding silent trials’ neural activity in Figure 3). Specifically, each silent trial’s AO was set to equal the mean AO (relative to the go cue) for all the spoken syllables or words during the same block.

Orofacial movement task

Participants also performed an orofacial movement task with a trial structure similar to that of the syllables and words tasks. Seven different movement conditions were instructed with auditory prompts: ‘mouth open’, ‘lips forward’, ‘lips back’, ‘tongue right’, ‘tongue down’, ‘tongue up’, and ‘tongue left’. An additional ‘stay still’ condition was analogous to the silent condition of the syllables and words tasks. Prior to the first block of the orofacial task, a researcher explained the prompts to the participant, demonstrated the movements, and ran the participant through a few practice sets. Due to clinical trial protocols, we did not collect kinematic tracking data such as electromagnetic midsagittal articulography (Chartier et al., 2018) or ultrasound recordings (Conant et al., 2018). A video recording of the participants’ faces (without markers) did allow the researchers to confirm that the participants were making the instructed movement with acceptable timing precision. Given this constraint, we limited our use of these data to broadly testing for neural responses during orofacial movements, rather than quantifying precise moment-by-moment relationships between neural activity and kinematics.

Similar to the syllables and words tasks, an orofacial movement trial began with two ready beeps, after which the computer speaker played a movement prompt (e.g. ‘lips forward’). This was followed by the pair of go clicks; the participants were previously informed that they should begin moving after the second click. Approximately 1.9 s after the go cue click, the experiment control system played the verbal command ‘return’, which instructed the participant to return to a neutral orofacial posture (e.g. close the mouth after ‘mouth open’, move the tongue left after ‘tongue right’). The trial ended ~1.9 s after the start of ‘return’. The return cue ensured that there was a known epoch after the movement go cue during which the participant was not yet returning. It also provided the participant with dedicated time to return to a neutral orofacial position, so that all trials would start from roughly the same posture. For T8, the ‘return’ instruction was immediately followed by a go click. However, we observed that T8 started the return movement upon hearing ‘return’ rather than waiting for the go click. We therefore removed the return go click prior to T5’s research sessions, and instead instructed T5 to start the return movement when he heard ‘return’. In the present study, we did not examine the return portion of the orofacial movement task.

Each participant’s orofacial movements and syllables datasets were collected on the same day during the same research session; three blocks of the orofacial movement task immediately followed three blocks of the syllables task. We will refer to these orofacial movements task datasets as ‘T5-movements’ and ‘T8-movements’. No trials were excluded from these datasets; thus, there were 30 trials of each condition for each participant.

Many words task

During an additional research session, participant T5 performed a many words task in which he spoke 420 unique words (from Angrick et al., 2019) designed to broadly sample American English phonemes. These words were visually prompted, with one word appearing per trial. Each trial started with an instruction period in which a red square appeared in the center of a computer screen facing the participant. White text above the square instructed what word the participant should say once given a go cue (e.g. ‘Prepare: ‘Dog’’). This instruction period lasted 1.2 to 1.8 s (mean 1.4 s, exponential distribution) after which the square turned green, the text changed to ‘Go’, and an audible beep was played. This served as the go cue for T5 to speak out loud the instructed word. A second beep occurred 1.5 s later, which marked the end of the trial. The next trial began 1 s later. The 420 words were divided into four sets, with each set spoken during a continuous block of trials with short breaks between blocks. Each word set was repeated three times during this research session, with a given set’s words appearing in a different random order during each block. We call this the ‘T5-phonemes’ dataset.

Breath measurement

T5’s breath-related abdomen movements were measured with a piezo respiratory belt transducer (model MLT1132, ADInstruments). The stretch sensor was wrapped around his abdomen at the point where it maximally expanded during breathing. Analog voltage signals from the belt were input to the neural signal processor via one of its analog input channels. These data were digitized at 30 ksps along with the neural data. Our goal was to test whether there is breath-correlated neural activity during ‘unattended’ breathing (i.e. natural ‘background’ breathing, when the participant was not consciously attending to his breath) and during consciously attended ‘instructed’ breathing. Both the unattended and instructed conditions were collected during the same research session, and we refer to this as the ‘T5-breathing’ dataset.

For the unattended breathing condition, we recorded neural and breath proxy measurements while T5 performed a BCI computer cursor task as part of a different study, and during an interval where he was resting quietly after completing the BCI task. For the instructed breathing task, we recorded neural and breath proxy measurements while T5 performed a cued breathing task that followed a structure similar to that of the many words task described in the previous section. On each trial, the on-screen instruction text was either ‘Prepare: Breathe in’ or ‘Prepare: Breathe out’. The order of these two trial types was randomized within consecutive two-trial sets, such that breaths in and breaths out were counterbalanced and no more than two out breaths or two in breaths could be prompted in a row. After a random delay of 1.2 to 1.6 s (mean 1.4 s, exponential distribution), the go cue instructed the participant to breathe in or out according to the instruction. After 1.5 s, an audible beep and the on-screen text changing to ‘Return’ instructed the participant to return to a neutral lung inflation position. ‘Return’ stayed on screen for 1.5 s, after which the inter-trial interval was 1 s. A block consisted of 12 trials, after which the participant was given a chance to take a break, relax, and breathe naturally before the next block. The participant reported that this task was comfortable and that he was able to match his breaths to the instructions without difficulty.

Movement comparisons task

The purpose of this task, which was performed on a separate day from the other datasets, was to compare the neural modulation when making orofacial movements and speaking, versus when attempting to make arm and hand movements. The task had a similar visually instructed structure to the instructed breathing task. During the instructed delay period, text displayed the upcoming movement, for example, ‘Prepare: Say Ba’, or ‘Prepare: Open Hand’. There was also a ‘Prepare: Do Nothing’ instruction, which otherwise had the same trial structure as the instructed movements. After a random delay period of between 1400 and 1800 ms, the go cue appeared. During this epoch, T5 attempted to make the instructed movement as best as he could. This resulted in complete movements for all the orofacial and speaking movements and ‘shoulder shrug’, partial movements for some of the arm movements (e.g. ‘flex elbow in’), and no overt movement for the other arm movements (e.g. ‘close hand’, ‘thumb up’). We analyzed neural data from 200 ms to 600 ms after the go cue. We note that insofar as there was somatosensory and proprioceptive feedback only during the actualized movements, this would be expected to increase the observed neural modulation to orofacial movements and speaking, and decrease the modulation to attempted arm and hand movements. The go cue stayed on for 1500 ms. This was followed by a return period in which the text changed to ‘Return’; during this epoch, the participant was instructed to return his body to a neutral posture. Thirty-two trials were collected for each movement type. We refer to this as the ‘T5-comparisons’ dataset.

Neural recording

Both participants had two 96-electrode Utah arrays (1.5 mm electrode length, Blackrock Microsystems) neurosurgically placed in the dorsal ‘hand knob’ area of the left (motor dominant) hemisphere’s motor cortex. Surgical targeting was stereotactically guided based on prior functional and structural imaging (Yousry et al., 1997), and subsequently confirmed by review of intra-operative photographs. T5 and T8 had arrays placed 14 and 34 months, respectively, prior to the present study’s prompted words, syllables, and orofacial movements tasks. The T5-breathing and T5-comparisons datasets were recorded 26 months after array placement, the T5-5words-A and T5-5words-B datasets were recorded 28 months after array placement, and the T5-phonemes dataset was recorded 29 months after array placement. Arrays were placed in areas anticipated to have arm movement-related activity because two goals of the clinical trial are 1) testing the feasibility of intracortical BCI-based communication using point-and-click keyboards and 2) restoring reach and grasp function via control of a robotic arm or functional electrical stimulation. We note that these implant sites are distinct from the closest known speech area, which is the dorsal laryngeal motor cortex (Bouchard et al., 2013; Dichter et al., 2018). In this study, we looked for neural correlates of speaking in dorsal motor cortex. To help contextualize the results, here we summarize the other behaviors associated with modulation of the neural activity recorded by these same arrays. Our previous studies have reported that T5 and T8 controlled BCI computer cursors by attempting movements of their arm and hand (Brandman et al., 2018; Pandarinath et al., 2017). T8 was also able to use intended arm movements to command movements of his own paralyzed arm via functional electrical stimulation (Ajiboye et al., 2017). We also recorded movement task outcome error signals from T5’s arrays; these signals indicated whether the participant succeeded or failed at acquiring a target using a BCI-controlled cursor (Even-Chen et al., 2018).

Neural signals were recorded from the arrays using the NeuroPort system (Blackrock Microsystems). Voltage was measured between each of the 96 electrodes’ uninsulated tips and that array’s reference wire. Wire bundles ran from each array to cranially-implanted connector pedestals. During research sessions, a ‘patient cable’ with a unity gain pre-amplifier was connected to each array’s corresponding pedestal and carried signals to an isolated unity gain front-end amplifier. These signals were analog filtered from 0.3 Hz to 7.5 kHz, digitized at 30 kHz (250 nV resolution), and sent to the neural signal processor via fiber-optic link. As mentioned earlier, amplified analog voltage data from the microphone were input to the neural signal processor and were digitized time-locked with the neural signals. All these digitized data were sent over a local network to a connected PC where they were recorded to disk for subsequent analysis.

The naming scheme for neurons or electrodes in figures is <participant>_<array #>.<electrode #>. For example, 'neuron T5_2.4' in Figure 1 refers to a participant T5 neuron identified on the second array (which is the more medial of each participant’s two arrays) on electrode #4 (according to the manufacturer’s electrode numbering scheme).

For both participants, we did not observe major differences between the two arrays, and we confirmed that the neural population analysis results (ensemble modulation to speech/movements/breathing, phoneme neural correlate similarities, speech decoding, condition-invariant and rotatory population dynamics) were similar when data from each array were analyzed separately. We therefore pool together data from both arrays in all the presented results.

Neural signal processing

Neuronal action potentials (spikes) were detected as follows. We first applied common average re-referencing to each electrode within an array by subtracting, at each time sample, the mean voltage across all electrodes on that array. These voltage signals were then filtered with a 250 Hz asymmetric FIR high-pass filter designed to extract spike activity from this type of array (Masse et al., 2014). To measure single unit activity (SUA), time-varying voltages were manually ‘spike sorted’ by an experienced neurophysiologist using Plexon Offline Spike Sorter v3. This process identified action potentials belonging to putative individual neurons amongst the high amplitude voltage deviation events. Occasionally, the same action potential can be recorded on multiple electrodes (this could happen if a neuron is very large, if an axon passes multiple electrodes, or if there is some electrical cross-talk in the recording hardware). To prevent creating duplicate single neuron units, we excluded ‘cross-talk units’ if their spike time series (using 1 ms binning) had greater than 0.5 correlation with another unit’s. When this happened, we kept the unit with the better spike sorting isolation. Unless otherwise stated, time-varying firing rate plots, also known as peristimulus time histograms (such as in Figure 1D), were constructed by smoothing spike trains with a 25 ms s.d. Gaussian kernel and averaging continuous-valued firing rates across trials of the same behavioral condition.
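
To make these pre-processing steps concrete, the following MATLAB sketch (with hypothetical variable names) illustrates common average re-referencing of one array’s raw voltages and Gaussian-kernel smoothing of a binned spike train into a firing rate; it is an illustration only, and the actual asymmetric FIR filter design (Masse et al., 2014) is not reproduced here.

% Minimal sketch, not the original analysis code. rawV: samples x 96 raw voltage
% matrix for one array (30 ksps); spikes: 1 ms-binned spike counts for one unit.
carV = rawV - mean(rawV, 2);            % common average re-reference within the array
sd_ms = 25;                             % Gaussian kernel s.d. for firing rate smoothing
t = -4*sd_ms : 4*sd_ms;                 % kernel support, in ms
kern = exp(-t.^2 / (2*sd_ms^2));
kern = kern / sum(kern);                % unit-area kernel
rate_hz = conv(double(spikes(:))', kern, 'same') * 1000;   % spikes/ms -> Hz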

Spike sorting allows us to make statements about the properties of individual motor cortical neurons (for example, how many syllables they modulate to, as in Figure 1—figure supplement 4B). However, a limitation of spike sorting is that action potential event ‘clusters’ with insufficient isolation from other clusters are discarded. For chronic multielectrode array recordings, this can mean that activity recorded from the majority of electrodes is not analyzed, despite these neural signals having a strong relationship with the behavior of interest. This problem is particularly acute in human neuroscience, where replacing arrays, or using newer methods that provide a higher SUA yield (for example high-density probes or optical imaging), is not currently possible. Relaxing the constraint that action potential events must be unambiguously from the same neuron and instead analyzing voltage threshold crossings (TCs) is an effective way to substantially increase the information yield of chronic electrode arrays. In this study, we examined TCs in a number of analyses. Decoding TCs or other non-SUA signals has become standard practice in the intracortical BCI field (e.g. Ajiboye et al., 2017; Brandman et al., 2018; Collinger et al., 2013; Even-Chen et al., 2018; Pandarinath et al., 2017). This method also provides information about the dynamics of the neural state (i.e. it can be used to make scientific statements about ensemble activity under many conditions) despite combining spikes that may arise from one or more neurons; we provide empirical and theoretical justifications in Trautmann et al. (2019). In the present study, when we refer to an ‘electrode’s’ firing rate, we mean TCs recorded from that electrode. When we refer to a neuron’s firing rate, we mean sorted single unit activity. Figure 1—figure supplement 2 shows example TCs firing rates, including from the same electrodes that the example neurons in Figure 1 were sorted from.

A threshold of −4.5 × root mean square (RMS) voltage was used for all analyses and visualizations except for the t-SNE visualization and decoding analyses shown in Figure 3. This threshold choice is somewhat arbitrary but is conservative; it accepts large voltage deviations indicative of action potentials from one or a few neurons near the electrode tip. For the Figure 3 analyses, we used a more relaxed threshold of −3.5 × RMS because we found that this led to slightly better classification performance in a separate pilot dataset (consisting of T5 speaking five words and syllables, collected a month prior to the datasets reported here) which we used for choosing hyperparameters. The better performance of a less restrictive voltage threshold is consistent with collecting more information by accepting spikes from a potentially larger pool of neurons (Oby et al., 2016). This trade-off was acceptable because for these engineering-minded decoding analyses, we were less concerned about the possibility of missing tuning selectivity or fast firing rate details due to combining spikes from more neurons.
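
As an illustration only, threshold crossing detection on one electrode’s high-pass filtered voltage could be sketched as follows; hpV and fs are hypothetical variable names, and the recording system’s own event extraction may differ in detail.

% Minimal sketch. hpV: high-pass filtered voltage trace for one electrode; fs = 30000 Hz.
thresh = -4.5 * sqrt(mean(hpV.^2));            % -4.5 x RMS (a -3.5 x RMS threshold was used for Figure 3)
below = hpV(:) < thresh;
onsets = find(diff([false; below]) == 1);      % first sample of each downward crossing
tcRateHz = numel(onsets) / (numel(hpV) / fs);  % mean threshold crossing rate, in Hz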

Electrodes with TCs firing rates of less than 1 Hz (at a −4.5 × RMS threshold) were considered non-functioning and were excluded from analyses unless there was well-isolated SUA on the electrode. This electrode exclusion applied to both spikes and the local field potential signal described below. Electrodes having TCs time series with greater than 0.5 correlation with another electrode’s were marked for cross-talk de-duplication. To determine which electrode to keep, we chose the one that had the fewest spikes co-occurring (1 ms bins) with the other electrode(s)’ (i.e. we kept the electrode with putatively more unique information).

For the neural decoding analyses (Figure 3), we also extracted a high-frequency local field potential (HLFP) feature from each electrode by taking the power of the voltage after filtering from 125 to 5000 Hz (third-order bandpass Butterworth causal filtering forward in time). HLFP is believed to contain substantial power from action potentials (Waldert et al., 2013); we view this feature as capturing spiking ‘hash’, that is multiunit activity local to the electrode with contributions from smaller-amplitude and more distant action potentials than TCs. Our previous study found that this signal is highly informative about hand movement intentions and is useful for real-time BCI applications (Pandarinath et al., 2017). This feature has some similarities to the ‘high gamma’ activity examined by ECoG studies; the definition of high gamma varies in exact frequency from study to study, but generally has a lower cutoff between 65 and 85 Hz and an upper cutoff between 125 and 250 Hz (Bouchard et al., 2013; Chartier et al., 2018; Cheung et al., 2016; Dichter et al., 2018; Martin et al., 2014; Mugler et al., 2014; Ramsey et al., 2018). However, the intracortical HLFP in this study should not be viewed as being the exact same as ECoG high gamma activity due to differences in electrode location, electrode geometry, and HLFP’s higher frequency range.
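
A minimal sketch of this HLFP feature extraction is shown below; the variable names and the 100 ms bin bookkeeping are illustrative, and the exact filter-order convention used in the original analysis code may differ.

% Minimal sketch. rawV: raw 30 ksps voltage trace for one electrode.
fs = 30000;
[b, a] = butter(3, [125 5000] / (fs/2), 'bandpass');   % note: butter(n, ..., 'bandpass') returns an order-2n filter
bp = filter(b, a, rawV(:));                            % causal filtering, forward in time only
hlfp = bp.^2;                                          % instantaneous power
binLen = 0.1 * fs;                                     % 100 ms bins, matching the decoding features
nBins = floor(numel(hlfp) / binLen);
hlfpBinned = mean(reshape(hlfp(1:nBins*binLen), binLen, nBins), 1);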

Task-related neural modulation

To quantify which electrodes’ spiking activity changed during speaking (Figure 1B insets, Figure 1—figure supplement 4), we calculated each electrode’s mean firing rate from 0.5 s before to 0.5 s after AO, yielding one datum per electrode, per trial. For each syllable, a rank-sum test was then used to determine whether there was a significant change in the distribution of single-trial firing rates when speaking the syllable compared to the silent condition (p<0.05, Bonferroni corrected for the number of syllables). To identify which electrodes responded to orofacial movements (Figure 2, Figure 2—figure supplement 1) we performed a similar analysis, except that the analysis epoch was from 0.5 s before to 0.5 s after the go cue. This epoch captures strong modulation, as can be seen by the example firing rate plots in Figure 2. We note that firing rate changes preceding the go cue indicate either substantial movement preparation activity, or that the participants were ‘jumping the gun’ and started moving in anticipation of the go cue; either way, this response indicates modulation related to making orofacial movements. In lieu of a silent condition, the movement conditions’ firing rate distributions were compared to that of the ‘stay still’ condition. The same methods were used to quantify which single neurons’ activities changed during speaking or orofacial movements; for this, we analyzed SUA rather than electrodes’ −4.5 × RMS TCs.
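
For concreteness, the per-electrode comparison could be sketched as follows; ratesSyll and ratesSilent are hypothetical variables holding single-trial mean firing rates in the analysis window described above.

% Minimal sketch. ratesSyll{s}: vector of single-trial mean firing rates for
% syllable s on one electrode; ratesSilent: the same for the silent condition.
nSyll = numel(ratesSyll);
alphaCorrected = 0.05 / nSyll;                  % Bonferroni correction across syllables
isModulated = false(nSyll, 1);
for s = 1:nSyll
    p = ranksum(ratesSyll{s}, ratesSilent);     % Wilcoxon rank-sum test vs. silence
    isModulated(s) = p < alphaCorrected;
end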

Neural population modulation

To measure the differences in neural modulation across the recorded population following the audio prompt and following the go cue (‘population modulation’ in Figure 1E, Figure 1—figure supplement 3B), at each time point (aligned to either the audio prompt or the go cue) we quantified the differences between the firing rate vector for a given spoken condition y_speak (for example, the vector of firing rates across the ga syllable trials, where each element of the vector is the firing rate for one electrode) and y_silent, the firing rate vector for the silent condition. Importantly, however, we did not simply use ||y_speak − y_silent||, the Euclidean norm of the vector difference between these two conditions’ trial-averaged firing rates. The problem with that approach is that a vector norm always yields a non-negative value, meaning that if it is used to measure neural activity differences, the metric will be upwardly biased: it will return a positive value instead of 0 even when population firing rates for the two conditions are essentially the same. This is because estimates of firing rates for two sets of trials, even if they are drawn from the same underlying distribution (i.e. from the same behavioral context), will inevitably differ, even just slightly, resulting in a positive vector difference norm. This problem becomes worse when dealing with lower trial counts and low firing rates, and makes it difficult to distinguish weak population modulation from noise.

To avoid this issue and better estimate neural population activity differences, we used a cross-validated variant of the vector difference norm; we will refer to this metric as the ‘neural distance’. For N1 trials from condition 1 (for example, saying ga) and N2 trials from condition 2 (for example, silent trials), we calculate a less biased estimate of the squared vector norm of the difference in the two conditions’ mean firing rates using:

D = \frac{1}{N_1}\,\frac{1}{N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left[ y_{1,i} - y_{2,j} \right]^{\mathsf{T}} \left[ \bar{y}_{1,\{1:N_1\} \setminus i} - \bar{y}_{2,\{1:N_2\} \setminus j} \right] \quad (1)

where each \bar{y} term is the mean firing rate vector over the indicated condition’s trials; this leaves one trial out from each condition when calculating the difference in means. Critically, the dot product is taken between firing rates computed from fully non-overlapping sets of trials, and can be negative. To convert this to a signed distance more analogous to a Euclidean vector norm, we define the final neural distance metric as d = \mathrm{sign}(D)\sqrt{|D|}.

This cross-validated neural distance has units of Hz; much like with a standard Euclidean vector norm, having more electrodes, and these electrodes having larger firing rate differences between the two conditions, will both result in larger overall distances. Unlike a Euclidean vector norm, our population neural distance metric can produce negative values. This is required for the metric to be unbiased and should be interpreted as evidence that the true distance between the two distributions’ population firing rates is near zero. A benefit of allowing negative values is that time-averaging across an epoch of essentially no underlying firing rate differences will give a mean distance close to zero. The derivation of this metric is described in detail in Willett et al. (2019), and a software implementation is available at https://github.com/fwillett/cvVectorStats.
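
A tested implementation is available in the cvVectorStats repository linked above; purely as an illustration of the leave-one-out structure of Equation 1, a direct MATLAB sketch (with hypothetical variable names) is:

% Minimal sketch of Equation 1. Y1: N1 x E matrix of single-trial firing rate
% vectors for condition 1; Y2: N2 x E matrix for condition 2 (E electrodes).
N1 = size(Y1, 1);
N2 = size(Y2, 1);
D = 0;
for i = 1:N1
    for j = 1:N2
        mu1 = mean(Y1(setdiff(1:N1, i), :), 1);   % condition 1 mean, leaving out trial i
        mu2 = mean(Y2(setdiff(1:N2, j), :), 1);   % condition 2 mean, leaving out trial j
        D = D + (Y1(i, :) - Y2(j, :)) * (mu1 - mu2)';
    end
end
D = D / (N1 * N2);
d = sign(D) * sqrt(abs(D));                       % signed neural distance, in Hz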

For statistical testing, we compared the time-average of this neural distance across two comparison epochs: a prompt epoch (0 to 1 s after the audio prompt) and a speaking epoch (0 to 1.75 s after the go cue for T5, 0.5 to 1.75 s after go for T8). We chose a later speaking epoch start for T8 to better match this participant’s delayed speech-related modulation, which could reflect less anticipatory preparation prior to the (predictable) go cue time, and/or the reduced speech-related modulation recorded on T8’s arrays. This resulted in one datum for each epoch per speech condition, for example, 10 (prompt, speech) value pairs, one per syllable. We compared the resulting prompt and speech epoch distributions with a Wilcoxon signed-rank test. The same procedure was used to compare the prompt epoch neural population modulation to a ‘baseline’ epoch consisting of the 1 s leading up to the audio prompt.

When we report the ratio between population modulation during the go epoch and during the prompt epoch, this ratio was computed after taking the mean modulation across all syllables/words for each epoch.

Comparing different phonemes’ neural correlates

To generate Figure 1—figure supplement 5, we first manually segmented each word spoken in the T5-phonemes dataset into its constituent phonemes using the Praat software package (Boersma and Weenink, 2019). This resulted in 3892 total phonemes. The number of occurrences across the 41 unique phonemes ranged between 14 (/ɔ/) and 239 (/t/), with a median of 80 occurrences. For each unique phoneme, we isolated a 150 ms window of TCs centered around the onset of each instance of that phoneme. This produced an (# instances) × electrodes firing rate matrix for each phoneme. We used these data matrices to calculate the neural population activity difference between all pairs of phonemes using the cross-validated neural distance metric described in the ‘Neural population modulation’ section. This resulted in the matrix of phoneme pair neural distances in Figure 1—figure supplement 5A. Within-phoneme neural distances (the diagonal elements of the distance matrix) were calculated by comparing half of the instances of a given phoneme with the other half; the distances shown are the mean distances across 20 such random splits of each phoneme.

To relate these neural distances to known differences in the speech articulator movements required to produce the phonemes, we grouped phonemes by their place of articulation as in Moses et al. (2019). We then compared within-group neural distances to between-groups neural distances (Figure 1—figure supplement 5B). Every pair of phonemes in the Figure 1—figure supplement 5A neural distance matrix contributes one datum to either the red distribution in that figure’s panel B (if the two phonemes are in the same articulatory grouping) or to the black distribution (if the two phonemes are in different groups). The exception to this is that the three phonemes that are sole members of their own lonely groups were not included in this analysis. The summary statistic of this comparison was the difference between the mean of within-group neural distances and the mean of between-groups neural distances. This statistic was compared against a null distribution built by taking the same summary metric after shuffling neural distance matrix rows and columns, repeated 10,000 times. This null distribution assumes that the phonemes are grouped arbitrarily (but with the same number and sizes of groups), and not according to place of articulation. Comparing the true within-group versus between-groups difference to this null distribution (Figure 1—figure supplement 5C) provides a p-value for rejecting the null hypothesis that phoneme neural distances are no more correlated with articulatory grouping than expected by chance.

The dendrogram shown in Figure 1—figure supplement 5D was generated by applying the widely used ‘unweighted pair group method with arithmetic mean’ (UPGMA) hierarchical clustering algorithm (Sokal and Michener, 1958) to the phoneme neural distance matrix.
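
For reference, given the symmetric matrix of pairwise phoneme neural distances, the clustering step could be sketched in MATLAB as follows; the zeroing of the diagonal and the flooring of any small negative distances are practical assumptions of this sketch, not steps described above.

% Minimal sketch. distMat: symmetric nPhonemes x nPhonemes neural distance matrix.
D = (distMat + distMat') / 2;                        % enforce exact symmetry
D(1:size(D,1)+1:end) = 0;                            % squareform requires a zero diagonal
D = max(D, 0);                                       % floor small negative (unbiased-estimator) values
Z = linkage(squareform(D, 'tovector'), 'average');   % 'average' linkage = UPGMA
dendrogram(Z, 0);                                    % plot the full dendrogram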

Breath-related neural modulation

To generate breath-triggered firing rates (Figure 2—figure supplement 2), we first identified breath peak times from the breath belt stretch transducer measurements. The belt signals were pre-processed by removing rare outlier values (>50 μV difference between consecutive samples) and then low-pass filtering (3 Hz pass-band) the signal both forwards and backwards in time to avoid introducing a phase shift. An example of this filtered signal is shown in Figure 2—figure supplement 2A. Breath peaks were then found using the MATLAB findpeaks function, with key parameters of MinPeakDistance = 1 s, and MinPeakProminence = 0.3⋅B, where B is the median of all peak prominences found by first running findpeaks using MinPeakDistance = 5 s (in other words, we required a peak to be at least 30% of the prominence of the ‘big’ peaks in the data).
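
A minimal sketch of this peak-finding procedure is shown below; the variable names are hypothetical, the low-pass filter order is an assumption, and the belt signal is assumed to have been downsampled from the 30 ksps recording before filtering.

% Minimal sketch. beltRaw: outlier-cleaned breath belt signal, assumed downsampled
% (e.g. to 1 kHz) from the 30 ksps recording; fs: its sampling rate in Hz.
[b, a] = butter(4, 3 / (fs/2), 'low');        % 3 Hz low-pass; filter order here is an assumption
beltFilt = filtfilt(b, a, beltRaw(:));        % zero-phase (forward-backward) filtering
[~, ~, ~, bigProm] = findpeaks(beltFilt, fs, 'MinPeakDistance', 5);   % prominences of 'big' peaks
B = median(bigProm);
[~, breathPeakTimes] = findpeaks(beltFilt, fs, ...
    'MinPeakDistance', 1, 'MinPeakProminence', 0.3 * B);              % breath peak times, in seconds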

Breath peak-aligned firing rates were calculated by treating each identified breath peak as one trial, and trial averaging across neural snippets aligned to each breath peak time. Each TCs’ or SUA’s breath-related modulation depth was defined as the maximum – minimum firing rate observed in the interval from 2 s before the breath peak to 1.5 s after the breath peak. To calculate whether a given modulation depth was statistically significant, we used a shuffle control in which we compared the true data’s modulation depth to the distribution of modulation depths observed over 1001 random shuffles in which faux peak breath times were uniformly drawn from the data duration. For comparing breath-related and speaking-related modulation depths (Figure 2—figure supplement 2F), we defined a given electrode’s speech modulation depth in the T5-syllables dataset as its maximum – minimum firing rate from 2.5 s before acoustic onset to 1 s after acoustic onset.

Arm and hand versus orofacial and speaking movements comparisons

The neural ensemble modulation comparisons presented in Figure 2—figure supplement 3 were calculated as follows: mean TCs firing rates for each T5-comparisons dataset instructed movement condition were calculated for each electrode from 200 to 600 ms after the go cue. The resulting firing rate vectors were compared to firing rate vectors similarly constructed from the ‘do nothing’ condition. Modulation was calculated by taking the unbiased neural distance between these firing rate vectors as described above in the ‘Neural population modulation’ section.

Single-trial low-dimensional neural state projections

To visualize single-trial, high-dimensional neural data (Figure 3A), we used t-distributed stochastic neighbor embedding (t-SNE), a dimensionality reduction technique which seeks to represent high-dimensional vectors (such as our time-varying, multielectrode neural data) in a low-dimensional space (such as a 2D plot that can be easily visualized). The t-SNE algorithm finds a nonlinear mapping such that similar high-dimensional feature vectors end up close together in the low-dimensional view, while dissimilar vectors end up far apart (Van Der Maaten and Hinton, 2008). A neural feature vector was constructed for each trial as follows: for each functioning electrode, spike rates and HLFP power were calculated in ten 100 ms bins that spanned from 0.5 s before to 0.5 s after AO. These features were concatenated into a vector; for example, for the T5-syllables dataset, a single trial’s neural data were represented as a 104 electrodes × 2 features per electrode × 10 time bins = 2080-dimensional vector. All trials’ feature vectors were then projected into a 2D space using the tsne function in MATLAB R2017b’s Statistics and Machine Learning Toolbox with NumDimensions = 2; Perplexity = 15 (this is the number of local neighbors examined for each datum); Algorithm = exact (suitable for our relatively small dataset); and Standardize = true (this z-scores the input data, which was desirable due to the variability between different electrodes and the vastly different scales between spike rates and HLFP power). All other algorithm parameters were set to their defaults. Figure 3A does not have axis labels because t-SNE does not return meaningful axes or units; only the relative distances between points have meaning.
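
The projection step itself reduces to a single call; a minimal sketch with hypothetical variable names (X holding the per-trial feature vectors and trialLabels the spoken conditions, used only for coloring) is:

% Minimal sketch. X: nTrials x 2080 matrix of concatenated spike rate and HLFP features.
rng('default');                                 % t-SNE is stochastic; fix the seed for reproducibility
Y2d = tsne(X, 'NumDimensions', 2, 'Perplexity', 15, ...
    'Algorithm', 'exact', 'Standardize', true);
gscatter(Y2d(:,1), Y2d(:,2), trialLabels);      % one point per trial, colored by condition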

Speech decoding

We evaluated how well the identity of the syllable or word being spoken could be decoded from neural data by classifying single trial neural data. Neural feature vectors were constructed for each trial as described above. These vectors were then associated with a class label, which was the sound being spoken (i.e. word, syllable, or silence). We trained support vector machines (SVMs), a standard classification tool, to predict the class label from a ‘new’ neural feature vector which the classifier had not been trained on. Prediction accuracies were cross-validated using a leave-one-trial-out paradigm in which the classifier was trained on all trials except the trial being classified, and this was repeated for all trials in a dataset. Multiclass classification was achieved using the error-correcting output code (ECOC) technique, which trains multiple binary SVMs between all pairs of labels, that is a one-versus-one coding design. When classifying new input data, the ECOC technique picks the class that minimizes the sum of losses over the set of binary SVM classifiers. Specifically, we used MATLAB R2017b’s implementation: a multiclass model object was fit (fitcecoc) using the SVM template (templateSVM). Key parameters were to use a linear kernel; OutlierFraction = 0.05 (expecting 5% of data points to be outliers); and Standardize = true (which z-scores the neural features based on the training data). All other parameters were set to their default values. We note that we did not heavily optimize our classification method; rather, our goal here was to use a standard tool to gauge the classification performance that these intracortical neural signals support. More sophisticated machine learning techniques (e.g. Angrick et al., 2019; Livezey et al., 2019) are likely to provide additional improvements.
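
As an illustration, the leave-one-trial-out classification loop could be sketched as follows; the variable names are hypothetical, and the parameter choices simply follow the description above.

% Minimal sketch. X: nTrials x nFeatures neural feature matrix;
% labels: nTrials x 1 categorical vector of spoken words/syllables/silence.
svmTemplate = templateSVM('KernelFunction', 'linear', ...
    'OutlierFraction', 0.05, 'Standardize', true);
predicted = labels;                             % preallocate with the correct type
for i = 1:numel(labels)
    trainMask = true(numel(labels), 1);
    trainMask(i) = false;                       % hold out trial i
    mdl = fitcecoc(X(trainMask, :), labels(trainMask), 'Learners', svmTemplate);
    predicted(i) = predict(mdl, X(i, :));       % classify the held-out trial
end
accuracy = mean(predicted == labels);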

To measure chance prediction performance, we used a shuffle test in which we randomly permuted the class labels associated with all trials’ neural data. The same classifier training and leave-one-out prediction process was then repeated on these shuffled data 101 times.

Neural population dynamics

An underlying motivation for the neural population dynamics analyses described in the next several sections is the idea that the activity of many thousands or millions of neurons in a circuit (of which we can only measure on the order of 100 neurons in humans with current technology) can be summarized by the time-varying activity of a handful of latent ‘components’. In this framing, individual neurons’ firing rates reflect various mixtures of these underlying components; in all the analyses we used, this mapping from components to firing rates is assumed to be linear. These components are not meant as discrete physical ‘things’ in the brain, but rather are mathematical abstractions which capture meaningful patterns in the activities of networks of neurons. They are useful insofar as they can help generate hypotheses about the computations neural populations are performing by describing their prominent activity patterns. To this end, not only can latent components succinctly describe the ‘neural state’ (i.e. the firing rate of all neurons at a given moment in time), but furthermore, the time evolution of these components is often more conducive to interpretation and understanding than more complex descriptions of all the individual neurons’ firing rates.

Here, we built on previous studies showing that these components’ changes over time can be effectively modeled as a lawful time-varying oscillatory dynamical system (Churchland et al., 2012; Pandarinath et al., 2015), and that they reveal a simple population-level pattern in which there is a stereotyped response at the initiation of many different movements (Kaufman et al., 2016). This ‘dynamical system’ framework is extensively reviewed in Shenoy et al. (2013) as well as in the two key studies that inspire the neural population dynamics analyses of the present study (Churchland et al., 2012; Kaufman et al., 2016). We looked for the aforementioned dynamical motifs using two different dimensionality reduction techniques that were specifically designed to reveal the presence (or absence) of these population dynamics features.

For these analyses, we primarily examined the prompted word speaking task datasets because this was a more naturalistic behavior than the prompted syllables speaking task. Participants reported that it was more difficult to discriminate syllables than words, and that speaking stand-alone syllables felt somewhat awkward, whereas saying words was easy. Consequently, a practical benefit of the words task over the syllables task is that behavior was more stereotyped across trials, which facilitates precise trial-averaging, and there were very few mis-heard or mis-spoken words. Results for the same analyses applied to the syllables task data are shown in Figure 4—figure supplement 1.

Both of these neural population state analyses were performed on TCs, which contained more information about the neural population state than the more limited number of recorded SUA. All electrodes with TCs firing rates greater than 1 Hz were included. The Churchland-Cunningham and Kaufman studies analyzed a combination of both SUA from single-electrode recordings and TCs from multielectrode recordings, depending on the dataset, while Pandarinath et al. (2015) also analyzed just TCs. To avoid cumbersome switching of terms when describing our methods and comparing them to those of these previous studies, we will use the generic term ‘unit’ to refer to a single channel of neural information, whether it be SUA or TCs.

Condition-invariant signal

The first population dynamics motif we tested for was a specific form of population-level structure at the initiation of movement: a large condition-invariant signal, previously described in Kaufman et al. (2016). We closely followed Kaufman and colleagues’ analysis methods, adapting them as necessary for these human speaking datasets. As in Kaufman et al. (2016), spike trains were trial-averaged within a behavioral condition (in our case, speaking one of the 10 different words), smoothed with a 28 ms s.d. Gaussian, and ‘soft normalized’ with a 5 Hz offset. Normalization means that each unit’s firing rate was normalized by its range across all times and conditions. This prevents units with very high firing rates from dominating the estimate of neural population state (Pandarinath et al., 2018). The ‘soft’ refers to adding an offset (5 Hz in these analyses) to the denominator to reduce the influence of units with very small modulation. Trial-averaged firing rates were calculated from a speech initiation epoch of 200 ms before the go cue to 400 ms after the go cue for T5, and 100 ms to 700 ms after the go cue for T8. T8’s epoch was shifted later relative to T5’s to account for T8’s later neural population activity divergence from the silent condition (Figure 1—figure supplement 3B). This yields an N × C × T data tensor, where N is the number of units, C is the number of word conditions (10), and T is the number of time samples (600, using 1 ms sliding bins).
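
As a concrete illustration of the soft normalization step, a minimal MATLAB sketch (with R as a hypothetical name for the trial-averaged, smoothed firing rate tensor described above) is:

% Minimal sketch. R: N x C x T tensor of trial-averaged, Gaussian-smoothed
% firing rates (units x conditions x time); softOffset = 5 Hz for this analysis.
softOffset = 5;
flatR = reshape(R, size(R, 1), []);                     % each unit's rates across all conditions and times
unitRange = max(flatR, [], 2) - min(flatR, [], 2);      % each unit's firing rate range
Rnorm = bsxfun(@rdivide, R, unitRange + softOffset);    % soft-normalized tensor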

We used demixed principal components analysis (dPCA), a dimensionality-reduction technique developed by Kobak et al. (2016), to look for condition-invariant activity patterns in these high-dimensional neural recordings. This dimensionality reduction method is conceptually similar to PCA, in that it finds a specified number of dPC ‘components’ that can be thought of as ‘building blocks’ from which the responses of individual units can be composed. As with PCA, dPCA attempts to compress the data by identifying dimensions that capture a large fraction of the variance. This takes advantage of the fact that unless the responses of neurons are all independent from one another (which in practice is not the case), then most of the variance of the full population response can be accurately reconstructed as a weighted sum of a smaller number of dPC components. Where dPCA differs from PCA is that it can explicitly attempt to find components that marginalize variance attributable to different parameters of the experiment (such as time or task variables). This is possible because dPCA is a supervised method that trades off finding dimensions that maximize variance in favor of finding dimensions that partition the variance based on labeled properties of the data.

In our case, this ‘demixing’ was attempted between: 1) condition and condition + time interactions, which together form the condition-dependent (CD) components of the neural population activity; and 2) time only, which forms condition-invariant (CI) components. In other words, dPCA sought a set of components of the population activity for which the time-varying neural responses during producing different words look the same, and also for another set of components which vary across speaking conditions (i.e. are ‘tuned’ for what word is being spoken). Importantly, such variance marginalization (i.e. demixing the parameters) may not be achievable; it depends on the structure of the data itself. Each component that dPCA returns is associated both with how much overall neural variance it captures (the lengths of the bars in Figure 4A), and how much of this variance is CI or CD (red and blue fraction of each bar, respectively). Thus, the success of this demixing can be examined based on how purely CI or CD each component is. This in turn reveals whether there exists a large and almost completely condition-invariant component of the population neural activity.

Kaufman and colleagues used an earlier version of the dPCA method and code package, called ‘dPCA-2011’ (Brendel et al., 2011). We used the MATLAB implementation of ‘dPCA-2015’ (Kobak et al., 2016), downloaded from https://github.com/machenslab/dPCA. This is an updated, improved, and widely adopted version of the technique which was not yet available at the time when the Kaufman et al. (2016) analyses were performed. We specified that dPCA should return eight total components, which was fewer than the 10 to 12 used in Kaufman et al. (2016). This reflects the reduced complexity of our datasets, in the sense that they had fewer conditions (10 versus 27–108) and fewer units (96–106 versus 116–213). We also repeated the analyses using 2 to 12 dPCs and observed very similar results. Default dpca function parameters were used, with parameters numRep = 10 (repetitions for regularization cross-validation) and simultaneous = true (indicating that the single-trial neural data were simultaneously recorded across electrodes) for the dpca_optimizeLambda and dpca_getNoiseCovariance functions.
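
A hedged sketch of how the functions named above might be invoked on the firing rate tensor is given below; the call pattern follows that package’s demo script, the argument names should be checked against the package documentation, and Rnorm, Rtrial, and numTrials are hypothetical variable names (the soft-normalized trial-averaged tensor, a single-trial tensor, and per-condition trial counts, respectively).

% Hedged sketch of calling the dPCA-2015 MATLAB package (github.com/machenslab/dPCA).
% Rnorm: N x C x T trial-averaged tensor; Rtrial: N x C x T x maxTrials single-trial
% tensor; numTrials: N x C matrix of trial counts (all hypothetical names).
combinedParams = {{1, [1 2]}, {2}};    % {condition, condition x time} = CD; {time} = CI
optimalLambda = dpca_optimizeLambda(Rnorm, Rtrial, numTrials, ...
    'combinedParams', combinedParams, 'simultaneous', true, 'numRep', 10);
Cnoise = dpca_getNoiseCovariance(Rnorm, Rtrial, numTrials, 'simultaneous', true);
[W, V, whichMarg] = dpca(Rnorm, 8, 'combinedParams', combinedParams, ...
    'lambda', optimalLambda, 'Cnoise', Cnoise);        % 8 total dPCs, as described above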

Unlike the dPCA-2011 used by Kaufman et al. (2016), dPCA-2015 does not enforce that the neural dimensions found for capturing variance attributable to different parameters (here, the CI and CD components) be orthogonal. For example, while the first three (largely CI) components for T5 in Figure 4A are orthogonal by construction (as are the five largely CD components), these CI and CD components need not be orthogonal. We quantified the angles between the demixed principal axes (the dPCA encoder dimensions), and the (related but distinct) degree of correlation between the resulting dPCA components, using the methods described in Kobak et al. (2016) and implemented in the dPCA code package. Unlike Kobak et al. (2016), we used a p-value threshold of 0.01 rather than 0.001 for the Kendall rank correlation coefficient test between each pair of dimensions’ electrode weighting vectors. This means that we were more conservative in the sense that we were more likely to flag neural dimensions as non-orthogonal. For measuring the angle between the CIS1 dimension and the first jPC plane (Figure 4—figure supplement 1E), we used the subspacea package for MATLAB, downloaded from https://www.mathworks.com/matlabcentral/fileexchange/55-subspacea-m (Knyazev and Argentati, 2002). To test whether the CIS1 was significantly non-orthogonal to each of the jPCA dimensions individually, we used the same Kendall rank correlation test as described above.

Rotatory dynamics

The second form of neural population structure we tested for was rotatory (i.e. oscillatory) low-dimensional dynamics. We applied methods previously developed to identify and quantify rotatory dynamics in motor cortex during NHP arm reaching (Churchland et al., 2012). These methods were also recently applied to show rotatory dynamics during hand movements of BrainGate2 study participants (Pandarinath et al., 2015). Churchland, Cunningham and colleagues introduced the jPCA dimensionality reduction technique for this purpose; we employed their MATLAB analysis package, downloaded from https://churchland.zuckermaninstitute.columbia.edu/content/code.

Trial-averaged firing rates for each word speaking condition were generated from 150 ms before to 100 ms after acoustic onset to capture an epoch during which the articulator movements that produce speech were underway. Following Churchland et al. (2012) and Pandarinath et al. (2015), these firing rates were soft-normalized with a 10 Hz offset and smoothed with a Gaussian kernel; we used a 30 ms s.d. kernel as in Pandarinath et al. (2015). These firing rates were ‘centered’ by subtracting the across-condition mean firing rate of each unit at each time point, and then sampled every 10 ms. The dimensionality of these data was reduced via PCA to six; this ensured that rotatory dynamics would be sought within population activity components that were strongly present in the data. jPCA was then used to find planes with rotatory structure within this six-dimensional subspace. The jPCs are found by fitting the following linear dynamical system:

\dot{x} = M_{\mathrm{skew}} \, x \quad (2)

where x is the neural state (i.e. the PCA dimensionality-reduced population firing rate) at a given time, ẋ is its time derivative, and Mskew is constrained to be a skew-symmetric matrix. The first jPCA plane, which has the strongest rotatory dynamics, is defined by the two complex eigenvectors of Mskew with the largest eigenvalues. The choice of real vectors jPC1 and jPC2 within this plane is arbitrary; following convention, they were chosen such that conditions’ activities are maximally spread along jPC1 at the start of the analysis epoch. Figure 5A plots the trial-averaged population activity while speaking each word (after subtracting the across-conditions mean) in this top jPCA plane. The red/black/green color of each word condition’s neural trajectory corresponds to its projection along jPC1 at the start of the epoch; this display style is intended to assist in observing that amplitude and phase tend to unfold lawfully from the initial neural state. It is worth emphasizing that each jPC is simply a linear weighting of different units’ firing rates, and that the six jPCs form an orthonormal basis set that spans the same subspace as the top six PCs. The strength of rotatory dynamics was quantified as the goodness of fit for Equation 2 for a 2 × 2 Mskew in the first jPCA plane, and for a 6 × 6 Mskew in the 6-dimensional subspace defined by the top 6 PCs of the data. Figure 5B reports this 6D fit quality.
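
The jPCA code package linked above performs this fit and the subsequent plane extraction; purely as an illustration of fitting Equation 2, a least-squares sketch over the free parameters of a skew-symmetric matrix (with hypothetical variable names X and dX, and one standard goodness-of-fit convention that may differ from the package’s own) is:

% Minimal sketch of fitting Equation 2 by least squares. X: T x k matrix of
% PCA-reduced, centered neural states (k = 6); dX: T x k matrix of their time derivatives.
k = size(X, 2);
pairs = nchoosek(1:k, 2);                 % the k(k-1)/2 free parameters of a skew-symmetric matrix
A = zeros(numel(dX), size(pairs, 1));
for n = 1:size(pairs, 1)
    B = zeros(k);
    B(pairs(n,1), pairs(n,2)) = 1;        % skew-symmetric basis matrix for this parameter
    B(pairs(n,2), pairs(n,1)) = -1;
    A(:, n) = reshape(X * B', [], 1);     % this parameter's contribution to the predicted dX
end
c = A \ dX(:);                            % least-squares coefficients
Mskew = zeros(k);
for n = 1:size(pairs, 1)
    Mskew(pairs(n,1), pairs(n,2)) = c(n);
    Mskew(pairs(n,2), pairs(n,1)) = -c(n);
end
R2 = 1 - sum((dX(:) - A*c).^2) / sum((dX(:) - mean(dX(:))).^2);   % goodness of fit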

Statistical testing of rotatory dynamics

To calculate the statistical significance of rotatory population dynamics structure in our data, we applied the ‘neural population control’ approach developed by Elsayed and Cunningham (Elsayed and Cunningham, 2017). This method was developed to address a potential concern that many specific phenomena that an experimenter could test for (such as fitting low-dimensional rotatory dynamics to neural data) can be found ‘by chance’ in a sufficiently high-dimensional, complex dataset such as the time-varying firing rates of many neurons. To address this, the method tests whether an observed feature of the population activity is ‘novel’ in the sense that it cannot be trivially predicted from known simpler features in the data. This is achieved by constructing surrogate datasets with simple population structure (in the form of means and correlations across time, neurons, and behavioral conditions) matched to the real data. If the neural recordings contain population-level structure that is coordinated above and beyond these first- and second-order features, then the quantification method used to describe this structure should return a stronger read-out when applied to the original dataset than to the surrogate datasets.

In our case, we used this approach to test whether it is ‘surprising’ to see rotatory dynamics in neural population data, given the particular smoothness across time, units, and word speaking conditions present in these data. A similar approach was used in Elsayed and Cunningham (2017) to further validate the original rotatory dynamics finding of Churchland et al. (2012). We used the MATLAB code associated with Elsayed and Cunningham (2017) from https://github.com/gamaleldin/TME to generate 1000 surrogate datasets with time, neuron, and condition means and covariance matched to the real data using the tensor maximum entropy algorithm (‘surrogate-TNC’ flag in fitMaxEntropy). We then ran the same jPCA analyses described above on these surrogate datasets and recorded the rotation dynamics goodness of fit for the best Mskew matrix found for each surrogate dataset. This distribution of surrogate dataset R2 values serves as a null distribution for significance testing: we calculated a p-value by counting how many of the surrogate datasets’ R2 exceeded that of the true original dataset.

Neural state trajectory videos

The goal of Videos 2 and 3 is to visualize how participant T5’s neural population activity undergoes a condition-invariant ‘kick’ after the go cue (Figure 4) followed by rotatory dynamics around acoustic onset (Figure 5). To do so, we projected the ensemble neural activity while speaking short words into a lower-dimensional neural state space designed to capture both the prominent condition-invariant component (hence, CIS1 is one of the three projection dimensions) and rotatory dynamics (hence, the remaining two dimensions are the top jPCA plane). Plotting the word conditions’ neural state trajectories in the same state space required harmonizing the slightly different pre-processing used in the dPCA (Figure 4) and jPCA (Figure 5) analyses. Specifically, the trial-averaged neural trajectories in these videos were generated using the 30 ms s.d. Gaussian smoothing and 5 Hz soft-normalization parameters from the dPCA analysis. The CIS1 dimension was found by applying dPCA to the same time epoch as in Figure 4 (200 ms before go to 400 ms after go), and the jPC1 and jPC2 dimensions were found by applying jPCA to the same time epoch as in Figure 5 (150 ms before to 100 ms after acoustic onset).

To facilitate viewing the neural state trajectories in three (orthogonal) dimensions consisting of [CIS1, jPC1, jPC2], for these videos only we enforced that jPC1 and jPC2 be orthogonal to CIS1 (empirically, without this constraint the top jPCA plane was 75° from the CIS1 dimension, as shown in Figure 4—figure supplement 1E). To do so, prior to running jPCA, the trial-averaged firing rates were projected into the null space of CIS1 (the orthogonal complement of the first column of the encoder matrix returned by dPCA). That is, instead of jPCA operating on the E = 96 electrodes’ firing rates, it operated on a 96 − 1 = 95-dimensional projection of the firing rates. The overall consequence of these decisions is that in these videos, the neural state is projected onto the exact same CIS1 dimension as in Figure 4, whereas the jPC1 and jPC2 dimensions differ slightly from Figure 5 due to the aforementioned spike train pre-processing differences and CIS1 orthogonalization.
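A sketch of this orthogonalization step is shown below, assuming `rates` holds the trial-averaged firing rates (time × 96 electrodes) and `cis1` is the first column of the dPCA encoder matrix; both names are illustrative rather than variables from the released analysis code.

```python
import numpy as np

def remove_cis1_component(rates, cis1):
    """Project trial-averaged rates into the null space of the CIS1 encoder axis.

    rates : (T, E) array, T time samples by E = 96 electrodes
    cis1  : (E,) first column of the dPCA encoder matrix
    Returns a (T, E-1) array of coordinates in an orthonormal basis for the
    orthogonal complement of CIS1, suitable as input to jPCA so that the
    resulting plane is orthogonal to the CIS1 dimension.
    """
    v = cis1 / np.linalg.norm(cis1)
    # Rows 2..E of Vt from the SVD of v form an orthonormal basis for its
    # orthogonal complement.
    _, _, Vt = np.linalg.svd(v[None, :])
    null_basis = Vt[1:].T                # shape (E, E-1)
    return rates @ null_basis
```

Running jPCA on the returned (E − 1)-dimensional coordinates guarantees that jPC1 and jPC2, once mapped back to electrode space, are orthogonal to CIS1.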

Acknowledgements

We thank participants T5, T8, and their caregivers for their dedicated contributions to this research; Nancy Lam for administrative support; Dr. Sydney Cash and Dr. Laura Ball for helpful discussions; and Dr. Marc Slutzky for helpful discussions and providing the list of many words for sampling many distinct phonemes.

This work was supported by an ALS Association Milton Safenowitz Postdoctoral Fellowship, A. P. Giannini Foundation Postdoctoral Fellowship, Wu Tsai Neurosciences Institute Interdisciplinary Scholar Award, and Burroughs Wellcome Fund Career Award at the Scientific Interface (SDS); NSF Graduate Research Fellowship DGE – 1656518 and Regina Casper Stanford Graduate Fellowship (GHW); Larry and Pamela Garlick, Samuel and Betsy Reeves (KVS, JMH); NIDCD R01DC014034 (JMH); Office of Research and Development, Rehabilitation R and D Service, Department of Veterans Affairs N9288C, A2295R, B6453R, Executive Committee on Research of Massachusetts General Hospital, NIDCD R01DC009899 (LRH); NICHD R01HD077220 (RFK); NINDS 5U01NS098968-02 (LRH); Howard Hughes Medical Institute (KVS). The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Sergey D Stavisky, Email: sergey.stavisky@gmail.com.

Tamar R Makin, University College London, United Kingdom.

Barbara G Shinn-Cunningham, Carnegie Mellon University, United States.

Funding Information

This paper was supported by the following grants:

  • ALS Association Milton Safenowitz Postdoctoral Fellowship 17-PDF-364 to Sergey D Stavisky.

  • A.P. Giannini Foundation Postdoctoral Research Fellowship to Sergey D Stavisky.

  • Wu Tsai Neurosciences Institute Interdisciplinary Scholar Award to Sergey D Stavisky.

  • Burroughs Wellcome Fund Career Award at the Scientific Interface to Sergey D Stavisky.

  • National Science Foundation Graduate Research Fellowships Program DGE - 1656518 to Guy H Wilson.

  • Regina Casper Stanford Graduate Fellowship DGE - 1656518 to Guy H Wilson.

  • Eunice Kennedy Shriver National Institute of Child Health and Human Development R01HD077220 to Robert F Kirsch.

  • U.S. Department of Veterans Affairs Office of Research and Development, Rehabilitation R&D Service N9228C to Leigh R Hochberg.

  • U.S. Department of Veterans Affairs Office of Research and Development, Rehabilitation R&D Service B6453R to Leigh R Hochberg.

  • U.S. Department of Veterans Affairs Office of Research and Development, Rehabilitation R&D Service A2295R to Leigh R Hochberg.

  • U.S. Department of Veterans Affairs Office of Research and Development, Rehabilitation R&D Service N2864C to Leigh R Hochberg.

  • National Institute on Deafness and Other Communication Disorders R01DC009899 to Leigh R Hochberg.

  • Executive Committee on Research of Massachusetts General Hospital to Leigh R Hochberg.

  • National Institute of Neurological Disorders and Stroke 5U01NS098968-02 to Leigh R Hochberg.

  • Larry and Pamela Garlick to Krishna V Shenoy, Jaimie M Henderson.

  • Samuel and Betsy Reeves to Krishna V Shenoy, Jaimie M Henderson.

  • National Institute on Deafness and Other Communication Disorders R01DC014034 to Jaimie M Henderson.

  • Howard Hughes Medical Institute to Krishna V Shenoy.

Additional information

Competing interests

No competing interests declared.

The MGH Translational Research Center has clinical research support agreements with Paradromics and Synchron Med, for which LRH provides consultative input. LRH is also a consultant for Neuralink.

KVS is a consultant for Neuralink Corp and on the scientific advisory boards of CTRL-Labs Inc, MIND-X Inc, Inscopix Inc, and Heal Inc.

JMH is a consultant for Neuralink Corp, Proteus Biomedical and Boston Scientific, and serves on the Medical Advisory Boards of Enspire DBS and Circuit Therapeutics.

Author contributions

Conceptualization, Data curation, Software, Formal analysis, Funding acquisition, Investigation, Visualization, Methodology.

Software, Formal analysis.

Formal analysis, Visualization.

Investigation.

Investigation.

Investigation.

Investigation, Project administration.

Investigation.

Supervision, Funding acquisition, Project administration.

Supervision, Funding acquisition, Project administration.

Supervision, Funding acquisition.

Formal analysis, Supervision.

Conceptualization, Supervision, Funding acquisition, Project administration.

Conceptualization, Supervision, Funding acquisition, Project administration.

Ethics

Clinical trial registration NCT00912041.

Human subjects: The two participants in this study were enrolled in the BrainGate2 Neural Interface System pilot clinical trial (ClinicalTrials.gov Identifier: NCT00912041). The overall purpose of the study is to obtain preliminary safety information and demonstrate proof of principle that an intracortical brain-computer interface can enable people with tetraplegia to communicate and control external devices. Permission for the study was granted by the U.S. Food and Drug Administration under an Investigational Device Exemption (Caution: Investigational device. Limited by federal law to investigational use). The study was also approved by the Institutional Review Boards of Stanford University Medical Center (protocol #20804), Brown University (#0809992560), University Hospitals of Cleveland Medical Center (#04-12-17), Partners HealthCare and Massachusetts General Hospital (#2011P001036), and the Providence VA Medical Center (#2011-009). Both participants gave informed consent to the study and publications resulting from the research, including consent to publish photographs and audiovisual recordings of them.

Additional files

Source data 1. Breathing data.
elife-46015-data1.zip (24.2MB, zip)
Source data 2. Classification data.
elife-46015-data2.zip (13.3MB, zip)
Source data 3. Dynamics data.
elife-46015-data3.zip (22.7MB, zip)
Source data 4. PSTHs sorted units data.
elife-46015-data4.zip (49.7MB, zip)
Source data 5. Syllables PSTHS TCs data.
elife-46015-data5.zip (90.5MB, zip)
Source data 6. Tuning and behavior data.
elife-46015-data6.zip (13.9MB, zip)
Source data 7. Video go aligned data.
elife-46015-data7.zip (73.8MB, zip)
Source data 8. Video and dynamics speak aligned data.
elife-46015-data8.zip (74.8MB, zip)
Source data 9. Words PSTHs TCs data.
elife-46015-data9.zip (93.3MB, zip)
Transparent reporting form

Data availability

The sharing of the raw human neural data is restricted due to the potential sensitivity of these data. These data are available upon request to the senior authors (KVS or JMH). To respect the participants' expectation of privacy, a legal agreement between the requesting researcher's institution and the BrainGate consortium would need to be set up to facilitate the sharing of these datasets. Processed data are provided as source data, and analysis code is available at https://github.com/sstavisk/speech_in_dorsal_motor_cortex_eLife_2019 (copy archived at https://github.com/elifesciences-publications/speech_in_dorsal_motor_cortex_eLife_2019).

References

  1. Ajiboye AB, Willett FR, Young DR, Memberg WD, Murphy BA, Miller JP, Walter BL, Sweet JA, Hoyen HA, Keith MW, Peckham PH, Simeral JD, Donoghue JP, Hochberg LR, Kirsch RF. Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration. The Lancet. 2017;389:1821–1830. doi: 10.1016/S0140-6736(17)30601-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Akbari H, Khalighinejad B, Herrero JL, Mehta AD, Mesgarani N. Towards reconstructing intelligible speech from the human auditory cortex. Scientific Reports. 2019;9:874. doi: 10.1038/s41598-018-37359-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Allen WE, Chen MZ, Pichamoorthy N, Tien RH, Pachitariu M, Luo L, Deisseroth K. Thirst regulates motivated behavior through modulation of brainwide neural population dynamics. Science. 2019;364 doi: 10.1126/science.aav3932. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Angrick M, Herff C, Mugler E, Tate MC, Slutzky MW, Krusienski DJ, Schultz T. Speech synthesis from ECoG using densely connected 3D convolutional neural networks. Journal of Neural Engineering. 2019;16:036019. doi: 10.1088/1741-2552/ab0c59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Anumanchipalli GK, Chartier J, Chang EF. Speech synthesis from neural decoding of spoken sentences. Nature. 2019;568:493–498. doi: 10.1038/s41586-019-1119-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Boersma P, Weenink D. Praat: doing phonetics by computer; 2019.
  7. Bouchard KE, Mesgarani N, Johnson K, Chang EF. Functional organization of human sensorimotor cortex for speech articulation. Nature. 2013;495:327–332. doi: 10.1038/nature11911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Bouchard KE, Chang EF. Control of spoken vowel acoustics and the influence of phonetic context in human speech sensorimotor cortex. Journal of Neuroscience. 2014;34:12662–12677. doi: 10.1523/JNEUROSCI.1219-14.2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Brandman DM, Hosman T, Saab J, Burkhart MC, Shanahan BE, Ciancibello JG, Sarma AA, Milstein DJ, Vargas-Irwin CE, Franco B, Kelemen J, Blabe C, Murphy BA, Young DR, Willett FR, Pandarinath C, Stavisky SD, Kirsch RF, Walter BL, Bolu Ajiboye A, Cash SS, Eskandar EN, Miller JP, Sweet JA, Shenoy KV, Henderson JM, Jarosiewicz B, Harrison MT, Simeral JD, Hochberg LR. Rapid calibration of an intracortical brain-computer interface for people with tetraplegia. Journal of Neural Engineering. 2018;15:026007. doi: 10.1088/1741-2552/aa9ee7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Brendel W, Romo R, Machens CK. Advances in Neural Information Processing Systems. MIT Press; 2011. Demixed Principal Component Analysis; pp. 2654–2662. [Google Scholar]
  11. Breshears JD, Molinaro AM, Chang EF. A probabilistic map of the human ventral sensorimotor cortex using electrical stimulation. Journal of Neurosurgery. 2015;123:340–349. doi: 10.3171/2014.11.JNS14889. [DOI] [PubMed] [Google Scholar]
  12. Breshears JD, Southwell DG, Chang EF. Inhibition of Manual Movements at Speech Arrest Sites in the Posterior Inferior Frontal Lobe. Neurosurgery. 2018;85:23–25. doi: 10.1093/neuros/nyy592. [DOI] [PubMed] [Google Scholar]
  13. Brumberg JS, Wright EJ, Andreasen DS, Guenther FH, Kennedy PR. Classification of intended phoneme production from chronic intracortical microelectrode recordings in speech-motor cortex. Frontiers in Neuroscience. 2011;5:1–12. doi: 10.3389/fnins.2011.00065. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Carmena JM, Lebedev MA, Crist RE, O'Doherty JE, Santucci DM, Dimitrov DF, Patil PG, Henriquez CS, Nicolelis MA. Learning to control a brain-machine interface for reaching and grasping by primates. PLOS Biology. 2003;1:e42. doi: 10.1371/journal.pbio.0000042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Catani M. A little man of some importance. Brain. 2017;140:3055–3061. doi: 10.1093/brain/awx270. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Chan AM, Baker JM, Eskandar E, Schomer D, Ulbert I, Marinkovic K, Cash SS, Halgren E. First-pass selectivity for semantic categories in human anteroventral temporal lobe. Journal of Neuroscience. 2011;31:18119–18129. doi: 10.1523/JNEUROSCI.3122-11.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Chan AM, Dykstra AR, Jayaram V, Leonard MK, Travis KE, Gygi B, Baker JM, Eskandar E, Hochberg LR, Halgren E, Cash SS. Speech-specific tuning of neurons in human superior temporal gyrus. Cerebral Cortex. 2014;24:2679–2693. doi: 10.1093/cercor/bht127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Chartier J, Anumanchipalli GK, Johnson K, Chang EF. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron. 2018;98:1042–1054. doi: 10.1016/j.neuron.2018.04.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Chen PL, Hsu HY, Wang PY. Isolated hand weakness in cortical infarctions. Journal of the Formosan Medical Association. 2006;105:861–865. doi: 10.1016/S0929-6646(09)60276-X. [DOI] [PubMed] [Google Scholar]
  20. Cheung C, Hamiton LS, Johnson K, Chang EF. The auditory representation of speech sounds in human motor cortex. eLife. 2016;5:e12577. doi: 10.7554/eLife.12577. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Churchland MM, Cunningham JP, Kaufman MT, Foster JD, Nuyujukian P, Ryu SI, Shenoy KV. Neural population dynamics during reaching. Nature. 2012;487:51–56. doi: 10.1038/nature11129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Cohen MR, Maunsell JH. Attention improves performance primarily by reducing interneuronal correlations. Nature Neuroscience. 2009;12:1594–1600. doi: 10.1038/nn.2439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Collinger JL, Wodlinger B, Downey JE, Wang W, Tyler-Kabara EC, Weber DJ, McMorland AJC, Velliste M, Boninger ML, Schwartz AB. High-performance neuroprosthetic control by an individual with tetraplegia. The Lancet. 2013;381:557–564. doi: 10.1016/S0140-6736(12)61816-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Conant DF, Bouchard KE, Leonard MK, Chang EF. Human sensorimotor cortex control of directly measured vocal tract movements during vowel production. The Journal of Neuroscience. 2018;38:2955–2966. doi: 10.1523/JNEUROSCI.2382-17.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Creutzfeldt O, Ojemann G, Lettich E. Neuronal activity in the human lateral temporal lobe. Experimental Brain Research. 1989;77:451–475. doi: 10.1007/BF00249600. [DOI] [PubMed] [Google Scholar]
  26. Cunningham JP, Yu BM. Dimensionality reduction for large-scale neural recordings. Nature Neuroscience. 2014;17:1500–1509. doi: 10.1038/nn.3776. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Devlin JT, Watkins KE. Stimulating language: insights from TMS. Brain. 2007;130:610–622. doi: 10.1093/brain/awl331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Dichter BK, Breshears JD, Leonard MK, Chang EF. The control of vocal pitch in human laryngeal motor cortex. Cell. 2018;174:21–31. doi: 10.1016/j.cell.2018.05.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Elsayed GF, Cunningham JP. Structure in neural population recordings: an expected byproduct of simpler phenomena? Nature Neuroscience. 2017;20:1310–1318. doi: 10.1038/nn.4617. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Even-Chen N, Stavisky SD, Pandarinath C, Nuyujukian P, Blabe CH, Hochberg LR, Henderson JM, Shenoy KV. Feasibility of automatic error Detect-and-Undo system in human intracortical Brain–Computer Interfaces. IEEE Transactions on Bio-Medical Engineering. 2018;65:1771–1784. doi: 10.1109/TBME.2017.2776204. [DOI] [PubMed] [Google Scholar]
  31. Farrell DF, Burbank N, Lettich E, Ojemann GA. Individual variation in human motor-sensory (rolandic) cortex. Journal of Clinical Neurophysiology. 2007;24:286–293. doi: 10.1097/WNP.0b013e31803bb59a. [DOI] [PubMed] [Google Scholar]
  32. Flesher SN, Collinger JL, Foldes ST, Weiss JM, Downey JE, Tyler-Kabara EC, Bensmaia SJ, Schwartz AB, Boninger ML, Gaunt RA. Intracortical microstimulation of human somatosensory cortex. Science Translational Medicine. 2016;8:361ra141. doi: 10.1126/scitranslmed.aaf8083. [DOI] [PubMed] [Google Scholar]
  33. Fusi S, Miller EK, Rigotti M. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology. 2016;37:66–74. doi: 10.1016/j.conb.2016.01.010. [DOI] [PubMed] [Google Scholar]
  34. Gallego JA, Perich MG, Miller LE, Solla SA. Neural manifolds for the control of movement. Neuron. 2017;94:978–984. doi: 10.1016/j.neuron.2017.05.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Gentilucci M, Campione GC. Do postures of distal effectors affect the control of actions of other distal effectors? evidence for a system of interactions between hand and mouth. PLOS ONE. 2011;6:e19793. doi: 10.1371/journal.pone.0019793. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Gentilucci M, Stefani E. From gesture to speech. Biolinguistics. 2012;6:338–353. [Google Scholar]
  37. Guenther FH, Brumberg JS, Wright EJ, Nieto-Castanon A, Tourville JA, Panko M, Law R, Siebert SA, Bartels JL, Andreasen DS, Ehirim P, Mao H, Kennedy PR. A wireless brain-machine interface for real-time speech synthesis. PLOS ONE. 2009;4:e8218. doi: 10.1371/journal.pone.0008218. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Guenther FH. Neural Control of Speech Movements. Cambridge MA: The MIT Press; 2016. [Google Scholar]
  39. Herff C, Schultz T. Automatic speech recognition from neural signals: a focused review. Frontiers in Neuroscience. 2016;10:1–7. doi: 10.3389/fnins.2016.00429. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Hochberg LR, Serruya MD, Friehs GM, Mukand JA, Saleh M, Caplan AH, Branner A, Chen D, Penn RD, Donoghue JP. Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature. 2006;442:164–171. doi: 10.1038/nature04970. [DOI] [PubMed] [Google Scholar]
  41. Intveld RW, Dann B, Michaels JA, Scherberger H. Neural coding of intended and executed grasp force in macaque Areas AIP, F5, and M1. Scientific Reports. 2018;8:17985. doi: 10.1038/s41598-018-35488-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Jiang W, Pailla T, Dichter B, Chang EF, Gilja V.  Decoding speech using the timing of neural signal modulation . 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2016. pp. 1532–1535. [DOI] [PubMed] [Google Scholar]
  43. Jun JJ, Steinmetz NA, Siegle JH, Denman DJ, Bauza M, Barbarits B, Lee AK, Anastassiou CA, Andrei A, Aydın Ç, Barbic M, Blanche TJ, Bonin V, Couto J, Dutta B, Gratiy SL, Gutnisky DA, Häusser M, Karsh B, Ledochowitsch P, Lopez CM, Mitelut C, Musa S, Okun M, Pachitariu M, Putzeys J, Rich PD, Rossant C, Sun WL, Svoboda K, Carandini M, Harris KD, Koch C, O'Keefe J, Harris TD. Fully integrated silicon probes for high-density recording of neural activity. Nature. 2017;551:232–236. doi: 10.1038/nature24636. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Kaufman MT, Seely JS, Sussillo D, Ryu SI, Shenoy KV, Churchland MM. The largest response component in the motor cortex reflects movement timing but not movement type. Eneuro. 2016;3:1171–1197. doi: 10.1523/ENEURO.0085-16.2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Kiani R, Cueva CJ, Reppas JB, Newsome WT. Dynamics of neural population responses in prefrontal cortex indicate changes of mind on single trials. Current Biology. 2014;24:1542–1547. doi: 10.1016/j.cub.2014.05.049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Knyazev AV, Argentati ME. Principal Angles between Subspaces in an A -Based Scalar Product: Algorithms and Perturbation Estimates. SIAM Journal on Scientific Computing. 2002;23:2008–2040. doi: 10.1137/S1064827500377332. [DOI] [Google Scholar]
  47. Kobak D, Brendel W, Constantinidis C, Feierstein CE, Kepecs A, Mainen ZF, Qi XL, Romo R, Uchida N, Machens CK. Demixed principal component analysis of neural population data. eLife. 2016;5:e10989. doi: 10.7554/eLife.10989. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Leuthardt EC, Gaona C, Sharma M, Szrama N, Roland J, Freudenberg Z, Solis J, Breshears J, Schalk G. Using the electrocorticographic speech network to control a brain-computer interface in humans. Journal of Neural Engineering. 2011;8:036004. doi: 10.1088/1741-2560/8/3/036004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Lipski WJ, Alhourani A, Pirnia T, Jones PW, Dastolfo-Hromack C, Helou LB, Crammond DJ, Shaiman S, Dickey MW, Holt LL, Turner RS, Fiez JA, Richardson RM. Subthalamic nucleus neurons differentially encode early and late aspects of speech production. The Journal of Neuroscience. 2018;38:5620–5631. doi: 10.1523/JNEUROSCI.3480-17.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Livezey JA, Bouchard KE, Chang EF. Deep learning as a tool for neural data analysis: speech classification and cross-frequency coupling in human sensorimotor cortex. PLOS Computational Biology. 2019;15:e1007091. doi: 10.1371/journal.pcbi.1007091. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Lotte F, Brumberg JS, Brunner P, Gunduz A, Ritaccio AL, Guan C, Schalk G. Electrocorticographic representations of segmental features in continuous speech. Frontiers in Human Neuroscience. 2015;09:1–13. doi: 10.3389/fnhum.2015.00097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Makin TR, Scholz J, Henderson Slater D, Johansen-Berg H, Tracey I. Reassessing cortical reorganization in the primary sensorimotor cortex following arm amputation. Brain. 2015;138:2140–2146. doi: 10.1093/brain/awv161. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Makin JG, Moses DA, Chang EF. Machine translation of cortical activity to text with an encoder-decoder framework. bioRxiv. 2019 doi: 10.1101/708206. [DOI] [PMC free article] [PubMed]
  54. Makin TR, Bensmaia SJ. Stability of sensory topographies in adult cortex. Trends in Cognitive Sciences. 2017;21:195–204. doi: 10.1016/j.tics.2017.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Martin S, Brunner P, Holdgraf C, Heinze HJ, Crone NE, Rieger J, Schalk G, Knight RT, Pasley BN. Decoding spectrotemporal features of overt and covert speech from the human cortex. Frontiers in Neuroengineering. 2014;7:1–15. doi: 10.3389/fneng.2014.00014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Martin S, Brunner P, Iturrate I, Millán JdelR, Schalk G, Knight RT, Pasley BN. Word pair classification during imagined speech using direct brain recordings. Scientific Reports. 2016;6 doi: 10.1038/srep25803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Masse NY, Jarosiewicz B, Simeral JD, Bacher D, Stavisky SD, Cash SS, Oakley EM, Berhanu E, Eskandar E, Friehs G, Hochberg LR, Donoghue JP. Non-causal spike filtering improves decoding of movement intention for intracortical BCIs. Journal of Neuroscience Methods. 2014;236:58–67. doi: 10.1016/j.jneumeth.2014.08.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Maynard EM, Hatsopoulos NG, Ojakangas CL, Acuna BD, Sanes JN, Normann RA, Donoghue JP. Neuronal interactions improve cortical population coding of movement direction. The Journal of Neuroscience. 1999;19:8083–8093. doi: 10.1523/JNEUROSCI.19-18-08083.1999. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Meister IG, Boroojerdi B, Foltys H, Sparing R, Huber W, Töpper R. Motor cortex hand area and speech: implications for the development of language. Neuropsychologia. 2003;41:401–406. doi: 10.1016/S0028-3932(02)00179-3. [DOI] [PubMed] [Google Scholar]
  60. Moses DA, Leonard MK, Makin JG, Chang EF. Real-time decoding of question-and-answer speech dialogue using human cortical activity. Nature Communications. 2019;10:3096. doi: 10.1038/s41467-019-10994-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Mugler EM, Patton JL, Flint RD, Wright ZA, Schuele SU, Rosenow J, Shih JJ, Krusienski DJ, Slutzky MW. Direct classification of all American English phonemes using signals from functional speech motor cortex. Journal of Neural Engineering. 2014;11:035015. doi: 10.1088/1741-2560/11/3/035015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Mugler EM, Tate MC, Livescu K, Templer JW, Goldrick MA, Slutzky MW. Differential representation of articulatory gestures and phonemes in precentral and inferior frontal gyri. The Journal of Neuroscience. 2018;38:9803–9813. doi: 10.1523/JNEUROSCI.1206-18.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Musall S, Kaufman MT, Juavinett AL, Gluf S, Churchland AK. Single-trial neural dynamics are dominated by richly varied movements. Nature Neuroscience. 2019;22:1677–1686. doi: 10.1038/s41593-019-0502-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Oby ER, Perel S, Sadtler PT, Ruff DA, Mischel JL, Montez DF, Cohen MR, Batista AP, Chase SM. Extracellular voltage threshold settings can be tuned for optimal encoding of movement and stimulus parameters. Journal of Neural Engineering. 2016;13:036009. doi: 10.1088/1741-2560/13/3/036009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Pandarinath C, Gilja V, Blabe CH, Nuyujukian P, Sarma AA, Sorice BL, Eskandar EN, Hochberg LR, Henderson JM, Shenoy KV. Neural population dynamics in human motor cortex during movements in people with ALS. eLife. 2015;4:e07436. doi: 10.7554/eLife.07436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Pandarinath C, Nuyujukian P, Blabe CH, Sorice BL, Saab J, Willett FR, Hochberg LR, Shenoy KV, Henderson JM. High performance communication by people with paralysis using an intracortical brain-computer interface. eLife. 2017;6:e18554. doi: 10.7554/eLife.18554. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Pandarinath C, Ames KC, Russo AA, Farshchian A, Miller LE, Dyer EL, Kao JC. Latent factors and dynamics in motor cortex and their application to Brain-Machine interfaces. The Journal of Neuroscience. 2018;38:9390–9401. doi: 10.1523/JNEUROSCI.1669-18.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Pei X, Leuthardt EC, Gaona CM, Brunner P, Wolpaw JR, Schalk G. Spatiotemporal dynamics of electrocorticographic high gamma activity during overt and covert word repetition. NeuroImage. 2011;54:2960–2972. doi: 10.1016/j.neuroimage.2010.10.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Penfield W, Boldrey E. Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation. Brain. 1937;60:389–443. doi: 10.1093/brain/60.4.389. [DOI] [Google Scholar]
  70. Ramsey NF, Salari E, Aarnoutse EJ, Vansteensel MJ, Bleichner MG, Freudenburg ZV. Decoding spoken phonemes from sensorimotor cortex with high-density ECoG grids. NeuroImage. 2018;180:301–311. doi: 10.1016/j.neuroimage.2017.10.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Rizzolatti G, Arbib MA. Language within our grasp. Trends Neurosci. 1998;21:188–194. doi: 10.1016/s0166-2236(98)01260-0. [DOI] [PubMed] [Google Scholar]
  72. Saxena S, Cunningham JP. Towards the neural population doctrine. Current Opinion in Neurobiology. 2019;55:103–111. doi: 10.1016/j.conb.2019.02.002. [DOI] [PubMed] [Google Scholar]
  73. Schieber MH. Constraints on somatotopic organization in the primary motor cortex. Journal of Neurophysiology. 2001;86:2125–2143. doi: 10.1152/jn.2001.86.5.2125. [DOI] [PubMed] [Google Scholar]
  74. Shenoy KV, Sahani M, Churchland MM. Cortical control of arm movements: a dynamical systems perspective. Annual Review of Neuroscience. 2013;36:337–359. doi: 10.1146/annurev-neuro-062111-150509. [DOI] [PubMed] [Google Scholar]
  75. Smith MA, Kohn A. Spatial and temporal scales of neuronal correlation in primary visual cortex. Journal of Neuroscience. 2008;28:12591–12603. doi: 10.1523/JNEUROSCI.2929-08.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 1958;38:1409–1438. [Google Scholar]
  77. Stringer C, Pachitariu M, Steinmetz N, Reddy CB, Carandini M, Harris KD. Spontaneous behaviors drive multidimensional, brainwide activity. Science. 2019;364:eaav7893. doi: 10.1126/science.aav7893. [DOI] [PMC free article] [PubMed] [Google Scholar]
  78. Suresh AK, Goodman JM, Okorokova EV, Kaufman MT, Hatsopoulos NG, Bensmaia SJ. Neural population dynamics in motor cortex are different for reach and grasp. bioRxiv. 2019 doi: 10.1101/667196. [DOI] [PMC free article] [PubMed]
  79. Sussillo D, Churchland MM, Kaufman MT, Shenoy KV. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience. 2015;18:1025–1033. doi: 10.1038/nn.4042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Tankus A, Fried I, Shoham S. Structured neuronal encoding and decoding of human speech features. Nature Communications. 2012;3:1015. doi: 10.1038/ncomms1995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Tankus A, Fried I. Degradation of neuronal encoding of speech in the subthalamic nucleus in Parkinson’s Disease. Neurosurgery. 2018;84 doi: 10.1093/neuros/nyy027. [DOI] [PubMed] [Google Scholar]
  82. Tei H. Monoparesis of the right hand following a localised infarct in the left "precentral knob". Neuroradiology. 1999;41:269–270. doi: 10.1007/s002340050745. [DOI] [PubMed] [Google Scholar]
  83. Trautmann EM, Stavisky SD, Lahiri S, Ames KC, Kaufman MT, O'Shea DJ, Vyas S, Sun X, Ryu SI, Ganguli S, Shenoy KV. Accurate estimation of neural population dynamics without spike sorting. Neuron. 2019;103:292–308. doi: 10.1016/j.neuron.2019.05.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Vainio L, Schulman M, Tiippana K, Vainio M. Effect of syllable articulation on precision and power grip performance. PLOS ONE. 2013;8:e53061. doi: 10.1371/journal.pone.0053061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Van Der Maaten LJP, Hinton GE. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research. 2008;9:2579–2605. [Google Scholar]
  86. Waldert S, Lemon RN, Kraskov A. Influence of spiking activity on cortical local field potentials. The Journal of Physiology. 2013;591:5291–5303. doi: 10.1113/jphysiol.2013.258228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Wesselink DB, van den Heiligenberg FMZ, Ejaz N, Dempsey-Jones H, Cardinali L, Tarall-Jozwiak A, Diedrichsen J, Makin TR. Obtaining and maintaining cortical hand representation as evidenced from acquired and congenital handlessness. eLife. 2019;8:e37227. doi: 10.7554/eLife.37227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Willett FR, Deo DR, Avansino DT, Rezaii P, Hochberg L, Henderson J, Shenoy K. Hand knob area of motor cortex in people with tetraplegia represents the whole body in a modular way. bioRxiv. 2019 doi: 10.1101/659839. [DOI] [PMC free article] [PubMed]
  89. Yang Y, Dickey MW, Fiez J, Murphy B, Mitchell T, Collinger J, Tyler-Kabara E, Boninger M, Wang W. Sensorimotor experience and verb-category mapping in human sensory, motor and parietal neurons. Cortex. 2017;92:304–319. doi: 10.1016/j.cortex.2017.04.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Yousry TA, Schmid UD, Alkadhi H, Schmidt D, Peraud A, Buettner A, Winkler P. Localization of the motor hand area to a knob on the Precentral Gyrus. A new landmark. Brain. 1997;120:141–157. doi: 10.1093/brain/120.1.141. [DOI] [PubMed] [Google Scholar]

Decision letter

Editor: Tamar R Makin1
Reviewed by: Tamar R Makin2, Juan Álvaro Gallego3, Sophie K Scott4

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The paper by Stavisky et al. provides an important innovation: a first account of the population dynamics relating to motor control of a body part within a 'canonical' motor area of another body part. The authors utilise electrode arrays implanted in participants with paralysis to identify mouth and speech-related motor processing in the hand/arm area of motor cortex. They analyse single neurons and neural population activity during speech, and demonstrate selective responses for spoken words, syllables and orofacial movements, resulting in high classification accuracy across words/syllables. The authors further interrogate the population dynamics, previously established for hand and arm movements in this cortical area, in search of common neural dynamics underlying motor control from this area. The paper stood out in its quality and rigour of data analysis and clarity of conceptualisation. As such, the paper offers potential innovation on multiple fronts, from basic principles of brain organisation for motor control to assistive technologies via brain-machine interfaces, and is expected to appeal to a broad audience across multiple sub-fields.

Decision letter after peer review:

Thank you for submitting your article "Neural ensemble dynamics in dorsal motor cortex during speech in people with paralysis" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Tamar R Makin as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Juan Álvaro Gallego (Reviewer #2), and Sophie K Scott (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

The paper by Stavisky et al. utilises electrode arrays implanted in participants with paralysis to identify mouth and speech-related motor processing in the hand/arm area of motor cortex. The authors analyse single neurons and neural population activity during speech, and demonstrate selective responses for spoken words, syllables and orofacial movements, resulting in high classification accuracy across words/syllables. The authors further interrogate the population dynamics, previously established for hand and arm movements in this cortical area, in search of common neural dynamics underlying motor control from this area. While the observation that body-part assignment along the motor 'homunculus' is more broadly distributed than commonly regarded is not in itself new (as highlighted by Penfield in his seminal work), the current study provides an important innovation: a first account of the population dynamics relating to a 'misplaced' body part. Overall, the paper is well-written, the key analyses are sound, and the results highly interesting. However, the reviewers agreed that the characterisation of the neural results in the specific context of speech production could be improved by taking into consideration the wide range of different patterns of motor control, from breath control, laryngeal engagement and control of the articulators. Conceptually, the reviewers felt that further discussion is required in order to place the findings in context, with respect to face-hand overlap, as elaborated below.

Essential revisions:

1) There seems to be a missed opportunity to consider the varying motor demands for the larynx/tongue/mouth/lips across stimuli, or even laryngeal and speech breathing mechanisms (note that metabolic breathing, which the authors have accounted for, is entirely different). In the current analysis, each of the syllables/words is studied in isolation from the others, but in terms of motor control there should be some clear similarities and distinctions across these stimuli, which could also be further linked with the motor demands of the orofacial movements. For example, decoding accuracy might vary depending on similarities for motor control of these various motor mechanisms involved in speech production. This will go a long way toward showing that the findings observed here relate to the motor processing relating to articulation, rather than other forms of information content. More generally, these important considerations relating to the mechanisms of speech production need to be more thoughtfully integrated in the manuscript; the authors might like to consult with an expert for this purpose.

2) Conceptually, there is a need to better consider why facial information exists in the hand area. Is that because of a unique association between the mouth and hand for language? Here the authors might like to consider commonality of gestures, and consider whether this is a semantic or timing-based gestural relationship, or both? Another interesting link to consider is between speaking and reading/writing? Or topographic proximity? Alternatively, could there be nothing special between the hand and the face – there could also exist information in the hand area for feet movements? Related to that, for participant T8 – the electrode arrays are too dorsomedial to be considered as the hand area. So it seems that the results suggest that orofacial/speech-related information is present throughout motor cortex. This brings us back to the question of whether the SCI, and expected E/I balance changes in the deafferented cortex, might play a role in the present findings. The reviewers agreed that the conceptual framework of the study could benefit from further justification/interpretation.

3) The results are often reported in descriptive terms but are not statistically tested, making it difficult to accept some of the characteristics offered by the authors. The reviewers would like to see more quantifications in the paper, including: percentage variance explained as a function of the number of components (for PCA and dPCA), pairwise angles between CI and CD dPCs together with their significance threshold (Kobak et al. proposed a method in their paper), etc. Moreover, couldn't the authors apply these methods to the syllables datasets even if they had fewer trials, they were shorter, and the neural activity was less consistent (they can compensate for this with the speech I think)?

4) While the classification accuracy is impressive, it’s important to dissociate the motor control component from others relating to perception and intention. The authors mention that responses during the audio prompt were small and thus they couldn't disambiguate whether they reflect perception, movement preparation, etc. (subsection “Speech-related activity in dorsal motor cortex”, first paragraph). Based on Video 1, it seems to one reviewer that there's some modulation during the prompts. Is it possible to classify rapid responses in a small window centered around the auditory cue? If decoding accuracy is significantly greater during articulation, it might provide support for the overall interpretation of the findings.

5) Similarly, can the authors explore whether there are any rotation motifs around the prompt? This would help answer the question whether this is an inherent network property of the area, or whether it is specific for movement planning.

6) The neural population analyses look quite different for the two patients: 1) for T8 there's only one CI dPC, and it explains roughly the same amount of variance as the leading CD dPC, whereas for T5 there are two CI dPCs that explain several times more variance than any CD dPC; 2) the rotational structure identified with jPCA is not above chance level for T8, only for T5. We understand that these differences may very well be explained by the worse quality of T8's arrays, but the authors should be more cautious in some parts of the paper given these differences and their n=2, e.g., in the Abstract. Moreover, this difference should be addressed to a greater extent in the Discussion.

7) The authors suggest that the hand area might play a role in speech production. Here they seem to conflate correlation with causation – their findings do not provide any evidence that the decodable information available in the hand area is actually utilised during speech motor control.

eLife. 2019 Dec 10;8:e46015. doi: 10.7554/eLife.46015.sa2

Author response


[…] Overall, the paper is well-written, the key analyses are sound, and the results highly interesting. However, the reviewers agreed that the characterisation of the neural results in the specific context of speech production could be improved by taking into consideration the wide range of different patterns of motor control, from breath control, laryngeal engagement and control of the articulators. Conceptually, the reviewers felt that further discussion is required in order to place the findings in context, with respect to face-hand overlap, as elaborated below.

In response to the reviewers’ feedback, we have added additional data and analyses to better relate this activity to the motoric demands of speech and volitional breath control, and we have expanded the Discussion to better place these findings in context. We have also added additional discussion of interpretation limitations based on the reviewers’ feedback. These changes, as well as many others, are described in more detail in responses to specific comments below.

With regard to the novelty of a broadly distributed homunculus, we appreciate the feedback that this is not entirely new. We were surprised by our results because we are unaware of previous studies showing this extent of distribution (i.e., face activity in this hand knob area), even after taking into account recent work describing fractured somatotopy within major body regions and the partially overlapping distributions of the original Penfield work (which was due to both within-subject and across-subject effects). Thus, on this topic we hope (and believe) that our report of mixed tuning at the level of single neurons is an interesting and novel contribution that provides additional data of relevance to an emerging view of broad motor maps.

Essential revisions:

1) There seems to be a missed opportunity to consider the varying motor demands for the larynx/tongue/mouth/lips across stimuli, or even laryngeal and speech breathing mechanisms (note that metabolic breathing, which the authors have accounted for, is entirely different). In the current analysis, each of the syllables/words is studied in isolation from the others, but in terms of motor control there should be some clear similarities and distinctions across these stimuli, which could also be further linked with the motor demands of the orofacial movements. For example, decoding accuracy might vary depending on similarities for motor control of these various motor mechanisms involved in speech production. This will go a long way toward showing that the findings observed here relate to the motor processing relating to articulation, rather than other forms of information content. More generally, these important considerations relating to the mechanisms of speech production need to be more thoughtfully integrated in the manuscript; the authors might like to consult with an expert for this purpose.

Thank you for your feedback that the manuscript could, and should, do more to relate the observed speech-related neural activity with the underlying motor actions (the movements of the larynx/tongue/mouth/lips, and breathing) required to produce these sounds. We absolutely agree. We describe below (1) new measurements and analyses that we plan to do (but consider to be future research) now that these new cortical responses are known about and initially characterized, and (2) the considerable amount of new analysis work, including newly collected data, that we have done and include in the revised manuscript to directly address these points.

Regarding (1) above, we start out below by first describing – in some detail – for the reviewers and editors our future research roadmap which will invest considerably in pursuing exactly these questions. These are major research endeavors in and of themselves which will span considerable periods of time (likely a year or more, involving new FDA approval and new research infrastructure), and will lead to subsequent full reports. Regarding (2) above, we then turn to a considerable amount of new data analysis that we have done, including the use of newly collected data, that we have integrated into the revised manuscript (including a new figure), to directly address the questions raised. We believe that these analyses and results add considerably to the manuscript and, again, we are grateful for the reviewers’ and editors’ suggestion to more deeply pursue this topic.

(1) We think the best way forward toward answering in full detail the question regarding neural mechanisms of speech production is to collect new data and bring in new technical capabilities so that we can either measure speech articulator movements directly, or infer them from recorded audio (for example using new audio-articulator inversion ‘AAI’ methods like in Chartier et al., 2018, which unfortunately are not yet publicly available). Such data would overcome our current limitations: although we know what syllable/word the participants spoke and what prompted orofacial movements they made, we don’t have moment-by-moment kinematic measurements. Also, our original data sampled relatively few orofacial and speech movements. Without the underlying kinematic measurements and without comprehensive sampling of different combinations of kinematics, it becomes very difficult to attribute measured neural activity to specific articulatory degrees-of-freedom, since even short syllables or prompted movements are coordinated high-dimensional movements of multiple articulators.

We are investing considerably to be able to collect data that overcomes this limitation in the future, with: (A) a major purchase of Electromagnetic Midsagittal Articulography equipment, (B) planning a request for FDA regulatory approval to use these (uncomfortable but safe) techniques with our clinical trial participants, if they agree to do so, and (C) a newly established collaboration. However, doing all of this well is a major undertaking – that will lead to one or more major additional full reports – and that realistically will extend out for over a year. Thus we view this as future work, and outside the scope of the present manuscript. We have added a Discussion paragraph (see below) laying out that we think this is a promising future direction in which to build on the present work. Our hope is that the present work, which provides a first description of this speech/orofacial single neuron activity in this cortical area, will lay the groundwork for this subsequent work.

This is a key way in which we believe we are addressing, as requested, the reviewer’s comment that “these important considerations relating to the mechanisms of speech productions need to be more thoughtfully integrated in the manuscript” in the longer term. Now we turn to (2), which addresses this request more directly by describing new analyses with new data that we have performed and added to the revised manuscript.

(2) We recognize that the manuscript would benefit from additional examination of how the neural correlates of different spoken sounds relate to their motor demands, and we are grateful to the reviewers and editors for suggesting this. To this end, we have performed a key new analysis on newly collected data. We asked participant T5 to speak 420 different words (3 times each) chosen to broadly sample 41 American English phonemes (we unfortunately could not repeat this new data collection in participant T8, who is no longer in the clinical trial). We then hand-segmented each of these words’ audio data into individual phonemes, and compared the neural ensemble correlates of these phonemes. This new result reveals that phonemes’ neural correlates clustered based on phonetic groupings and place of articulation, consistent with this activity being related to the underlying motor demands for producing the phonemes. See Figure 1—figure supplement 5.

We view this new result as consistent with the hypothesis that this neural activity is related to speech articulator movements. We believe that this publication helps motivate future work, by our group and potentially by other groups too, to further pursue the relationship between neural activity and articulatory kinematics.

This new dataset and its task are described in the ‘Many words task’ Materials and methods section; further details of the analysis are described in the ‘Comparing different phonemes’ neural correlates’ Materials and methods section. The added Results section text reads:

“Second, analysis of an additional dataset in which participant T5 spoke 41 different phonemes revealed that neural population activity showed phonemic structure (Figure 1—figure supplement 5): for example, when phonemes were grouped by place of articulation (Bouchard et al., 2013; Lotte et al., 2015; Moses et al., 2019), population firing rate vectors were significantly more similar between phonemes within the same group than between phonemes in different groups (p<0.001, shuffle test).”

Finally, regarding this figure, we would be happy to elevate it to a figure in the main text (instead of a supplementary figure) if the reviewers and/or editors recommend this. While we indeed view this as an important analysis and figure, and are grateful that the reviewers and editors suggested that we pursue this, we also want to be mindful of not making the manuscript too long or over-emphasizing a single participant result. Thus, we pose this question here.

We share your prediction that breathing may contribute a component of our observed speech-related neural activity, and that it will be useful to study the neural correlates of breathing in the context of speech production, which is distinct from the metabolic breathing we studied here. As discussed above, we think that the best way forward on this question is to include breathing amongst many continuous speech articulatory kinematics in a future study, so that each of these movements’ distinct partial correlations with neural activity can be disentangled.

With that said, we are grateful to the reviewers and editors for suggesting that we pursue this a bit further, as we believe that we were able to take an additional step towards understanding breath-related activity. We did so by analyzing data from an additional ‘instructed breathing task’ in which breathing was under the participant’s conscious control. This new behavioral context is now analyzed alongside the previously presented unattended breathing data. The updated Figure 2—figure supplement 2 shows that volitional breathing also modulates hand knob cortex.

Please note that compared to the initial submission’s Figure 2—figure supplement 2, the shuffle distributions (panel C, D horizontal dashed lines) have shifted; this is because a) we caught and fixed a bug in the shuffle ordering code, and b) we changed the significance threshold to 0.01 (from 0.001) to maintain sensitivity after this fix and accommodate the reduced trial count of the new instructed breathing condition. This change does not affect our conclusions: besides moving the panel C, D lines, the net effect is that we now report that 17 (rather than 18) of the single neurons were significantly correlated with breathing. We apologize for this mistake and have carefully checked our code throughout the manuscript.

Below we have copied the updated Discussion paragraph that summarizes our evidence supporting that the observed neural activity is related to motor control, and suggests future work looking at speech kinematics:

“Our data suggest that the observed neural activity reflects movements of the speech articulators (the tongue, lips, jaw, and larynx): modulation was greater during speaking than after hearing the prompt; the same neural population modulated during non-speech orofacial movements; and in T5, the neural correlates of producing different phonemes grouped according to these phonemes’ place of articulation. […] A deeper understanding of how motor cortical spiking activity relates to complex speaking behavior will require future work connecting it to continuous articulatory (Chartier et al., 2018; Conant et al., 2018; Mugler et al., 2018) and respiratory kinematics and, ideally, the underlying muscle activations.”

2) Conceptually, there is a need to better consider why facial information exists in the hand area. Is that because of a unique association between the mouth and hand for language? Here the authors might like to consider commonality of gestures, and consider whether this is a semantic or timing-based gestural relationship, or both? Another interesting link to consider is between speaking and reading/writing? Or topographic proximity? Alternatively, could there be nothing special between the hand and the face – there could also exist information in the hand area for feet movements? Related to that, for participant T8 – the electrode arrays are too dorsomedial to be considered as the hand area. So it seems that the results suggest that orofacial/speech-related information is present throughout motor cortex. This brings us back to the question of whether the SCI, and expected E/I balance changes in the deafferented cortex, might play a role in the present findings. The reviewers agreed that the conceptual framework of the study could benefit from further justification/interpretation.

Thank you, and we agree that the manuscript would benefit from more discussion of why there might be face information in “hand” area of motor cortex. While we originally speculated that this was due to the kinds of hand-mouth linkages enumerated by the reviewers and editors (based on previous studies such as those referenced in our Introduction), new results from our group (currently in the pre-print stage) have made us re-evaluate this interpretation. Inspired by finding face activity in hand knob area, we then tested whether there was modulation during actual and attempted movements of other body parts, including the neck, ipsilateral arm, and legs, exactly as you proposed. We found that indeed there is representation of every body part tested (Willett, et al., bioRxiv 2019). In light of this, our interpretation of these results is that finding speech-related activity in this cortical area is a consequence of motor representations being much more distributed at the single-neuron level than we previously imagined, rather than a “special” hand-face relationship (though we can’t rule that out, and it would be interesting to explicitly examine coordinated hand-face movements in future work). We have updated our Discussion accordingly, and we have also added a new paragraph that explicitly calls out that we are far from resolving what the “purpose” of this speech-related activity in hand knob area is (if any) and that we feel this is an important, though difficult, question for future research:

“There are three main findings from this study. […] Thus, the observed neural overlap between hand and speech articulators may be a consequence of distributed whole-body coding, rather than a privileged speech-manual linkage.”

“Assuming that these results are not due to injury-related remapping, we are left with the question of why this speech-related activity is found in dorsal “arm and hand” motor cortex. […] We anticipate that it will require substantial future work to understand why speech-related activity co-occurs in the same motor cortical area as arm and hand movement activity, but that this line of inquiry may reveal important principles of how sensorimotor control is distributed across the brain (Musall et al., 2019; Stringer et al., 2019).”

It is our hope that one impact of this manuscript will be to motivate further work to understand this (to us, fascinating) phenomenon and to foster a broader appreciation of the complexity of human motor representations.

We have expanded the Discussion paragraph about why we think the presence of speech activity in hand knob cortex is not due to cortical remapping following SCI to incorporate this new whole-body tuning evidence. We have also added new references. This paragraph is reproduced below for convenience:

“An important unanswered question, however, is to what extent these results were potentially influenced by cortical remapping due to tetraplegia. […] While these threads of evidence argue against remapping, definitively resolving this ambiguity would require intracortical recording from this eloquent brain area in able-bodied people.”

Regarding the placement of T8’s arrays: placement was guided anatomically by definitively identifying Yousry’s “hand knob” area, which has distinctive contours on volumetrically obtained MRI images and can be identified with a high degree of certainty (>97%) (described in Yousry, 1997). That said, we recognize that anatomical variability across individuals could raise questions about the accuracy and utility of generalized terms like “hand knob”, despite its adoption by neurosurgeons as a distinct anatomical landmark. As further evidence for correct array placement, the functional properties of this area (strong hand- and arm-related tuning) are also consistent with these arrays being in the same hand area as T5’s. We have added additional details to the Materials and methods to explain how the arrays were targeted:

“Both participants had two 96-electrode Utah arrays (1.5 mm electrode length, Blackrock Microsystems, USA) neurosurgically placed in dorsal ‘hand knob’ area of the left (motor dominant) hemisphere’s motor cortex. Surgical targeting was stereotactically guided based on prior functional and structural imaging (Yousry, 1997), and subsequently confirmed by review of intra-operative photographs.”

3) The results are often reported in descriptive terms but are not statistically tested, making it difficult to accept some of the characterizations offered by the authors. The reviewers would like to see more quantifications in the paper, including: percentage variance explained as a function of the number of components (for PCA and dPCA), pairwise angles between CI and CD dPCs together with their significance threshold (Kobak et al. proposed a method in their paper), etc. Moreover, couldn't the authors apply these methods to the syllables datasets even if they had fewer trials, were shorter, and the neural activity was less consistent (they can compensate for this with the speech I think)?

Thank you for pointing out that our neural population dynamics results and claims will be more strongly supported with additional quantifications and the inclusion of the syllables datasets. We have generated a new Figure 4—figure supplement 1 that provides these additional details, including cumulative variance explained for dPCA and jPCA and pairwise angles between dPCs (including the significance testing from Kobak et al., 2016). These quantifications are also now described in the ‘Condition-invariant signal’ Materials and methods section.
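
For readers who would like to reproduce this kind of quantification, below is a minimal, purely illustrative Python sketch of the pairwise-angle computation together with a random-vector significance threshold in the spirit of Kobak et al. (2016). The actual analyses used the MATLAB dPCA toolbox; the matrix of dPC axes here is a hypothetical placeholder, not data from this study.

```python
# Illustrative sketch (not the paper's MATLAB analysis code): pairwise angles
# between dPC axes and a Monte Carlo significance threshold in the spirit of
# Kobak et al. (2016). `axes` is a hypothetical (n_electrodes x n_components)
# matrix of dPCA encoder axes.
import numpy as np

def pairwise_dpc_angles(axes, n_random=10000, alpha=0.01, seed=0):
    """Pairwise angles (degrees) between dPC axes, plus the angle below which
    two axes would be considered significantly non-orthogonal under a
    random-unit-vector null distribution."""
    rng = np.random.default_rng(seed)
    n_neurons, _ = axes.shape
    u = axes / np.linalg.norm(axes, axis=0, keepdims=True)   # unit-normalize each axis
    cosines = np.abs(u.T @ u)
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    # Null distribution: |cosine| between pairs of random unit vectors in neuron space.
    v = rng.standard_normal((n_neurons, 2, n_random))
    v /= np.linalg.norm(v, axis=0, keepdims=True)
    null_cos = np.abs(np.sum(v[:, 0, :] * v[:, 1, :], axis=0))
    sig_angle = np.degrees(np.arccos(np.quantile(null_cos, 1 - alpha)))
    return angles, sig_angle

# Example with placeholder axes: 192 electrodes, 5 dPCs.
angles, sig_angle = pairwise_dpc_angles(np.random.randn(192, 5))
print(np.round(angles, 1))
print(f"axes closer than {sig_angle:.1f} degrees are significantly non-orthogonal")
```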

As per your suggestion, this supplementary figure also includes each participant’s syllables datasets, as well as two new T5 replication datasets. These additional ‘T5-5words-A’ and ‘T5-5words-B’ datasets were collected as part of a follow-up study, but we are happy to pull them into this work and process them using the exact same analysis parameters used for the original datasets, to build more confidence in the robustness of our findings. We believe that the manuscript is substantially strengthened by showing the consistency of these two neural population dynamics motifs across more datasets.

The additional datasets are discussed in the following updated Results passages:

“We found that these two prominent population dynamics motifs were indeed also present during speaking. […] These results were also robust across different choices of how many dPCs to summarize the neural population activity with (Figure 4—figure supplement 2).”

“Lastly, we looked for rotatory population dynamics around the time of acoustic onset. Figure 5A shows ensemble firing rates projected into the top jPCA plane. […] As was the case for the condition-invariant dynamics, these results were also consistent across additional datasets (Figure 4—figure supplement 1E-H) and across the choice of how many PCA dimensions in which to look for rotatory dynamics (Figure 4—figure supplement 2B).”

Please note that the Figure 4—figure supplement 1C dPC pairwise angles insets are a superset of the information provided in the original Figure 4 CIS1 vs. CD1,2 insets, so we have removed the latter. We now use a similar visual format in the new Figure 4—figure supplement 1E to compare the CIS1 versus the top jPCA plane, which we think is an interesting comparison to document. Perhaps unsurprisingly, the CIS1 is nearly orthogonal to the jPC dimensions. We are careful to note, however, that this need not be the case: the model of a CIS that shifts dynamics into a different regime for movement generation does not require orthogonality (one could even imagine a CIS that shifts the neural state to a very different position within the exact same neural subspace, and thereby still acts as a “trigger” for rotatory dynamics). The end of the Results now reads:

“We note that existing models of how a condition-invariant signal “kicks” dynamics into a different state space region where rotatory dynamics unfold (Kaufman et al., 2016; Sussillo et al., 2015) do not require the CIS and rotatory dynamics to be orthogonal, but in these data we did observe that the CIS1 and jPCA dimensions were largely orthogonal (Figure 4—figure supplement 1E).”
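
As a side note for readers reproducing this comparison, the angle between a single dimension (such as CIS1) and a plane (such as the top jPCA plane) can be computed as a principal angle between subspaces. Below is a minimal illustrative sketch; the input vectors are random placeholders rather than the study’s actual dPC and jPC axes.

```python
# Minimal sketch (illustrative, not the paper's code): angle between one
# neural dimension (e.g., CIS1) and a two-dimensional plane (e.g., the top
# jPCA plane). Inputs are hypothetical column vectors in electrode space.
import numpy as np
from scipy.linalg import subspace_angles

n_electrodes = 192
cis1 = np.random.randn(n_electrodes, 1)        # placeholder CIS1 axis
jpc_plane = np.random.randn(n_electrodes, 2)   # placeholder jPC1/jPC2 axes

# subspace_angles returns the principal angles (radians) between column spaces;
# the smallest angle tells how close the 1-D axis comes to lying in the plane.
angle_deg = np.degrees(subspace_angles(cis1, jpc_plane).min())
print(f"angle between CIS1 and the jPCA plane: {angle_deg:.1f} degrees")
```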

4) While the classification accuracy is impressive, it’s important to dissociate the motor control component from others relating to perception and intention. The authors mention that responses during the audio prompt were small and thus they couldn't disambiguate whether they reflect perception, movement preparation, etc. (subsection “Speech-related activity in dorsal motor cortex”, first paragraph). Based on Video 1, it seems to one reviewer that there's some modulation during the prompts. Is it possible to classify rapid responses in a small window centered around the auditory cue? If decoding accuracy is significantly greater during articulation, it might provide support for the overall interpretation of the findings.

There is indeed (small) modulation after the prompt, which we now quantify more rigorously. Thank you for suggesting that we further quantify how much word/syllable-specific information is present in this prompt activity, using a decoding approach similar to the one applied to the speaking-epoch activity. We have added this analysis, which shows very poor prompt-epoch classification performance, to the manuscript (see Figure 3C). As you said, this further supports that this activity is related to speech production.

The updated Results passage is:

“We next performed a decoding analysis to quantify how much information about the spoken syllable or word was present in the time-varying neural activity. […] The much higher neural discriminability of syllables and words during speaking rather than after hearing the audio prompt is consistent with the previously enumerated evidence that modulation in this cortical area is related to speech production.”
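
For readers interested in the general shape of this comparison, the sketch below illustrates cross-validated classification of trial labels from firing rates in two different analysis windows. It is only a schematic: the classifier, features, and data are synthetic placeholders, not the decoder or datasets used in the paper.

```python
# Illustrative sketch (not the paper's decoder): compare cross-validated
# classification of which word was spoken, using firing rates from a window
# after the audio prompt versus a window around speech onset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_electrodes, n_words = 200, 192, 10
labels = rng.integers(0, n_words, n_trials)

# Placeholder firing-rate features (trials x electrodes) for each epoch: the
# prompt-epoch features carry no word information, while the speech-epoch
# features have word-specific means added on top of the noise.
rates_prompt = rng.poisson(5, (n_trials, n_electrodes)).astype(float)
rates_speech = rng.poisson(5, (n_trials, n_electrodes)).astype(float)
word_means = rng.normal(0, 2, (n_words, n_electrodes))
rates_speech += word_means[labels]

clf = LogisticRegression(max_iter=2000)
for name, X in [("prompt epoch", rates_prompt), ("speech epoch", rates_speech)]:
    acc = cross_val_score(clf, X, labels, cv=5).mean()
    print(f"{name}: {acc:.2f} cross-validated accuracy (chance = {1 / n_words:.2f})")
```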

5) Similarly, can the authors explore whether there are any rotation motifs around the prompt? This would help answer the question whether this is an inherent network property of the area, or whether it is specific for movement planning.

Thank you for this valuable suggestion. We have now performed the same jPCA rotatory dynamics analysis on an epoch (of the same length as the main Figure 5 analyses) shortly after the prompt. These results are shown in the newly added Figure 4—figure supplement 1H and reveal no significant rotatory dynamics after the prompt. In the interest of space, and since these prompt rotations were not significant, for this analysis we show only the variance explained summary statistic (directly below the significant speech-epoch statistics, for contrast) and not the neural trajectories in the top jPCA plane. In addition to being relevant to the wider question of how ubiquitous (across behaviors) and specific (across time epochs) neural rotations are, this new analysis also provides an empirical control showing that jPCA does not trivially find significant rotations in any neural data.

These new results are described in the Results:

“Lastly, we looked for rotatory population dynamics around the time of acoustic onset. Figure 5A shows ensemble firing rates projected into the top jPCA plane. […] As was the case for the condition-invariant dynamics, these results were also consistent across additional datasets (Figure 4—figure supplement 1E-H) and across the choice of how many PCA dimensions in which to look for rotatory dynamics (Figure 4—figure supplement 2B).”
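
For readers unfamiliar with how a rotation-fit statistic of this kind is obtained, the core of jPCA compares a linear dynamics fit constrained to be skew-symmetric (pure rotation) against an unconstrained linear fit of the neural state’s derivative (Churchland et al., 2012). The sketch below is a simplified illustration of that comparison on a synthetic trajectory; it is not the jPCA code used for the analyses in the paper.

```python
# Simplified illustration (not the jPCA toolbox used in the paper) of the key
# jPCA comparison: how much of the neural state's derivative is captured by a
# purely rotatory (skew-symmetric) linear dynamics fit versus an unconstrained
# linear fit. `X` is a hypothetical (time x dims) PCA-reduced trajectory.
import numpy as np

def rotation_fit_quality(X, dt=0.01):
    dX = np.diff(X, axis=0) / dt           # state derivative (finite difference)
    Xc = X[:-1]                            # state at the start of each step
    k = X.shape[1]

    # Unconstrained least-squares fit dX ~ Xc @ M_full.
    M_full, *_ = np.linalg.lstsq(Xc, dX, rcond=None)

    # Constrained fit with M skew-symmetric: parameterize by the upper-triangular
    # entries, build one regressor per entry, and solve the resulting least squares.
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    cols = []
    for i, j in pairs:
        B = np.zeros((k, k)); B[i, j], B[j, i] = 1.0, -1.0
        cols.append((Xc @ B).ravel())
    coefs, *_ = np.linalg.lstsq(np.column_stack(cols), dX.ravel(), rcond=None)
    M_skew = np.zeros((k, k))
    for (i, j), c in zip(pairs, coefs):
        M_skew[i, j], M_skew[j, i] = c, -c

    def frac_captured(M):
        # Fraction of the squared derivative captured by the dynamics fit.
        resid = dX - Xc @ M
        return 1 - np.sum(resid**2) / np.sum(dX**2)

    return frac_captured(M_skew), frac_captured(M_full)

# Synthetic example: a noisy 2-D rotation embedded in 6 dimensions. The rotatory
# fit should capture nearly as much of the derivative as the unconstrained fit.
t = np.arange(0, 1, 0.01)[:, None]
X = np.hstack([np.cos(12 * t), np.sin(12 * t), 0.05 * np.random.randn(len(t), 4)])
print(rotation_fit_quality(X))
```

Applying the same fit to a window with no consistent rotational structure (for example, the post-prompt epoch control described above) yields a rotatory fit quality close to that expected by chance, which is the logic behind reporting only the variance-explained summary statistic for that epoch.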

6) The neural population analyses look quite different for the two patients: 1) for T8 there's only one CI dPC, and it explains roughly the same amount of variance as the leading CD dPC, whereas for T5 there are two CI dPCs that explain several times more variance than any CD dPC; 2) the rotational structure identified with jPCA is not above chance level for T8, only for T5. We understand that these differences may very well be explained by the worse quality of T8's arrays, but the authors should be more cautious in some parts of the paper given these differences and their n=2, e.g., in the Abstract. Moreover, this difference should be addressed to a greater extent in the Discussion.

Thank you for your feedback that the differences in the neural population analyses between the two participants were not sufficiently discussed, and that these differences warrant caution when interpreting the results. We have made a number of manuscript changes which we believe address this:

First, after improving our analysis methods for quantifying population-wide task-related modulation, we realized that our speech initiation analysis epoch of 200 ms before the go cue to 400 ms after it, which we had originally selected when analyzing T5’s data, was a poor choice for participant T8 because his recorded neural modulation occurs later than T5’s. This choice of a premature (minimally modulating) epoch exacerbated the differences between participants (in addition to the worse array quality, as mentioned by the reviewers). We have now changed T8’s CIS analysis epoch to 100 ms to 700 ms after the go cue, which re-focuses this analysis on a post-go “modulation ramp-up” epoch that is more comparable to T5’s. This yields CIS results that look much more similar between the two participants (see Figure 4).

We also described the reasoning for the different epochs in the Materials and methods:

“Trial-averaged firing rates were calculated from a speech initiation epoch of 200 ms before go cue to 400 ms after the go cue for T5, and 100 ms to 700 ms after the go cue for T8. T8’s epoch was shifted later relative to T5’s to account for T8’s later neural population activity divergence from the silent condition (Figure 1—figure supplement 4B).”
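
Purely as an illustration of the windowing convention described above, the following sketch computes a trial-averaged firing rate within a participant-specific epoch relative to the go cue; the helper function, spike times, and values are hypothetical placeholders rather than the paper’s analysis code.

```python
# Illustrative sketch of participant-specific epoch windowing relative to the
# go cue (not the paper's code). Spike-time inputs are made up.
import numpy as np

def trial_averaged_rate(spike_times_per_trial, go_times, window):
    """Mean firing rate (Hz) across trials within a (start, stop) window,
    given in seconds relative to each trial's go cue."""
    start, stop = window
    rates = [np.sum((st >= go + start) & (st < go + stop)) / (stop - start)
             for st, go in zip(spike_times_per_trial, go_times)]
    return float(np.mean(rates))

# Participant-specific speech-initiation epochs, in seconds relative to the go cue.
epochs = {"T5": (-0.2, 0.4), "T8": (0.1, 0.7)}

# Toy usage with made-up spike times for three trials of one electrode:
spikes = [np.array([0.95, 1.10, 1.30]), np.array([1.05, 1.20]), np.array([1.15])]
go_times = [1.0, 1.0, 1.0]
print({p: trial_averaged_rate(spikes, go_times, w) for p, w in epochs.items()})
```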

Second, we have changed the Results section presenting these results to address the differences between the two participants’ CIS results:

“We found that these two prominent population dynamics motifs were indeed also present during speaking.[…] This lower signal-to-noise ratio can also be appreciated in how the “elbow” of T8’s cumulative neural variance explained by PCA or dPCA components (Figure 4—figure supplement 1A, B) occurs after fewer components and explains far less overall variance.”

Third, we now also revisit the non-significant T8 rotatory dynamics result in the updated Discussion section, which now reads:

“Our third main finding is that two motor cortical population dynamical motifs present during arm movements were also significant features of speech activity. We observed a large condition-invariant change at movement initiation in both participants, and rotatory dynamics during movement generation in the one of two participants whose arrays recorded substantially more modulation.”

Fourth, we have added the n=2 to the Abstract:

“Speaking is a sensorimotor behavior whose neural basis is difficult to study with single neuron resolution due to the scarcity of human intracortical measurements. We used electrode arrays to record from the motor cortex ‘hand knob’ in two people with tetraplegia, an area not previously implicated in speech.”

Relatedly, we also now address the differences in the two participants’ decoding performance in the Discussion:

“That said, these results are only a first step in establishing the feasibility of speech BCIs using intracortical electrode arrays. […] We also observed worse decoding performance in participant T8, highlighting the need for future studies in additional participants to sample the distribution of how much speech-related neural modulation can be expected, and what speech BCI performance these signals can support.”

Also, please note that T5’s CIS dPCA plots in Figure 4 have changed very slightly from the original submission: when revisiting these analyses, we noticed that we had been insufficiently regularizing the dimensionality reduction/variance partitioning, such that the dPCs did not generalize as well to held-out data. We have now used the dPCA code’s built-in cross-validated regularization parameter optimization and verified that the resulting dimensionality reduction generalizes well: when we perform dPCA on only half the data, the resulting dimensions and variance partitions are consistent when the other (held-out) half of the data is projected into these dPCs. We have added this detail to the ‘Condition-invariant signal’ Materials and methods section:

“Default dpca function parameters were used, with parameters numRep = 10 (repetitions for regularization cross-validation) and simultaneous = true (indicating that the single-trial neural data were simultaneously recorded across electrodes) for the dpca_optimizeLambda and dpca_getNoiseCovariance functions.”
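
The analysis itself used the MATLAB dPCA toolbox routines named above. The sketch below only illustrates the general principle those routines embody, choosing a ridge-style regularization strength by cross-validation and then checking generalization on held-out data, using generic synthetic data and scikit-learn; none of the variable names or values correspond to the study’s datasets.

```python
# Illustrative sketch of cross-validated regularization selection plus a
# held-out generalization check (the paper used dpca_optimizeLambda in MATLAB;
# this is only the generic principle, on synthetic data).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))                              # synthetic features
y = X[:, :3] @ rng.standard_normal(3) + 0.5 * rng.standard_normal(100)

# Pick the regularization strength that maximizes cross-validated fit quality.
lambdas = np.logspace(-4, 2, 13)
cv_scores = [cross_val_score(Ridge(alpha=lam), X, y, cv=5).mean() for lam in lambdas]
best_lambda = lambdas[int(np.argmax(cv_scores))]

# Generalization check in the spirit described above: fit on one half of the
# trials with the chosen lambda and evaluate on the held-out half.
half = len(X) // 2
model = Ridge(alpha=best_lambda).fit(X[:half], y[:half])
print("chosen lambda:", best_lambda,
      " held-out R^2:", round(model.score(X[half:], y[half:]), 2))
```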

7) The authors suggest that the hand area might play a role in speech production. Here they seem to conflate correlation with causation – their findings do not provide any support that the decodable information available in the hand area is actually utilised during speech motor control.

Thank you for pointing out that, as originally written, our Discussion came across as suggesting that these data indicate a causal role of the hand area in speech production. We apologize for this, as we absolutely agree that we have only observed correlation with speaking, and no evidence for a causal role. We have updated several sections of the Discussion, reproduced below, to be more cautious when speculating about what role, if any, this activity might have in speech or in coordinating speech and hand movements:

“There are three main findings from this study. First, these data suggest that ‘hand knob’ motor cortex, an area not previously known to be active during speaking (Breshears et al., 2015; Dichter et al., 2018; Leuthardt et al., 2011; Lotte et al., 2015), may in fact participate in, or at least receive correlates of, the neural computations underlying speech production.”

“Assuming that these results are not due to injury-related remapping, we are left with the question of why this speech-related activity is found in dorsal “arm and hand” motor cortex. […] We anticipate that it will take substantial future work to understand why speech-related activity co-occurs in the same motor cortical area as arm and hand movement activity, but that this line of inquiry may reveal important principles of how sensorimotor control is distributed across the brain (Musall et al., 2019; Stringer et al., 2019).”

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Source data 1. Breathing data.
    elife-46015-data1.zip (24.2MB, zip)
    Source data 2. Classification data.
    elife-46015-data2.zip (13.3MB, zip)
    Source data 3. Dynamics data.
    elife-46015-data3.zip (22.7MB, zip)
    Source data 4. PSTHs sorted units data.
    elife-46015-data4.zip (49.7MB, zip)
    Source data 5. Syllables PSTHS TCs data.
    elife-46015-data5.zip (90.5MB, zip)
    Source data 6. Tuning and behavior data.
    elife-46015-data6.zip (13.9MB, zip)
    Source data 7. Video go aligned data.
    elife-46015-data7.zip (73.8MB, zip)
    Source data 8. Video and dynamics speak aligned data.
    elife-46015-data8.zip (74.8MB, zip)
    Source data 9. Words PSTHs TCs data.
    elife-46015-data9.zip (93.3MB, zip)
    Transparent reporting form

    Data Availability Statement

    The sharing of the raw human neural data is restricted due to the potential sensitivity of these data. These data are available upon request to the senior authors (KVS or JMH). To respect the participants' expectation of privacy, a legal agreement between the researcher's institution and the BrainGate consortium would need to be set up to facilitate the sharing of these datasets. Processed data are provided as source data, and analysis code is available at https://github.com/sstavisk/speech_in_dorsal_motor_cortex_eLife_2019 (copy archived at https://github.com/elifesciences-publications/speech_in_dorsal_motor_cortex_eLife_2019).

