Science Advances. 2023 Jun 9;9(23):eadh0478. doi: 10.1126/sciadv.adh0478

Decoding and synthesizing tonal language speech from brain activity

Yan Liu 1,2,3,4,, Zehao Zhao 1,2,3,4,, Minpeng Xu 5,6,, Haiqing Yu 5, Yanming Zhu 1,2,3,4, Jie Zhang 1,2,3,4, Linghao Bu 1,2,3,4,7, Xiaoluo Zhang 1,2,3,4, Junfeng Lu 1,2,3,4,8,*, Yuanning Li 9,*, Dong Ming 5,6, Jinsong Wu 1,2,3,4,*
PMCID: PMC10256166  PMID: 37294753

Abstract

Recent studies have shown the feasibility of speech brain-computer interfaces (BCIs) as a clinically valid treatment for restoring speech in patients with communication disorders who speak nontonal languages. However, tonal language speech BCI is challenging because additional precise control of laryngeal movements is required to produce lexical tones. Thus, the decoding model should emphasize features from the tone-related cortex. Here, we designed a modularized multistream neural network that directly synthesizes tonal language speech from intracranial recordings. The network decoded lexical tones and base syllables independently via parallel streams of neural network modules inspired by neuroscience findings. The speech was synthesized by combining tonal syllable labels with nondiscriminant speech neural activity. Compared to commonly used baseline models, our proposed models achieved higher performance with modest training data and computational costs. These findings suggest a potential strategy for approaching tonal language speech restoration.


Tonal language speech can be decoded and synthesized from direct neural recording using a modularized multistream neural network.

INTRODUCTION

Lexical tones play an important role in providing lexical information and differentiating individual words in the speech of tonal languages. More than 60% of the languages in the world are tonal (1). Approximately 2 billion people speak tonal languages, including most Sino-Tibetan languages, the entire Tai-Kadai family, and so on (2). The most important characteristic of these languages is the use of pitch to distinguish lexical and grammatical meaning. A syllable with the same segmental features (consonants and vowels) but different inherent pitch contours can represent different words (3). Taking Mandarin as an example, the segmental feature “ma” with four tones (tone 1, tone 2, tone 3, and tone 4) can mean “mother (妈)”, “hemp (麻)”, “horse (马)”, and “scold (骂),” respectively (4). There are more than 6000 characters (single syllable words) used daily in Mandarin. However, the combinations of segmental features constitute only approximately 400 unique base syllables (syllables without tones). With the four lexical tones added, this number rises to approximately 1300 unique tone syllables (some combinations do not exist in Mandarin). These 1300 tone syllables cover all the commonly used characters in Mandarin. Therefore, accurately producing and identifying tones are critical for spoken communications in tonal languages.

Several recent studies have shown the feasibility of synthesizing the acoustic sound of short sentences and a few specific words in nontonal languages, such as English (5–9) and Japanese (10), from intracranial neural recordings, such as electrocorticography (ECoG). These advances not only provide methods for anarthria treatment (11) but also increase the communication efficiency of speech brain-computer interfaces (BCIs). Previous speech BCIs were mainly built upon the neural basis of speech production in the ventral sensorimotor cortex (vSMC), where spatiotemporal patterns of population neural activity encode articulatory movements of the vocal tract that generate phoneme sequences of speech (12–15). However, as a tonal language, Mandarin Chinese differs significantly from nontonal languages in neurolinguistics (16, 17) and natural language processing techniques (18–20). In addition to the production of phoneme sequences, which is the critical component for nontonal languages, tone production is also an essential and complex part of both the acoustic model and the vocoder in tonal language synthesis (21). Given the significant differences in the articulation of pitch dynamics between tonal and nontonal languages, the corresponding neural decoding algorithms should also differ. Therefore, it is not straightforward to adapt previous advances in English BCI studies to BCI applications for speakers and patients of tonal languages. As a result, BCI frameworks and decoding algorithms need to be developed specifically for tonal languages.

In this study, we aimed to synthesize speech in a tonal language from invasive neural recordings using high-density ECoG. Considering that a tonal syllable can be divided into tone and base syllable that are independent of each other, we proposed a divide-and-conquer framework. We hypothesized that tone and base syllable can be decoded separately from the neural activity and then tonal speech can be synthesized using the combination of the decoded tone and base syllable (22, 23). To test this hypothesis, we designed a multistream modularized neural network model. Specifically, we used different neural network modules targeting functionally distinct neural populations in speech-related brain areas. These modules decode tone labels and base syllable labels in parallel and then synthesize tonal syllable speech by combining the outputs of the tone and syllable modules.

RESULTS

Distinct spatiotemporal coding of lexical tones and tonal syllables in the cortical language network

The first step toward developing a speech BCI for a tonal language was to understand how lexical tones and tonal syllables were represented in the cortex during speech production of tonal languages. To accomplish this, we recorded the neural activity of five native Mandarin-speaking participants who underwent awake neurosurgery to treat brain tumors. During the experiment, these participants were instructed to speak eight designated tonal syllables: mā (55), má (35), mǎ (214), mà (51), mī (55), mí (35), mǐ (214), and mì (51); their brain activity was recorded by temporarily implanted high-density ECoG grids (as shown in Figs. 1A and 2A). We computed the amplitude of individual electrode signals in the high-γ band (70 to 150 Hz), which is known to reflect local neuronal activity (23, 24). By comparing the average high-γ response profiles under different acoustic-phonetic conditions, such as different tones or syllables, we tested whether spatially distinct populations encode different aspects of tonal speech, such as tone and syllable. On the basis of their differential response profiles, we sorted the electrodes into five categories: tone discriminative, syllable discriminative, tone and syllable discriminative, nondiscriminative, and nonresponsive. In total, 46 electrodes were found to be tone discriminative across the five participants, while 44 electrodes were syllable discriminative. In addition, 14 electrodes were both tone and syllable discriminative (as shown in Fig. 1A). This finding suggested that most tone discriminative neural populations were independent of the syllable discriminative populations (Fig. 1B). Furthermore, the mean difference between tones in the tone discriminative electrodes was significantly smaller than the mean difference between syllables in the syllable discriminative electrodes (t test, P < 0.001; fig. S1). As a result, we relied on different temporal coding strategies to decode tonal and syllabic information separately from these distinct categories of electrodes.

Fig. 1. Electrode coverage and category.


(A) Anatomical reconstructions of all participants. The locations of the ECoG electrodes were plotted with colored discs. The colors indicated the electrode categories (see Materials and Methods). (B) Venn diagram of all speech-responsive electrodes in all participants, broken down into four categories (the nonresponsive category was not plotted). (C) The averaged high-γ responses to different lexical tones during tone production from five example electrodes, time-locked to speech onsets [electrode locations were plotted in (A) as stars in different colors]. Black dots indicated time points of significance. For the top row in (C), black dots indicated the time when there was a significant difference in the mean high-γ activity between the two syllables (t test, P < 0.05, Bonferroni corrected). For the bottom row in (C), black dots indicated the time when there was a significant difference in the mean high-γ activity between the four tones (F test, P < 0.05, Bonferroni corrected).

Fig. 2. The model architecture and speech synthesis pipeline.


(A) Each participant articulated the eight tonal syllables and their neural activity was recorded with ECoG grids (256 electrodes) covering the peri-Sylvian cortices. The analytic amplitudes of the high-γ activity (70 to 150 Hz) were extracted and clipped to the length of 1 s and supplied as input to the speech decoding model. The electrodes were classified into one of five categories and then fed into different decoding streams according to their category assignments. The illustration of the five-level tone marks demonstrated the pitch contours of flat-high tone (tone 1), medium-rising tone (tone 2), low-dipping tone (tone 3), and high-falling tone (tone 4) in Mandarin. (B) Tone discriminative electrodes (e.g., electrode E1 in Fig. 1C) and tone and syllable discriminative electrodes (e.g., electrode E3 in Fig. 1C) were fed into a parallel CNN-LSTM network to generate the tone label. (C) Syllable discriminative electrodes (e.g., electrode E2 in Fig. 1C) and tone and syllable discriminative electrodes (e.g., electrode E3 in Fig. 1C) were fed into a sequential CNN-LSTM network to generate the syllable label. (D) The synthesis network combined the signals from nondiscriminative electrodes and the outputs of (B) and (C) to generate the Mel spectrogram of speech sound. (E) The sound wave was synthesized from the Mel spectrogram via the Griffin-Lim algorithm (see audio S1).

Speech decoder design

We designed a modularized multistream neural network model to decode and synthesize speech from ECoG recordings. The whole model was divided into two main parts, as shown in Fig. 2. The first part was a label generator, which included a tone label generator (Fig. 2B) and a syllable label generator (Fig. 2C). The second part was a synthesizer model that generated speech (Fig. 2D). The label generators took input from tone and syllable discriminative populations and generated discretized tone labels and syllable labels. The tone label generator decoded tone labels from the tone discriminative electrodes as well as the tone and syllable discriminative electrodes. To take full advantage of the spatiotemporal coding of tones, the generator consisted of a parallel convolutional neural network–long short-term memory (CNN-LSTM) dual-stream input that exploited both spatial and temporal coding. The combined output was then fed forward through additional sequential CNN-LSTM layers to generate discrete tone labels. The syllable label generator mainly consisted of a sequential CNN-LSTM network that performed spatial decoding. This generator decoded syllable labels from both the electrodes of the syllable discriminative category and the electrodes of the tone and syllable discriminative category.

A unique and critical component of the speech decoder was the synthesizer module. This module combined the label outputs from the discriminative electrodes and the input from nondiscriminative speech electrodes to generate a Mel spectrogram of speech. We hypothesized that the labels transcribed with five-level tone marks and one-hot labels of syllables contained sufficient information about the activity of the jaw, lips, tongue, and larynx, which were responsible for the segmental features. This part was learned by an LSTM layer. Other suprasegmental features, such as intensity and duration, were synthesized by a CNN that took input from the nondiscriminative electrodes (e.g., electrode E4 in Fig. 1C). The synthesizer transformed these two streams of time domain signals, including the nondiscriminative electrode input and the tonal syllable labels transcribed with five-level tone marks, into the frequency domain Mel spectrogram. This was done using two-dimensional (2D) CNNs (25). Last, we used the Griffin-Lim algorithm to transform the Mel spectrogram into a sound wave at 24,414 Hz (Fig. 2E).

Label generator performance

We evaluated the performance of our proposed model and compared it with several common baselines. The baseline models included (i) a fully convolutional network [Visual Geometry Group Network consisting of 16 convolutional layers (VGG16)] (26), with all electrodes as input; (ii) a sequential CNN-LSTM model (Fig. 2C), with all tone/syllable discriminative electrodes as input; and (iii) a parallel CNN-LSTM model (Fig. 2B), with all tone/syllable discriminative electrodes as input.

With the limited data volume, we speculated that tonal syllable decoding with the whole ECoG grid would be greatly influenced by the electrical signal from redundant electrodes (nondiscriminative and nonresponsive electrodes) as the internal noise of the decoding task. To test this, we used all 256 channels as input data to decode the tonal syllable via a fully convolutional network (VGG16). Although VGG16 is one of the most popular models in deep learning and is an excellent backbone for classification, it could not converge to practical accuracy (Fig. 3C; mean accuracy = 15.6 to 33.4%, chance level = 12.5%). On the other hand, we selected the tone discriminative electrodes, syllable discriminative electrodes, and tone and syllable discriminative electrodes as the input data to reduce input feature dimensions. When we fed these input data into a simple neural network with six convolutional layers and one dense layer (noted as the CNN model in Fig. 3C), the accuracy increased to 37.1 to 53.8%, which was significantly higher than VGG16 and the chance level (Fig. 3C; P < 0.001, t test). This result suggested that suitable preprocessing of the raw ECoG data significantly improved the computational tractability of the neural network models.

Fig. 3. The tonal syllable decoding accuracy of different models.


(A and B) Bar plots showed the decoding accuracy (means ± SEM) of sequential (red bar) and parallel (blue bar) CNN-LSTM network using (A) syllable discriminative electrodes and (B) tone discriminative electrodes. The blue dashed line indicates the syllable chance level. The red dashed line indicates the tone chance level. *P < 0.05 and **P < 0.01; ns, nonsignificant; two-sided t test for independent samples. (C) Bar plots showed the decoding accuracy (means ± SEM) of the label generator, sequential and parallel CNN-LSTM network, CNN, and VGG16 model (bars are color-coded by models). *P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 0.0001; compared to the accuracy of the label generator of the same subject, two-sided t test for independent samples. See Fig. 2 (B and C) for the architectures of sequential and parallel CNN-LSTM networks and the whole label generator.

However, the results were still not ideal for speech synthesis purposes. We hypothesized that a convolutional network with LSTM could be more effective in decoding the time series sequence compared to the network without sequential recurrence. Therefore, we designed a sequential CNN-LSTM model and a parallel CNN-LSTM model based on parallel CNN-LSTM network research in spectrum sensing (27) to test whether the recurrent network layer could increase decoding accuracy. However, there was no significant difference in decoding accuracy compared to the CNN (39.3 to 56.8%, n = 40 for each participant, P > 0.1, t test). Nonetheless, the parallel CNN-LSTM network enhanced the time-series learning ability of the model. Its decoding accuracy was between 45 and 58.4%, which showed a significant increase in participant 1 (PA1) and PA3 compared to the models described above. This result indicates that the introduction of LSTM could slightly increase accuracy. However, considering the limited data volume, a complex Recurrent Neural Network (RNN)/LSTM system might overfit quickly. Given these results, we preferred to optimize the feature engineering of input data rather than establish a complex network.

Inspired by the nature of distributed spatial coding of articulatory movements in the SMC (Fig. 1), we hypothesized that the input data should be divided into a minimal set of independent variables to predict the outcome at an acceptable level. This means that the tones and base syllables should be decoded separately as a dual stream. Next, we used the sequential and parallel CNN-LSTM networks to decode the tones and syllables separately, to determine the optimal model architecture. There was no significant difference in the syllable decoding accuracy between the sequential CNN-LSTM and parallel CNN-LSTM networks (Fig. 3A). However, their training times for a single epoch were 30 and 65 ms, respectively. Thus, we used the sequential CNN-LSTM network as the syllable label generator to decode the syllables. On the other hand, the parallel CNN-LSTM network achieved higher accuracy in tone decoding in PA1, PA3, PA4, and PA5, compared to the sequential network (Fig. 3B; P < 0.05 in PA3 and P < 0.01 in PA1, PA4, and PA5, t test). This result showed that the parallel CNN-LSTM network was better at learning features with more fine-grained differences. With this dual-stream model, the mean accuracy of tonal syllable decoding increased to 55.7 to 75.6%, and the highest accuracy was 91.4% (PA5). Both the mean and highest accuracy were significantly higher than those of all baseline models (n = 40 for each participant, P = 0.02 in PA2 and P < 0.01 in PA1, PA3, PA4, and PA5, t test). This result suggests that tonal syllable decoding with limited data volume could be resolved by decoding the tones and syllables separately with small-scale networks.

To sum up, the label generator was designed as a decoding model that generated base syllable and tone labels separately through spatiotemporal decoding of ECoG signals. This design could significantly increase decoding efficiency with the limited ECoG data volume.

Overall speech synthesizing performance

After generating the labels, we next investigated whether the Mel spectrogram of speech could be directly decoded from a combination of the tone and syllable labels as well as the nondiscriminative speech-responsive electrodes. To test the synthesis model's performance, we used a combination of subjective and objective metrics. Many studies have reported spectral distortion of synthesized speech from ground truth samples using mean Mel cepstral distortion (MCD). The spectrogram contains features such as tone, fundamental frequency, and duration; therefore, MCD can reflect the tonal differences (a lower MCD is better). For our five participants, the median MCD scores of decoded speech ranged from 2.67 to 3.19 dB (Fig. 4A), which was below the maximal value of 8 dB considered acceptable for voice recognition systems (28) (n = 32 for each participant, P < 1 × 10−6, Wilcoxon signed-rank test). For the four tones, the MCD ranged from 2.53 to 3.20 dB, which was also significantly less than 8 dB (Fig. 4B; n = 40 for each tone, P < 1 × 10−7, Wilcoxon signed-rank test). The mean MCDs among the four tones were different (P = 1 × 10−3, Kruskal-Wallis test), but there was no difference among participants (P = 0.09, Kruskal-Wallis test). We also used two different subjective metrics to evaluate the system's performance. One was the tone intelligibility assessment (IA), which tested whether naïve listeners could identify the correct tone in the speech sound. The mean IA of the four synthesized tones ranged from 81.7 to 92.3%, and the mean IA of the ground truth ranged from 85.3 to 93.1% (Fig. 4C). The IA of the synthesized speech and that of the ground truth were significantly correlated (Fig. 4D; Pearson's correlation r = 0.70, n = 160, P = 3.15 × 10−25). In addition, we used the mean opinion score (MOS) to evaluate the sound quality of the decoded waveform. The MOS of the decoded sound wave was 3.86 ± 0.03 (means ± SEM), while the MOS of the ground truth was 4.30 ± 0.03 (means ± SEM). Although the score indicated a small amount of increased noise in the decoded speech compared to the ground truth, this is still an acceptable result given the limited training volume (Fig. 4E). The Pearson's correlation r between the MOS of the synthesized speech and the ground truth was 0.68 (Fig. 4F; n = 160, P = 3.93 × 10−23).

Fig. 4. Evaluation of the synthesized speech sound quality.


(A) A pair of examples of the synthesized and original syllable sound spectrograms (compressed to an 80 × 44 matrix Mel spectrogram). Their MCD was 2.16 dB. (B) Bar plot showing the MCD (means ± SEM) of different lexical tones in five participants. (C) Bar plot showing the tone IAs (means ± SEM) of the synthesized sound wave and the ground truth by 31 listeners. (D) Scatterplot showing the correlation between the tone IAs of the synthesized sound wave and the ground truth (n = 160, Pearson’s correlation r = 0.70, P = 3.15 × 10−25). (E) Bar plot showing the MOS (means ± SEM) of the synthesized sound wave and the ground truth evaluated by 31 listeners. (F) Scatterplot showing the correlation between the MOS of the synthesized sound wave and the ground truth (n = 160, Pearson’s correlation r = 0.68, P = 3.93 × 10−23).

DISCUSSION

In this study, we proposed a modularized multistream neural network that directly synthesized the speech of tonal language from invasive neural recordings. By evaluating both subjective and objective quality metrics, we demonstrated that our proposed method achieved remarkably higher performance than classical baseline deep neural network (DNN) methods.

This study is the first attempt at speech sound synthesis and text decoding from ECoG in a tonal language. Previous studies of the motor control of speech production have mapped the larynx, lip, jaw, and tongue motor cortex through ECoG (12, 14, 29). On the basis of these findings in neural coding, DNN models were designed to decode the cortical activity of articulatory movement, turn the movement into cepstral features, and synthesize speech (5). This bionic system simulated the physiological process from cortex to articulation. Most subsequent works were inspired by this system and tried to decode ECoG signals into words or short sentences using more complex neural networks (5–10). On the basis of these previous works, we conjecture that this bionic system can theoretically inspire tonal language decoding from ECoG. Although lexical tones influence lexical meaning as much as nontonal syllables do (30), the amplitude of tone discriminant signals in tone discriminative electrodes is significantly smaller than that of the syllable discriminant signals in syllable discriminative electrodes (fig. S1). This is likely due to the comparatively smaller trajectory amplitude of larynx movement in contrast to other articulatory organs such as the tongue, lips, and jaw. As a result, the coding of larynx movement has to be emphasized and handled separately from that of other articulators (15, 31, 32). Thus, during speech decoding of tonal languages, it is necessary to focus on the spatiotemporal features of the tone-related motor cortex, especially the neural activity of the laryngeal motor cortex (12, 33).

Our experimental results confirm that the differences among tones detected by ECoG recordings are too weak to be accurately classified by general DNN algorithms without a specific network architecture that exploits prior knowledge of neural coding. Our results also reveal the complex mechanism of tone production. The findings of this study (Fig. 1) suggest that tone information representation and acoustic-phonetic representation of base syllables rely on parallel and independent neural populations and on different spatiotemporal coding patterns. Therefore, we designed parallel streams of neural network modules to independently decode lexical tones and base syllables using different computational architectures.

In addition to the dichotomy of lexical tones and base syllables, other acoustic parameters, such as timbre, sound intensity, and duration, do not contribute to tonal syllable discrimination but are still crucial for the integrity of the synthesized speech. Therefore, we integrated the nondiscriminative electrodes, representing the cortex participating in vocalization but not concerned with tonal syllable differentiation, and the labels from all the discriminative electrodes as the input layer to establish the synthesis model for the Mel spectrogram. This step imitates human articulation activity from the cortex to the mouth. As there are distinct spatiotemporal coding patterns for lexical tones and base syllables in the cortical language network, the coding of the articulation process in tonal language is supposed to be parallel and relatively independent. Therefore, we modularized the neural decoding of tones and base syllables, as well as speech synthesis. This design reduces the number of electrodes fed into each stream of the model, reducing the computational cost. It is one of the most crucial differences between this study and previous speech BCI research (5, 9), and it provides a strategy for decoding tonal language with BCI. This is the first attempt to apply modularization of the different steps in a speech-language BCI. Prior knowledge that comes directly from the neural coding properties is implemented in the model architecture, which is difficult to learn purely via inductive bias in end-to-end optimization with limited training data. In addition, because the electrodes are selected on the basis of their coding properties, the input data form a matrix in which the x axis is the ECoG time point and the y axis is the selected electrode. As a result, the model is agnostic to the actual spatial locations of the electrodes and only considers their functional coding properties. In our model, each electrode represents a local neural population. Neighboring electrodes in the matrix have a strong functional relationship: they share the same functional category rather than being spatial neighbors in the brain. Therefore, we designed all the convolutional filters with size (1, n) to keep the compression within one electrode instead of across several electrodes. The convolutional feature maps are integrated in the dense layer. In all, we established a method for BCI speech synthesis that involves selecting appropriate features at different stages of the neural network based on the distinct spatiotemporal coding characteristics of lexical tones and base syllables in the vSMC. Our proposed method showed increased tone and syllable decoding accuracy and improved speech sound synthesis performance, using an artificial neural network model of moderate scale and computing cost.

Theoretically, this framework can be extended and applied to more general settings in tonal languages. The tonal syllable in most tonal languages and pitch-accent languages can be regarded as a constituent of tone/pitch-accent and base syllables. As described in Introduction, daily used Mandarin characters can be turned into a 4 × 416 tonal syllable matrix, where 4 is the number of tones and 416 is the number of base syllables. In our model, tone and base syllable decoding streams are parallel and independent. Our model is also applicable to other dialects of Chinese, such as Cantonese and Wu Chinese, where we need to extend the tone decoder module to decode seven or nine tones. Notably, the five-level tone marks are also applicable to these dialects, and most of the base syllables are also shared in the Chinese language system. This indicates that the model can be modified to be applied to the entire Chinese language system. Furthermore, a pitch-accent language, such as Japanese, can be regarded as a matrix of pitch drop by mora. Therefore, our model can be modified to a binary-output pitch generator and a 111-category mora decoder to be applied to Japanese. A more challenging task is the naturalistic speech of tonal languages, where the actual acoustic cues of lexical tones are highly variable both across speakers and utterances (34). Our single word decoding model can be used as an essential building block in naturalistic speech settings, and language models can be used to improve model performance.

In conclusion, this study explored the possibility of tonal language speech synthesis from neural decoding with high-density ECoG. With limited data obtained during awake surgery, we designed a multistream modularized neural network to decode the tones and syllables separately from the discriminative electrodes. We then integrated them with the nondiscriminative speech electrodes to generate a Mel spectrogram of speech. Such a neural network design is optimized on the basis of the specific neural coding characteristics present in the vSMC. This study highlights the importance of feature engineering in BCI models, especially in scenarios with limited training data. Our research offers a potential solution for restoring speech using BCI technology for patients with dysarthria or aphasia in Chinese and other tonal languages.

MATERIALS AND METHODS

Participants

This research involved five participants (four males aged 44, 53, 39, and 54 years and one female aged 37 years) who underwent awake language mapping during their brain tumor surgeries at Huashan Hospital in Shanghai, China. Two 128-channel high-density subdural electrode arrays were temporarily placed on the lateral surface of the brain during the surgery to record neural activity. An experienced neurosurgeon placed the grids, ensuring clinical exposure and avoiding the tumor area. The study received approval from the Huashan Hospital Institutional Review Board of Fudan University (KY2017-437) and was conducted in compliance with all relevant ethical regulations. Participants were invited to join the study only if there was a clinical necessity to perform awake surgery for safe resection of the tumor and protection of the eloquent area. Before the surgery, the surgeon informed the participants that their participation in the research was completely voluntary and that the task was for research purposes. Each participant was fully informed and provided written informed consent.

Experimental task

At the beginning of each trial, the participant was instructed by an audio cue to produce one of the eight syllables “ma (tone 1), ma (tone 2), ma (tone 3), ma (tone 4), mi (tone 1), mi (tone 2), mi (tone 3), and mi (tone 4)” following the audio go cue. In each trial, the participant was instructed to repeat the tonal syllable three times. The eight syllables were repeated and presented in random order. Each participant performed 160 trials (20 trials per syllable × 8 syllables). Each syllable was required to be repeated three times in a single trial, which yielded 60 repetitions (20 trials × 3 repetitions per trial) per tonal syllable. Audio recordings were obtained synchronously with the ECoG recordings using a mounted microphone.

Data acquisition and signal processing

ECoG and audio sound were recorded simultaneously by the Tucker-Davis Technologies ECoG system. Two 128-channel electrode arrays were connected to an amplifier. ECoG signals were recorded at a sampling rate of 3052 Hz. The researcher visually and quantitatively inspected the channels for artifacts or excessive noise (typically 50-Hz line noise). The high-γ frequency component (70 to 150 Hz) was extracted via Hilbert transform and down-sampled to 200 Hz. Then, the data across different recording sessions were z-scored relative to the mean and SD of a 1-s resting-state baseline. Audio sound waves were recorded at a sampling rate of 24,414 Hz.
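
As a rough illustration of this preprocessing step, the sketch below band-passes each channel in the high-γ range, takes the Hilbert analytic amplitude, downsamples from 3052 to 200 Hz, and z-scores against a baseline segment. Only the band (70 to 150 Hz), the sampling rates, and the z-scoring against a 1-s resting-state baseline come from the description above; the filter order, resampling scheme, and function names are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, resample_poly

FS_RAW = 3052   # ECoG sampling rate (Hz)
FS_OUT = 200    # target rate after downsampling (Hz)

def highgamma_envelope(ecog, fs=FS_RAW, band=(70, 150)):
    """Band-pass each channel in the high-gamma range, take the Hilbert
    analytic amplitude, and downsample to FS_OUT Hz.
    `ecog` has shape (channels, samples)."""
    b, a = butter(4, band, btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, ecog, axis=-1)
    envelope = np.abs(hilbert(filtered, axis=-1))
    # 3052 Hz -> 200 Hz; the ratio 200/3052 reduces to 50/763
    return resample_poly(envelope, up=50, down=763, axis=-1)

def zscore_to_baseline(hg, baseline):
    """z-score each channel against a 1-s resting-state baseline segment
    of shape (channels, baseline_samples) from the same session."""
    mu = baseline.mean(axis=-1, keepdims=True)
    sd = baseline.std(axis=-1, keepdims=True)
    return (hg - mu) / sd
```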

Phonetic and phonological transcription

Transcriptions and labels, including consonants, vowels, and tones, were manually marked with Praat software (version 6.1.01, www.fon.hum.uva.nl/praat/) at the syllable level to ensure the label reflected the actual sound the participant produced according to the audio recording. All the sound waves and ECoG signals were cut into 1-s snippets, from −200 to 800 ms, relative to the onset of the consonants. An 80 × 44 Mel spectrogram was computed for each audio snippet.
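
A minimal sketch of the snippet cutting and Mel spectrogram computation described above, assuming the audio is a NumPy array at 24,414 Hz and the high-γ matrix is sampled at 200 Hz. The STFT settings (n_fft = 2048, hop = 555 samples, which gives 1 + 24414 // 555 = 44 frames per 1-s snippet) are illustrative choices, not the published parameters.

```python
import numpy as np
import librosa

SR_AUDIO = 24414          # audio sampling rate (Hz)
FS_ECOG = 200             # downsampled high-gamma rate (Hz)

def cut_snippets(audio, hg, onsets_s):
    """Cut 1-s snippets (-200 to +800 ms around each consonant onset)
    from the audio waveform and the high-gamma matrix (channels x samples)."""
    audio_snips, ecog_snips = [], []
    for t in onsets_s:
        a0 = int((t - 0.2) * SR_AUDIO)
        e0 = int((t - 0.2) * FS_ECOG)
        audio_snips.append(audio[a0:a0 + SR_AUDIO])   # 24414 samples
        ecog_snips.append(hg[:, e0:e0 + FS_ECOG])     # N x 200
    return np.stack(audio_snips), np.stack(ecog_snips)

def mel_80x44(snippet, sr=SR_AUDIO, hop=555):
    """80 x 44 log-Mel spectrogram of a 1-s audio snippet."""
    S = librosa.feature.melspectrogram(y=snippet, sr=sr, n_fft=2048,
                                       hop_length=hop, n_mels=80)
    return librosa.power_to_db(S, ref=np.max)         # shape (80, 44)
```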

Cortical surface extraction and electrode visualization

The electrodes on each individual's brain were marked with the BrainLab neuronavigation system (Brainlab AG, Munich, Germany) via preoperative T1 magnetic resonance imaging and validated against intraoperative photographs by a neurosurgeon. The cerebral surface was reconstructed using FreeSurfer, while anatomical labeling and plotting were performed via customized Python code (35).

Mel cepstral distortion

To examine the quality of the synthesized speech sound waves, we used MCD as the objective index. MCD is an acoustic quality measurement of error determined from Mel frequency cepstral coefficients (MFCCs) (36). MCD was calculated as follows (28)

\mathrm{MCD\,(dB)} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{24}\left[mc_d(\hat{y}) - mc_d(y)\right]^2}

In this equation, d corresponded to each MFCC dimension (0 < d < 25), mc_d denoted the d-th MFCC, \hat{y} was the synthesized speech, and y was the actual acoustic sound the participant produced.
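
The sketch below computes MCD from MFCCs with librosa following the formula above; the MFCC extraction settings and the crude frame alignment are assumptions, as the exact feature configuration is not specified here.

```python
import numpy as np
import librosa

def mcd_db(y_hat, y, sr=24414, n_mfcc=25):
    """Mean Mel cepstral distortion (dB) between synthesized speech y_hat
    and the reference recording y, over MFCC dimensions 1..24 (the 0th,
    energy-related coefficient is excluded)."""
    mc_hat = librosa.feature.mfcc(y=y_hat, sr=sr, n_mfcc=n_mfcc)[1:]  # (24, frames)
    mc_ref = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)[1:]
    n = min(mc_hat.shape[1], mc_ref.shape[1])                         # crude alignment
    diff = mc_hat[:, :n] - mc_ref[:, :n]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff**2, axis=0))
    return per_frame.mean()
```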

Tone IA

Listening tests using crowdsourcing are standard for evaluating synthesis quality in natural language processing. We randomized the order of the synthesized audio clips before presentation. The presented options included all syllables. The evaluators were instructed to listen to the audio and choose from the options which tonal syllable they had just heard. We recruited native Mandarin speakers located in mainland China as the evaluators. Tone IA was computed as follows

\mathrm{IA} = 1 - \frac{T}{A}

where T is the number of incorrect tone choices and A is the total number of tests.

Mean opinion score

In telecommunications, the MOS is a metric of the voice quality, judged on a scale of 1 to 5. The levels of the metric are 5 (complete relaxation possible; no effort required), 4 (attention necessary; no appreciable effort required), 3 (moderate effort required), 2 (considerable effort required), and 1 (no meaning understood with any feasible effort) (37, 38).

We recruited 31 independent participants to take part in the MOS test according to the following criteria: (i) Mandarin native speaker with a high school degree or above; (ii) without previous exposure to this research project before the recruitment; (iii) no conflict of interests with the research team; (iv) patients, patient’s family members, or hospital employees were excluded.

We randomized the 160 synthesized and 160 ground-truth sounds and then played them to the participants. All participants were required to listen to the sounds and write down the MOS for each audio snippet. The average score for each synthesized sound and each ground truth sound was computed in each participant.

Speech-responsive electrode selection

To define the speech-responsive electrodes, we first aligned the high-γ response to the onsets of acoustic activity. Onsets were defined as the time at the beginning of the consonant, preceded by more than 400 ms of silence. A speech responsive electrode met the following two criteria: (i) The average speech response after the consonant acoustic activity was significantly different from the responses at 0 ms, and (ii) the average response after speech onset (window from 100 ms before onset to 600 ms after onset, accounting for a neural delay of ~100 ms) was significantly different from the response before the beginning of the speech group (−300 to −200 ms) (both P < 0.01, t test, one sample/paired two-sided, Bonferroni corrected for the total number of electrodes).
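
A schematic implementation of these two criteria, assuming z-scored high-γ snippets of shape (trials, electrodes, 200) spanning −200 to 800 ms; the exact test variants (one sample versus paired) and window indexing are simplified relative to the published analysis.

```python
import numpy as np
from scipy import stats

FS = 200  # high-gamma sampling rate; snippets span -200..800 ms, so t = 0 is index 40

def speech_responsive(hg_trials, alpha=0.01):
    """Return a boolean mask of speech-responsive electrodes following the
    two criteria above, Bonferroni corrected over electrodes."""
    n_elec = hg_trials.shape[1]
    onset = 40                                                      # 0 ms
    post = hg_trials[:, :, onset - 20:onset + 120].mean(axis=-1)    # -100..600 ms window
    pre = hg_trials[:, :, onset - 60:onset - 40].mean(axis=-1)      # -300..-200 ms window
    at_onset = hg_trials[:, :, onset]                               # activity at 0 ms
    _, p1 = stats.ttest_rel(post, at_onset, axis=0)                 # criterion (i)
    _, p2 = stats.ttest_rel(post, pre, axis=0)                      # criterion (ii)
    thr = alpha / n_elec                                            # Bonferroni threshold
    return (p1 < thr) & (p2 < thr)
```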

Tone discriminative electrode selection

To define speech-responsive electrodes that were also discriminative between lexical tones, we first aligned the high-γ responses to the tone onsets. We then used the F test to test whether the mean high-γ responses of the four Mandarin tones differed significantly. The time window of the average response was −200 to 800 ms relative to the onset (200 total time points). The significant time points were determined with a two-sided P < 0.05 threshold using Bonferroni correction for the total number of electrodes and time points.
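
A sketch of this selection step; the syllable discriminative selection in the next subsection is analogous, with scipy.stats.ttest_ind between the two syllables in place of the F test. Array shapes and the label coding are assumptions.

```python
import numpy as np
from scipy import stats

def tone_discriminative(hg_trials, tone_labels, alpha=0.05):
    """hg_trials: (trials, electrodes, 200); tone_labels: (trials,) in {1,2,3,4}.
    An electrode is tone discriminative if a one-way F test across the four
    tones is significant at any time point after Bonferroni correction over
    electrodes x time points."""
    n_trials, n_elec, n_time = hg_trials.shape
    groups = [hg_trials[tone_labels == t] for t in (1, 2, 3, 4)]
    pvals = np.ones((n_elec, n_time))
    for e in range(n_elec):
        for t in range(n_time):
            _, pvals[e, t] = stats.f_oneway(*[g[:, e, t] for g in groups])
    thr = alpha / (n_elec * n_time)          # Bonferroni correction
    return (pvals < thr).any(axis=1)         # mask of tone discriminative electrodes
```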

Syllable discriminative electrode selection

Similar to the tone discriminative electrodes, the syllable discriminative electrodes were defined using a t test approach. We tested whether the mean high-γ responses of the two syllables were significantly different. The time window of the average response was −200 to 800 ms relative to the onset (200 total time points). The significant time points were determined with a two-sided P < 0.05 threshold using Bonferroni correction for the total number of electrodes and time points.

Tone decoder

The decoder mapped ECoG recordings to tone labels. Tone labels were transcribed using the five-level tone marks, which are applicable to most tonal languages. This mark reflects the contour of relative pitch height in different tones. We transcribed the marks into a 1D array with five elements to normalize the decoder’s output length. The first tone (flat tone) was transcribed as (5,5,5,5,5), the second tone (rising tone) was (3,3.5,4,4.5,5), the third tone (dip tone) was (2,1,2,3,4), and the fourth tone (falling tone) was (5,4,3,2,1).
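
For reference, the four contours described above can be stored directly as the tone decoder's regression targets (a minimal Python mapping; the variable name is ours):

```python
# Five-level tone-mark targets for the four Mandarin tones, as transcribed above.
# Each tone is a length-5 contour of relative pitch height (1 = low, 5 = high).
TONE_MARKS = {
    1: [5, 5, 5, 5, 5],        # tone 1: flat-high
    2: [3, 3.5, 4, 4.5, 5],    # tone 2: medium-rising
    3: [2, 1, 2, 3, 4],        # tone 3: low-dipping
    4: [5, 4, 3, 2, 1],        # tone 4: high-falling
}
```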

The neural network model was implemented using the Keras functional application programming interface on TensorFlow. The input was the aforementioned z-scored high-γ ECoG snippets with dimensions N × T, where N is the number of electrodes and T is the number of time points of the downsampled ECoG (200 Hz). In the first stage, a one-layer LSTM and a two-layer convolution and max pooling structure processed the ECoG sequence independently. The output of the LSTM was a 1 × 800 array and was reshaped to a 4 × 200 matrix. After concatenating the LSTM output and the two-layer convolution and max pooling output, the data were fed into the second stage to generate the tone label. A 50% dropout layer was added in this stage to avoid overfitting. All the kernel sizes of the max-pooling layers in the decoder were 1 × m, where m is the pool size in time points and 1 is the electrode dimension, ensuring that the maximization was taken over the same electrode. The stacked network was optimized to minimize the joint mean square error loss using the Nadam optimizer, with an initial learning rate of 0.0005, β1 = 0.9, β2 = 0.999, and ε = 1 × 10−8. Training of the models was stopped after the validation loss no longer decreased. We used an independent LSTM layer in the first stage to enhance the learning of weak differences between tones.
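
The following is a simplified Keras sketch of this parallel CNN-LSTM tone decoder. It reproduces the overall structure (an LSTM branch whose 800-dimensional output is reshaped to 4 × 200, a two-layer convolution and max pooling branch with (1, n) kernels, concatenation, 50% dropout, a 5-element output trained with mean square error, and the Nadam settings above), but the filter counts, kernel widths, and the dense second stage are illustrative assumptions rather than the published configuration.

```python
from tensorflow.keras import layers, Model, optimizers

N_ELEC, N_TIME = 16, 200     # tone-related electrodes x time points (illustrative)

def build_tone_decoder():
    """Simplified parallel CNN-LSTM tone decoder sketch."""
    inp = layers.Input(shape=(N_ELEC, N_TIME))

    # LSTM branch: treat time as the sequence axis (T x N after transpose)
    seq = layers.Permute((2, 1))(inp)
    lstm_out = layers.LSTM(800)(seq)                     # (800,)
    lstm_out = layers.Reshape((4, 200))(lstm_out)        # 4 x 200, as in the text
    lstm_feat = layers.Flatten()(lstm_out)

    # CNN branch: (1, n) kernels so convolutions stay within one electrode
    x = layers.Reshape((N_ELEC, N_TIME, 1))(inp)
    for _ in range(2):
        x = layers.Conv2D(16, kernel_size=(1, 5), padding="same",
                          activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    cnn_feat = layers.Flatten()(x)

    # Second stage: merge both branches and regress the five-level tone contour
    merged = layers.Concatenate()([lstm_feat, cnn_feat])
    merged = layers.Dropout(0.5)(merged)
    out = layers.Dense(5)(merged)                        # five-level tone marks

    model = Model(inp, out)
    model.compile(optimizer=optimizers.Nadam(learning_rate=5e-4, beta_1=0.9,
                                             beta_2=0.999, epsilon=1e-8),
                  loss="mse")
    return model
```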

Syllable decoder

The syllable decoder was a six-layer convolution and one-layer LSTM neural network. A 50% dropout layer was added in this stage to avoid overfitting. All the activation layers of the CNN in the decoder were leaky rectified linear units (ReLUs) (39). The whole network was trained to minimize the joint cross-entropy loss using the Nadam optimizer, with an initial learning rate set at 0.0005, β1 = 0.9, β2 = 0.999, and ε = 1 × 10−8. The optimization was stopped after the validation loss no longer decreased.
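
A correspondingly simplified sketch of the sequential CNN-LSTM syllable decoder, with six (1, n) convolutions, leaky ReLU activations, one LSTM layer, 50% dropout, and a softmax output; layer widths and the output dimensionality are illustrative assumptions.

```python
from tensorflow.keras import layers, Model, optimizers

N_ELEC, N_TIME, N_SYLLABLES = 16, 200, 2   # illustrative sizes (two base syllables here)

def build_syllable_decoder():
    """Simplified sequential CNN-LSTM syllable decoder sketch."""
    inp = layers.Input(shape=(N_ELEC, N_TIME, 1))
    x = inp
    for _ in range(6):
        x = layers.Conv2D(16, kernel_size=(1, 3), padding="same")(x)
        x = layers.LeakyReLU(0.01)(x)
    x = layers.Permute((2, 1, 3))(x)                     # put the time axis first
    x = layers.Reshape((N_TIME, N_ELEC * 16))(x)
    x = layers.LSTM(64)(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(N_SYLLABLES, activation="softmax")(x)

    model = Model(inp, out)
    model.compile(optimizer=optimizers.Nadam(learning_rate=5e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```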

Mel spectrogram synthesis model

The synthesis network combined two independent streams of inputs and generated a synthesized Mel spectrogram. The first input stream was a combination of the outputs of the tone decoder and the syllable decoder, the so-called tonal syllable labels. The labels were regarded as a sequence and processed by an LSTM layer (Fig. 2D, green rectangle). The second input stream took in all the other speech-responsive electrodes without tone or syllable discriminability. These electrodes were supposed to represent the articulation features, excluding the tone and syllable information. The second input stream was processed by a five-layer convolution and max pooling structure. A 50% dropout layer was added after the network to avoid overfitting. We then concatenated the LSTM output and the network output and fed the result into a five-layer CNN to generate the Mel spectrogram. All the activation functions of the CNN were leaky ReLUs with α = 0.01. The network was optimized to minimize the mean absolute error loss with L2 normalization using the Nadam optimizer, with an initial learning rate of 0.001, β1 = 0.9, β2 = 0.999, and ε = 1 × 10−8. Training of the models was stopped after the validation loss no longer decreased.
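
A simplified sketch of this two-stream synthesizer is given below. The encoding of the tonal syllable labels (here, a five-step sequence carrying the tone contour plus a compact syllable code), the layer widths, and the way the merged features are mapped back to an 80 × 44 map are assumptions; the stream layout, leaky ReLU activations, dropout, MAE loss with L2 regularization, and Nadam settings follow the description above.

```python
from tensorflow.keras import layers, Model, optimizers, regularizers

N_ND, N_TIME = 32, 200         # nondiscriminative electrodes x time points (illustrative)
LABEL_STEPS, LABEL_DIM = 5, 3  # assumed label encoding: tone level + 2-d syllable code

def build_synthesizer():
    """Simplified two-stream Mel spectrogram synthesizer sketch."""
    label_in = layers.Input(shape=(LABEL_STEPS, LABEL_DIM))
    ecog_in = layers.Input(shape=(N_ND, N_TIME, 1))

    # Stream 1: tonal syllable labels read as a short sequence
    lab = layers.LSTM(64)(label_in)

    # Stream 2: nondiscriminative electrodes, (1, n) kernels, five conv blocks
    x = ecog_in
    for _ in range(5):
        x = layers.Conv2D(16, kernel_size=(1, 3), padding="same")(x)
        x = layers.LeakyReLU(0.01)(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = layers.Dropout(0.5)(layers.Flatten()(x))

    # Merge and decode into an 80 x 44 Mel spectrogram with a five-layer CNN head
    z = layers.Concatenate()([lab, x])
    z = layers.Dense(80 * 44, kernel_regularizer=regularizers.l2(1e-4))(z)
    z = layers.Reshape((80, 44, 1))(z)
    for _ in range(4):
        z = layers.Conv2D(16, kernel_size=(3, 3), padding="same")(z)
        z = layers.LeakyReLU(0.01)(z)
    out = layers.Conv2D(1, kernel_size=(3, 3), padding="same")(z)
    out = layers.Reshape((80, 44))(out)

    model = Model([label_in, ecog_in], out)
    model.compile(optimizer=optimizers.Nadam(learning_rate=1e-3, beta_1=0.9,
                                             beta_2=0.999, epsilon=1e-8),
                  loss="mae")
    return model
```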

Vocoder

The vocoder synthesized a 24,414-Hz sound wave from the 80 × 44 Mel spectrogram using the Griffin-Lim algorithm (40) implemented in the Python librosa package (41).
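
A minimal sketch of this inversion step using librosa's Griffin-Lim-based Mel inversion; the STFT parameters are illustrative and must match those used to compute the Mel spectrogram.

```python
import librosa

SR = 24414   # output sampling rate (Hz)

def mel_to_wave(mel_db, sr=SR, n_fft=2048, hop=555, n_iter=60):
    """Invert an 80 x 44 log-Mel spectrogram to a waveform via Griffin-Lim."""
    mel_power = librosa.db_to_power(mel_db)
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr, n_fft=n_fft,
                                                hop_length=hop, n_iter=n_iter)
```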

Acknowledgments

Funding: This work was supported by STI 2030-Major Projects (2022ZD0212300), Shanghai Pujiang Program (21PJD007 and 22PJ1410500), Shanghai Rising-Star Program (19QA1401700), Innovation Program of Shanghai Municipal Education Commission (2023ZKZD13), and Shanghai Municipal Science and Technology Major Project (2018SHZDZX01).

Author contributions: Conceptualization: J.L., Y. Li, and J.W. Methodology: J.L., Y. Li, Z.Z., M.X., and Y. Liu. Data collection: J.L., Y. Liu, Y.Z., J.Z., L.B., X.Z. and Z.Z. Data analyzed: Y. Liu, Z.Z., H.Y., and M.X. Supervision: J.W., D.M., J.L. and Y.Li. Writing—original draft: Y. Liu, Z.Z., Y. Li, and J.L. Writing—review and editing: Y. Liu, Z.Z., M.X., H.Y., Y.Z., J.Z., L.B., X.Z., J.L., Y. Li, D.M., and J.W.

Competing interests: The authors declare that they have no competing interests.

Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. The source data to recreate the manuscript figures, code of deep learning model, and training data are provided with this publication in the Science Data Bank: https://doi.org/10.57760/sciencedb.07999. According to the requirements of China’s Administration of Human Genetic Resources and Huashan Hospital Institutional Review Board of Fudan University, all the applicants should provide their name, institute information, and purpose to J.W. (wujinsong@huashan.org.cn) for registration to use the data in further research.

Supplementary Materials

This PDF file includes:

Fig. S1

Legend for audio S1

Other Supplementary Material for this manuscript includes the following:

Audio S1


REFERENCES AND NOTES

1. M. Yip, Tone (Cambridge Univ. Press, 2002).
2. M. S. Dryer, M. Haspelmath, The World Atlas of Language Structures Online (2013); http://wals.info [accessed 21 September 2021].
3. J. D. McCawley, What is a tone language? in Tone (Elsevier, 1978), pp. 113–131.
4. Y.-R. Chao, A System of Tone Letters (Le Maître Phonétique, 1930).
5. Anumanchipalli G. K., Chartier J., Chang E. F., Speech synthesis from neural decoding of spoken sentences. Nature 568, 493–498 (2019).
6. Moses D. A., Leonard M. K., Makin J. G., Chang E. F., Real-time decoding of question-and-answer speech dialogue using human cortical activity. Nat. Commun. 10, 3096 (2019).
7. Makin J. G., Moses D. A., Chang E. F., Machine translation of cortical activity to text with an encoder-decoder framework. Nat. Neurosci. 23, 575–582 (2020).
8. Moses D. A., Metzger S. L., Liu J. R., Anumanchipalli G. K., Makin J. G., Sun P. F., Chartier J., Dougherty M. E., Liu P. M., Abrams G. M., Tu-Chan A., Ganguly K., Chang E. F., Neuroprosthesis for decoding speech in a paralyzed person with anarthria. N. Engl. J. Med. 385, 217–227 (2021).
9. Angrick M., Herff C., Mugler E., Tate M. C., Slutzky M. W., Krusienski D. J., Schultz T., Speech synthesis from ECoG using densely connected 3D convolutional neural networks. J. Neural Eng. 16, 036019 (2019).
10. S. Komeiji, K. Shigemi, T. Mitsuhashi, Y. Iimura, H. Suzuki, H. Sugano, K. Shinoda, T. Tanaka, Synthesizing speech from ECoG with a combination of transformer-based encoder and neural vocoder, in Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2023), pp. 1311–1315.
11. Fager S. K., Fried-Oken M., Jakobs T., Beukelman D. R., New and emerging access technologies for adults with complex communication needs and severe motor impairments: State of the science. Augment. Altern. Commun. 35, 13–25 (2019).
12. Bouchard K. E., Mesgarani N., Johnson K., Chang E. F., Functional organization of human sensorimotor cortex for speech articulation. Nature 495, 327–332 (2013).
13. Conant D. F., Bouchard K. E., Leonard M. K., Chang E. F., Human sensorimotor cortex control of directly measured vocal tract movements during vowel production. J. Neurosci. 38, 2955–2966 (2018).
14. Chartier J., Anumanchipalli G. K., Johnson K., Chang E. F., Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron 98, 1042–1054.e4 (2018).
15. Dichter B. K., Breshears J. D., Leonard M. K., Chang E. F., The control of vocal pitch in human laryngeal motor cortex. Cell 174, 21–31.e9 (2018).
16. Ge J., Gao J.-H., A review of functional MRI application for brain research of Chinese language processing. Magn. Reson. Lett. 3, 1–3 (2023).
17. Ge J., Peng G., Lyu B., Wang Y., Zhuo Y., Niu Z., Tan L. H., Leff A. P., Gao J.-H., Cross-language differences in the brain network subserving intelligible speech. Proc. Natl. Acad. Sci. U.S.A. 112, 2972–2977 (2015).
18. J. Tao, F. Zheng, A. Li, Y. Li, in 2009 Oriental COCOSDA International Conference on Speech Database and Assessments (IEEE, 2009), pp. 13–18.
19. R. Levy, C. D. Manning, Is it harder to parse Chinese, or the Chinese Treebank? in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, 2003), pp. 439–446.
20. P.-C. Chang, H. Tseng, D. Jurafsky, C. D. Manning, Discriminative reordering with Chinese grammatical relations features, in Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009, Boulder, CO, June 2009, pp. 51–59.
21. A. Michaud, in 8th International Seminar on Speech Production (ISSP'08), Strasbourg, France, 8 to 12 December 2008, pp. 13–18.
22. Livezey J. A., Bouchard K. E., Chang E. F., Deep learning as a tool for neural data analysis: Speech classification and cross-frequency coupling in human sensorimotor cortex. PLOS Comput. Biol. 15, e1007091 (2019).
23. Crone N., Hao L., Hart J., Boatman D., Lesser R., Irizarry R., Gordon B., Electrocorticographic gamma activity during word production in spoken and sign language. Neurology 57, 2045–2053 (2001).
24. Ray S., Maunsell J. H., Different origins of gamma rhythm and high-gamma activity in macaque visual cortex. PLOS Biol. 9, e1000610 (2011).
25. G. Zhang, L. Yu, C. Wang, J. Wei, in ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2022), pp. 9122–9126.
26. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs.CV] (4 September 2014).
27. M. Xu, Z. Yin, M. Wu, Z. Wu, Y. Zhao, Z. Gao, in 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring) (IEEE, 2020), pp. 1–5.
28. C. S. Group (2012); http://festvox.org/11752/slides/lecture11a.pdf.
29. Hill N. J., Gupta D., Brunner P., Gunduz A., Adamo M. A., Ritaccio A., Schalk G., Recording human electrocorticographic (ECoG) signals for neuroscientific research and real-time functional cortical mapping. J. Vis. Exp., e3993 (2012).
30. Hyman L. M., Lexical vs. grammatical tone: Sorting out the differences. Tonal Aspects Lang. 2016, 6–11 (2016).
31. Kuang J., The tonal space of contrastive five level tones. Phonetica 70, 1–23 (2013).
32. Brumberg J. S., Krusienski D. J., Chakrabarti S., Gunduz A., Brunner P., Ritaccio A. L., Schalk G., Spatio-temporal progression of cortical activity related to continuous overt and covert speech production in a reading task. PLOS ONE 11, e0166872 (2016).
33. Martin S., Brunner P., Holdgraf C., Heinze H.-J., Crone N. E., Rieger J., Schalk G., Knight R. T., Pasley B. N., Decoding spectrotemporal features of overt and covert speech from the human cortex. Front. Neuroeng. 7, 14 (2014).
34. Li Y., Tang C., Lu J., Wu J., Chang E. F., Human cortical encoding of pitch in tonal and non-tonal languages. Nat. Commun. 12, 1161 (2021).
35. Hamilton L. S., Chang D. L., Lee M. B., Chang E. F., Semi-automated anatomical labeling and inter-subject warping of high-density intracranial recording electrodes in electrocorticography. Front. Neuroinform. 11, 62 (2017).
36. R. Kubichek, in Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (IEEE, 1993), vol. 1, pp. 125–128.
37. ITU-T Recommendation P.800.1, "Mean opinion score (MOS) terminology" (International Telecommunication Union, 2006).
38. ITU-T Recommendation P.800, "Methods for subjective determination of transmission quality" (International Telecommunication Union, 1996).
39. A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning (Atlanta, GA, 2013), vol. 30, p. 3.
40. Griffin D., Lim J., Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 32, 236–243 (1984).
41. B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: Audio and music signal analysis in Python, in Proceedings of the 14th Python in Science Conference (SciPy, 2015), vol. 8, pp. 18–25.
