PeerJ Comput Sci. 2025 Jun 23;11:e2938. doi: 10.7717/peerj-cs.2938

Systematic review: progress in EEG-based speech imagery brain-computer interface decoding and encoding research

Ke Su 1, Liang Tian 2
Editor: Davide Chicco
PMCID: PMC12453656  PMID: 40989443

Abstract

This article systematically reviews the latest developments in electroencephalogram (EEG)-based speech imagery brain-computer interface (SI-BCI). It explores brain connectivity in SI-BCI and reveals its key role in neural encoding and decoding. It analyzes research progress on speech imagery paradigms based on vowels and vowel-consonant combinations, as well as Chinese characters, words, and long words. In the neural encoding section, the preprocessing and feature extraction techniques for EEG signals are discussed in detail. The neural decoding section offers an in-depth analysis of the applications and performance of machine learning and deep learning algorithms. Finally, the challenges faced by current research are summarized, and future directions are outlined. The review highlights that future research should focus on brain region mechanisms, paradigm innovation, and the optimization of decoding algorithms to promote the practical application of SI-BCI technology.

Keywords: EEG, Speech imagery, Brain computer interface, Connectivity of brain regions, Neural decoding algorithms, Machine learning, Deep learning, Speech imagery brain-computer interface paradigms, Neural encoding technology

Introduction

A brain-computer interface (BCI) refers to a technology that establishes a direct connection between the human brain and external devices. By detecting and decoding brain activity, human intentions can be transmitted directly to these devices, enabling communication between the brain and external equipment (Wolpaw, Millan & Ramsey, 2020). Based on the need for external stimulation, BCIs are primarily categorized into evoked neurocognitive paradigms and spontaneous neurocognitive paradigms. This study focuses on one type of spontaneous psychological paradigm—speech imagery. Speech imagery refers to mentally articulating words without actual vocalization or facial movement (Liu et al., 2022). This phenomenon engages neural mechanisms related to cognition, memory, learning, and thought processes. The integration of these neural mechanisms with BCI technology forms a speech imagery-based BCI, which detects neural signals associated with speech processes through external recording devices and decodes the brain’s speech intentions (which have been encoded into neural signals). These decoded signals can then be used to generate commands for controlling external devices, thus facilitating interaction with external systems (Lee, Lee & Lee, 2020; Pawar & Dhage, 2022; Qureshi et al., 2017).

In recent years, speech imagery-based brain-computer interface (SI-BCI) has emerged as a prominent research focus, with numerous researchers conducting extensive studies from various perspectives. Generally, these studies can be categorized into four main areas: the neural mechanisms of brain regions during speech imagery, the design of speech imagery paradigms, speech imagery encoding techniques, and speech imagery decoding techniques. Among these, the neural mechanisms of brain regions constitute the physiological foundation for speech imagery. The speech imagery paradigm is an experimental task designed based on these neural mechanisms, while speech imagery encoding and decoding refer to technological processes that facilitate the translation of human intentions into commands within this paradigm.

Currently, research on the neural mechanisms of brain regions primarily focuses on analyzing the overlap in activation patterns across different brain regions between actual speech and speech imagery states using various brain signal acquisition techniques. Lu et al. (2023) employed functional magnetic resonance imaging (fMRI) to compare the neural mechanisms of actual speech perception and speech imagery. They found that both states shared activation in the bilateral superior temporal gyrus (STG) and supplementary motor area (SMA). However, speech imagery specifically enhanced activity in the left inferior frontal gyrus (Broca’s area), revealing the motor-simulation-dependent neural basis of internal language generation. Beyeler et al. (2016) applied optogenetic labeling combined with electrophysiological recording techniques to dissect the encoding differences in neuronal populations projecting from the basolateral amygdala (BLA) to the nucleus accumbens (NAc), central amygdala (CeA), and ventral hippocampus (vHPC) during the retrieval of positive and negative valence memories. They discovered that BLA-NAc projections preferentially encode positive valence cues, BLA-CeA projections favor negative valence cues, and BLA-vHPC projections exhibit balanced responses to both valences, with more pronounced activation during speech imagery. These findings offer foundational scientific insights for future research on neural decoding of speech imagery. Peña Serrano, Jaimes-Reátegui & Pisarchik (2024) constructed an event-related coherence-based magnetoencephalography (MEG) hypergraph network analysis technique to reveal the dynamic patterns of functional connectivity across different brain regions (frontal, parietal, temporal, and occipital lobes) at various frequency bands (delta, theta, alpha, beta, and gamma) during different perceptual tasks. Their study found differences in activation in the left inferior frontal region between spontaneous and evoked speech imagery, with task-evoked speech imagery significantly enhancing activation in this region, while spontaneous speech imagery exhibited relatively reduced activation (Hurlburt et al., 2016). Rekrut, Selim & Krüger (2022) designed a novel training method that successfully applied transfer learning to train a silent speech classifier on overt speech EEG data and subsequently applied it to covert speech data, demonstrating the feasibility of transfer learning in silent speech BCIs. From these prior studies, it is evident that while there are connections between speech imagery and actual speech, applying conclusions derived from actual speech directly to speech imagery remains a subject of debate. The activation states of brain regions during the processing of different types of speech vary, necessitating more detailed regional brain analysis.

Research trends in speech imagery paradigm design have evolved from an initial focus on vowels and vowel-consonant combinations to more complex units such as words, Chinese characters, and potentially short sentences. For instance, DaSalla et al. (2009) and Idrees & Farooq (2016a) developed paradigms centered around the vowels /a/ and /u/. Meanwhile, Wang et al. (2021) and Huang et al. (2022) have conducted extensive, long-term studies on paradigms involving Chinese characters such as “左(left)”, “移(move)”, “壹(one)”, “右(right)”. Liu et al. (2022) offered a comprehensive review of experimental paradigms, providing in-depth analyses of various design strategies. However, their discussion of signal decoding techniques lacked depth and comprehensive analysis.

In the field of speech imagery encoding, research has primarily focused on artifact removal during signal acquisition and algorithmic enhancements in feature extraction. Chen & Pan (2020) investigated the application of speech imagery in brain–computer interaction, emphasizing innovations in signal acquisition and processing methods. Schultz et al. (2017) described various physiological signals involved in speech production and their recording techniques, offering valuable insights into the biological basis of speech. Cooney, Folli & Coyle (2018) focused on the physiological foundations of speech, providing a detailed analysis of speech production processes. However, these studies primarily addressed physiological mechanisms and signal acquisition and encoding, while lacking systematic reviews of signal decoding methods and experimental paradigms. Alzahrani, Banjar & Mirza (2024) focused on classifying command-based vocabulary (e.g., “up,” “down,” “left,” “right”) for BCIs, examining the effects of feature extraction techniques—including deep learning, adaptive optimization, and frequency-specific decomposition—on classification performance. They compared traditional machine learning techniques with deep learning approaches and discussed the influence of brain lateralization in imagined speech. Lopez-Bernal et al. (2022) reviewed relevant literature published since 2009, emphasizing electroencephalogram (EEG) signal preprocessing, feature extraction, and classification techniques. These two studies offer more comprehensive comparisons between traditional machine learning methods and deep learning approaches.

Speech imagery decoding has long been a central focus in the development of SI-BCI systems. Recent trends in this field emphasize the use of machine learning and deep learning algorithms to enhance decoding accuracy. Rahman et al. (2024) conducted a comprehensive comparison of classification algorithms, including support vector machines (SVM), random forests (RF), and various deep learning techniques. Zhang et al. (2024) analyzed multiple deep learning frameworks and, in contrast to earlier work, provided an in-depth examination of the EEG datasets employed in various studies. These efforts underscore ongoing progress in utilizing advanced computational methods to improve the precision and reliability of speech imagery decoding in SI-BCI systems.

Despite existing reviews covering the aforementioned four areas, a significant gap persists—namely, the absence of a comprehensive and systematic analysis of recent advancements in BCI paradigm definition and classification, SI-BCI neural encoding, and neural decoding, all grounded in the neural mechanisms of speech imagery. Panachakel & Ramakrishnan (2021) published a review summarizing various methods for decoding imagined speech from EEG signals developed over the preceding decade. Their review covered key aspects such as data acquisition, signal processing, feature extraction, and classification methods, while also offering preliminary insights into the relationship between speech imagery and brain regions. However, with the rapid evolution of technology and theory, their review no longer reflects the current state of the field. In addition, we have assessed the advantages and limitations of various signal acquisition methods—including EEG, MEG, functional near-infrared spectroscopy (fNIRS), and fMRI—based on practicality, research volume, cost, and generalizability. We conclude that EEG is currently the most suitable approach and have accordingly identified EEG signal acquisition as a key prerequisite for this study (Panachakel & Ramakrishnan, 2021; Kaongoen et al., 2023).

This review aims to explore recent advancements in EEG-based SI-BCI encoding and decoding by providing a systematic and comprehensive overview. It focuses on dissecting SI-BCI paradigms, neural encoding, and neural decoding, with particular attention to existing challenges, emerging trends, and future prospects (the data collection process is shown in Fig. 1). First, brain region connectivity is discussed as the physiological foundation of SI-BCI systems. Next, existing SI-BCI paradigms are analyzed based on the characteristics of brain region connectivity, followed by a detailed exposition of relevant neural encoding and decoding methods. Finally, key challenges and potential developmental directions are summarized. The key research contributions of this review are outlined as follows.

Figure 1. PRISMA flowchart of the systematic review process.


1. To examine brain region connectivity in SI-BCI, establishing a physiological foundation for the development of paradigms, neural encoding, and decoding techniques.

2. To clarify the relationship between SI-BCI paradigms and neural encoding/decoding, and to elaborate on recent advancements across different paradigms.

3. To explain neural encoding in SI-BCI, including EEG preprocessing and feature extraction techniques.

4. To present a detailed discussion on neural decoding, including available datasets, participant numbers, applied machine learning and deep learning algorithms, and a comprehensive model evaluation.

5. To identify challenges in SI-BCI—including those related to signal acquisition quality, brain region connectivity, and neural encoding—while highlighting future directions such as word and sentence-level decoding and paradigm integration.

The structure of this review is outlined as follows (Fig. 2). “Introduction” provides the background of the problem, outlines the motivation for the study, and emphasizes the distinctive features of this article. “Survey Methodology” details the literature search strategy and specifies the inclusion criteria. “Connectivity of Brain Regions in SI-BCI” systematically explores brain region connectivity in SI-BCI (He et al., 2019). “SI-BCI Paradigm” defines SI-BCI paradigms and discusses various types of paradigm structures. “SI-BCI Neural Encoding” analyzes neural encoding techniques, focusing on EEG signal preprocessing and feature extraction. “SI-BCI Neural Decoding” offers a comprehensive discussion of neural decoding techniques (Kristensen, Subhi & Puthusserypady, 2020), including datasets, participant numbers, machine learning (ML) and deep learning (DL) algorithms, and a model performance evaluation table. “Challenges and Prospects” addresses current challenges in the field and explores potential future research directions, and “Conclusions” closes the review.

Figure 2. Research Framework.


Survey methodology

This review followed PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), a protocol for conducting systematic reviews and meta-analyses (Page et al., 2021). Using PRISMA to identify and screen relevant studies significantly streamlined the data collection process for this review.

The search was completed on March 8, 2025. We conducted queries using the following keywords on Web of Science, Scopus, IEEE Xplore, PubMed, ScienceDirect, and Google Scholar to gather relevant literature: (“brain-computer interface” AND (“speech imagery” OR “covert imagery” OR “inner speech” OR “imagined speech”) AND (“deep learning” OR “machine learning” OR “DL” OR “ML”) AND (“electroencephalogram” OR “EEG”)). These keywords were sufficient to retrieve all pertinent studies.

After retrieving the search results, we removed duplicate articles across databases, applied the following criteria to filter out unsuitable studies from the remaining articles, and then thoroughly read the full texts of the studies that remained.

Studies were excluded based on the following criteria:

(1) EEG-only studies: Studies using multimodal datasets were excluded.

(2) Language: Articles not written in English were excluded.

(3) Task type: Neural encoding and decoding tasks not exclusively based on EEG speech imagery were excluded.

(4) Time frame: Studies published before 2015 were excluded.

Ultimately, 125 highly relevant articles were selected for this literature review analysis. The distribution of these articles across brain connectivity, paradigms, neural encoding, and neural decoding algorithms (machine learning and deep learning) is illustrated in Fig. 3.

Figure 3. Composition of the article.


Connectivity of brain regions in SI-BCI

Speech imagery, as a mental activity, involves the coordinated activity of multiple brain regions. These regions are closely associated with language production and comprehension, particularly Broca’s area and Wernicke’s area (Jäncke, Liem & Merillat, 2021; Ono et al., 2022). Brain region connectivity describes the information exchange and interactions between these regions, playing a key role in both neural encoding and decoding processes. In neural encoding, connectivity reflects the transmission and collaboration of information between brain regions, supporting the encoding and integration of complex information. In neural decoding, the use of connectivity features can improve decoding performance and provide deeper insights into brain function mechanisms. In the context of SI-BCI, brain region connectivity is primarily classified into functional connectivity and effective connectivity.

Research on brain functional connectivity in speech imagery tasks is continuously advancing to reveal the collaborative workings of different brain regions during specific tasks or resting states. Sandhya et al. (2015) observed bilateral interactions between the frontal and temporal regions of the brain during speech imagery by calculating the coherence of EEG signals and applying a multivariate autoregressive (MVAR) model. Notably, high inter-electrode coherence was observed in the left frontal lobe during language production and in the left temporal lobe during speech imagery, highlighting the proximity of these regions to Broca’s area and Wernicke’s area. This provides a multidimensional perspective on understanding brain functional connectivity in speech imagery, laying the foundation for subsequent research. With continued research, Chengaiyan & Anandhan (2015), Chengaiyan et al. (2018) further analyzed correlation parameters of EEG signals and found that a single parameter is insufficient for comprehensively understanding brain region connectivity. To address this, they calculated connectivity parameters such as EEG coherence, partially directed coherence, and directed transfer function, examining concurrent and directional connectivity patterns across different brain regions during speech production and speech imagery. Their findings indicated active engagement of the left frontal lobe during speech production and the left temporal lobe during silent word imagery, aligning with the roles of Broca’s and Wernicke’s areas and corroborating the results of Sandhya et al. (2015). Additionally, phase locking value (PLV) calculations revealed the strongest phase synchronization between the left frontal and left temporal lobes in the alpha and theta bands, suggesting a close collaboration between these areas during speech/speech imagery. Bisla & Anand (2023) conducted a connectivity analysis of speech imagery using the Kara One dataset, identifying key neurophysiological dynamics in this paradigm. Notably, this was the first detailed exploration of neural connectivity related to speech imagery, highlighting distinct connectivity patterns through phase-based metrics and simplifying future data processing approaches.
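For illustration, the phase locking value referenced above can be computed from two band-pass filtered channels via the Hilbert transform. The sketch below is a minimal NumPy/SciPy implementation on synthetic “frontal” and “temporal” signals; the channel names, sampling rate, and band are illustrative assumptions, not taken from the cited studies.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_plv(x, y, fs, band=(8.0, 13.0), order=4):
    """Phase locking value between two channels within a frequency band."""
    nyq = fs / 2.0
    b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype="band")
    xf, yf = filtfilt(b, a, x), filtfilt(b, a, y)
    # Instantaneous phase difference via the Hilbert transform
    dphi = np.angle(hilbert(xf)) - np.angle(hilbert(yf))
    return np.abs(np.mean(np.exp(1j * dphi)))

# Synthetic "frontal" and "temporal" channels, 5 s at 250 Hz (illustrative only)
fs = 250.0
t = np.arange(0, 5, 1 / fs)
frontal = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
temporal = np.sin(2 * np.pi * 10 * t + 0.3) + 0.5 * np.random.randn(t.size)
print(band_plv(frontal, temporal, fs))  # near 1 for strongly phase-locked signals
```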

To improve the accuracy of speech imagery recognition, Chengaiyan & Anandan (2022) innovatively combined brain connectivity metrics with machine learning methods, applying functional and effective connectivity analysis to a vowel recognition task, achieving over 80% recognition accuracy. This further demonstrates the value of brain connectivity analysis in understanding the mechanisms of speech cognition. However, traditional machine learning has limitations in studying brain region connectivity in speech imagery BCI research. It relies on manually designed feature extraction, which is time-consuming and susceptible to subjective influence. Traditional models also struggle to capture complex nonlinear relationships in data, especially when simulating intricate interactions among brain regions. Additionally, these models have limited capabilities with large-scale datasets and real-time applications, with weaker generalization. In contrast, deep learning addresses these limitations. It can automatically learn and extract features from raw data, reducing the need for manual feature engineering. Deep learning excels at capturing nonlinear patterns in data, enhancing prediction accuracy and providing strong generalization capabilities. Park, Yeom & Sim (2021) were the first to introduce deep learning into the field of brain connectivity, using mutual information as a measure of brain connectivity and applying convolutional neural networks (CNNs) for user state recognition tasks based on brain connectivity. By enabling CNNs to automatically learn connectivity features, they achieved adaptive control in BCI systems, offering a novel approach to developing adaptive BCI systems.
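As a minimal sketch of the mutual-information connectivity idea in Park, Yeom & Sim (2021), the following code estimates a pairwise channel-by-channel mutual-information matrix with scikit-learn; the CNN that would consume these matrices is omitted, and all shapes and parameters are illustrative assumptions rather than the published configuration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_connectivity(eeg, n_neighbors=3):
    """Pairwise mutual-information connectivity matrix.

    eeg: array of shape (n_channels, n_samples).
    Returns a symmetric (n_channels, n_channels) matrix.
    """
    n_ch = eeg.shape[0]
    conn = np.zeros((n_ch, n_ch))
    for i in range(n_ch):
        for j in range(i + 1, n_ch):
            mi = mutual_info_regression(
                eeg[i].reshape(-1, 1), eeg[j], n_neighbors=n_neighbors
            )[0]
            conn[i, j] = conn[j, i] = mi
    return conn

# One connectivity "image" per trial could then be stacked and fed to a CNN,
# in the spirit of Park, Yeom & Sim (2021); the network itself is omitted here.
conn = mi_connectivity(np.random.randn(16, 1000))
```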

Effective connectivity is used to analyze the direct impact or causal relationships between activities in different brain regions, helping researchers understand the direction and dynamic processes of information flow in the brain (Belaoucha & Papadopoulo, 2020). Research primarily focuses on the design of brain connectivity estimators. Sandhya et al. (2015), using brain connectivity estimators, analyzed causal correlations and found that the frontal and temporal regions of the left hemisphere were more active during specific speech imagery tasks. This represents significant progress in understanding neural interactions during thought and expression processes. Panachakel & Ramakrishnan (2021) used the mean phase coherence (MPC) as an indicator of cortical region phase synchrony and found that β-band MPC values differed significantly between nasal and bilabial consonant imagery. This suggests that different speech imagery types lead to differences in effective connectivity between cortical regions. Chengaiyan, Retnapandian & Anandan (2020), Ahn et al. (2022) applied deep learning to analyze effective connectivity in various brain regions during speech production and speech imagery, introducing an attention-based dual-modal fusion mechanism, which provides an innovative EEG signal analysis method for the fields of BCI and cognitive computing.

The aforementioned studies demonstrate that research on the neural mechanisms of speech imagery tasks focuses on the analysis of functional and effective connectivity between brain regions. This is primarily achieved through coherence, phase synchrony, and causal connectivity parameters derived from EEG signals, which reveal patterns of neural collaboration. In terms of technical methodologies, there is a clear shift from traditional machine learning approaches to deep learning techniques. These methodological advancements provide innovative frameworks for BCIs and cognitive computing, enabling a comprehensive understanding of the neural mechanisms underlying speech imagery. This progress lays a solid foundation for research on the SI-BCI paradigm.

SI-BCI paradigm

The SI-BCI paradigm refers to the experimental design or task structure used in SI-BCI systems to elicit specific speech imagery-related brain signal patterns. It determines how users interact with the SI-BCI system and the types of signals generated by the brain. In SI-BCI systems, the BCI paradigm, neural encoding, and neural decoding are the critical research elements. It is important to note that without a BCI paradigm, the corresponding neural encoding cannot be achieved in an SI-BCI system; without an appropriate BCI paradigm and neural encoding, high-performance neural decoding becomes challenging. Similarly, without effective neural decoding, the applicability of a previously designed BCI paradigm cannot be validated (Tai et al., 2024). Specifically, neural encoding refers to the process by which a user’s different intentions are “written” or encoded into central nervous system signals under the SI-BCI paradigm. Neural decoding, on the other hand, is the process of extracting user intentions from neural signals in the SI-BCI system. It relies on the principles of neural encoding and identifies user intentions by analyzing and processing features within the neural signals. The SI-BCI paradigm not only influences neural encoding but also directly impacts the performance of neural decoding. Figure 4 illustrates the relationships between brain region connectivity, the SI-BCI paradigm, neural encoding, and neural decoding in an SI-BCI system.

Figure 4. The relationship between brain connectivity, SI-BCI paradigm, neural encoding, and neural decoding.


In an SI-BCI system, SI-BCI paradigms based on the type of imagined material can be divided into two experimental task paradigms:

1. Vowel or vowel-consonant combination: This is a basic binary classification task, focusing primarily on simple vowel imagery or combining consonants with vowels, such as through the consonant-vowel-consonant (CVC) structure in speech tasks. These experiments help explore the brain’s neural encoding of fundamental speech units.

2. Words, Chinese characters/long words: Compared to vowels, words or longer speech segments like Chinese characters contain richer information and engage more complex neural networks. This paradigm also mainly uses binary classification, but with greater information complexity, making it suitable for more in-depth research on speech processing.

Figure 5 provides a detailed illustration of these two types of speech imagery BCI experimental paradigms and their characteristics, offering multi-level task options and research directions for the study of speech imagery BCI systems.

Figure 5. The paradigm process of BCI in speech imagery.


Vowel/vowel-consonant combination

In speech imagery-based BCI research, vowels are frequently used as core materials due to their simplicity in articulation, stability in acoustic characteristics, and ease of perception and measurement. Researchers have extensively explored the potential of EEG signals in decoding vowel information through diverse methods, progressively advancing BCI technology. DaSalla et al.’s (2009) pioneering work laid a solid foundation for vowel imagery recognition. By recording EEG signals from healthy subjects imagining the English vowels /a/ and /u/, they applied common spatial pattern (CSP) to design spatial filters and used nonlinear SVM for classification, achieving an accuracy rate of 68% to 78%. This study effectively addressed the challenge of extracting features related to vowel imagery from EEG signals, providing crucial methodological guidance for subsequent research. Building on this, Min et al. (2016) employed the Extreme Learning Machine (ELM) algorithm combined with sparse regression feature selection techniques to classify single-trial EEG signals from five subjects imagining vowels. They found that ELM and its variants (ELM-R, ELM-L) outperformed SVM and linear discriminant analysis (LDA) in the gamma frequency band (30–70 Hz), achieving a maximum accuracy of 99%. This work offered new insights into optimizing classification algorithms for silent speech BCIs. Chengaiyan & Anandan (2022) further refined vowel classification in speech imagery using EEG signal decomposition techniques and advanced machine learning methods such as multi-class SVM (MSVM) and RF. Their study demonstrated the reliability of brain connectivity estimators and machine learning techniques in vowel recognition, providing new empirical support for the development of speech imagery recognition technology. This series of studies not only progressively improved vowel imagery classification accuracy but also deepened our understanding of EEG signal processing and vowel imagery recognition mechanisms.
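A baseline of this kind (CSP spatial filtering followed by a nonlinear SVM) can be assembled from MNE-Python and scikit-learn. The sketch below runs on synthetic placeholder epochs and illustrates the pipeline rather than reproducing the exact configuration of DaSalla et al. (2009).

```python
import numpy as np
from mne.decoding import CSP
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical epoched data: 100 trials x 32 channels x 500 samples,
# labels 0 = imagined /a/, 1 = imagined /u/ (synthetic placeholders).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32, 500))
y = rng.integers(0, 2, size=100)

clf = make_pipeline(
    CSP(n_components=4, log=True),  # spatial filters -> log-variance features
    SVC(kernel="rbf", C=1.0),       # nonlinear SVM classifier
)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```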

As research on vowel imagery has deepened, researchers have increasingly recognized the need to incorporate consonant processing into BCI systems to more comprehensively simulate natural speech communication. Given that actual speech communication involves not only vowels but also consonants, particularly the prevalence of CVC structured words in the language, researchers have begun exploring how to decode combinations of consonants and vowels in BCI systems. This direction represents a significant step forward in the development of speech imagery-based BCI systems.

The work by Sandhya et al. (2018) marks an early exploration in this field. They investigated relative power changes during multiple speech imagery tasks, with a particular focus on frequency bands in EEG signals. By analyzing EEG signals from healthy subjects imagining and articulating CVC words, they found that the theta band exhibited higher relative power during imagined speech, while the alpha band dominated during actual speech. This suggests that different speech imagery tasks may activate distinct frequency bands in the brain. Building on the work of Sandhya et al. (2018), Chengaiyan et al. (2018) further explored the power and phase synchronization effects in EEG signals. Using the PLV to measure phase synchrony between brain regions during speech imagery, they discovered that the frontal and temporal electrodes in the left hemisphere showed the highest phase locking in the alpha and theta bands. This work highlights the critical role of functional connectivity between specific brain regions in decoding tasks during speech imagery. Sandhya et al. (2015) focused on analyzing the neural correlations of brain regions during the speech imagery of CVC words. Using brain connectivity estimators such as EEG coherence, partial directed coherence (PDC), and directed transfer function (DTF), they found that the frontal and temporal regions of the left hemisphere were more active during both imagined and actual articulation. This study provides new insights into the neural interactions in the brain during speech imagery and offers neurological support for decoding CVC words. Chengaiyan, Retnapandian & Anandan (2020) further advanced this direction by integrating deep learning techniques, such as recurrent neural networks (RNN) and deep belief networks (DBN), with brain connectivity estimators to focus on vowel recognition from EEG signals. In their experiments, DBN demonstrated higher classification accuracy compared to RNN, providing strong empirical support for the use of deep learning techniques in speech imagery decoding.

From the initial CSP and SVM-based classification methods to advanced techniques involving deep learning networks (such as CNN, RNN, DBN, and capsule networks (CapsNet)) and brain connectivity analysis, researchers have continuously deepened their understanding of vowel and vowel-consonant combination information in EEG signals. The development of these technologies has not only complemented each other methodologically but has also made significant progress in improving classification accuracy and the practicality of BCI systems.

Chinese characters/words/long words

In studies using Chinese characters as materials for pronunciation imagery, directional characters such as “左(left)” and “右(right)” have become the primary focus. As research has progressed, the emphasis has shifted from simple characters to those with tonal and rhythmic features, and from single speech imagery tasks to those combining auditory and visual conditions. In research on words and word pairs, although the materials differ, the mental tasks are similar. Notably, the use of ear EEG (ear-EEG) as an alternative to whole-scalp EEG has emerged as a novel approach.

Wang et al. (2016) made pioneering contributions in this field by introducing and applying the novel concept of “speech imagery” to BCI systems. They conducted a study to explore the impact of silent reading on mental tasks within a BCI system. They integrated speech imagery (i.e., silently reading Chinese characters) into mental tasks and found that adding speech imagery significantly improved task accuracy, with the average accuracy increasing from 76.3% to 82.3%. They also evaluated the time stability of EEG signals using Cronbach’s alpha, revealing that the tasks incorporating speech imagery exhibited higher signal stability, providing a more reliable signal source for BCI systems. Furthermore, the classification of Mandarin tones and word pairs has also received widespread attention. Li & Chen (2020) conducted in-depth studies on Mandarin tone classification, analyzing factors that affect the classification accuracy of EEG signals. This work laid the foundation for the application of BCI technology in Mandarin speech synthesis systems and contributed to the theoretical development of speech-imagery-based BCI systems. Borirakarawin & Punsawad (2023) explored hybrid BCI systems that combine auditory stimuli with speech imagery, while Zhang, Li & Chen (2020) further explored tone classification under audiovisual conditions, and their experimental results showed that the introduction of audiovisual conditions significantly improved classification accuracy, demonstrating that multimodal inputs can effectively enhance BCI system performance and provide new experimental evidence for Mandarin tone-based BCI applications.
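For reference, Cronbach’s alpha, used above as a measure of signal stability, is straightforward to compute from a matrix of repeated measurements. The function below is a generic implementation of the standard formula; treating repeated sessions or trials as the “items” is an illustrative assumption, not the exact procedure of Wang et al. (2016).

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_observations, n_items) score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score),
    where k is the number of items (e.g., repeated sessions or trials).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var / total_var)
```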

In research on words and word pairs, innovative algorithms and techniques have emerged continuously. Mahapatra & Bhuyan (2022) achieved multi-class imagined speech (vowels and words) classification based on EEG signals by constructing a deep model that integrates temporal convolutional networks (TCN) with CNN, combined with discrete wavelet transform (DWT) preprocessing. They achieved an overall accuracy of 96.49% on data from 15 participants, providing an efficient decoding solution for non-invasive silent brain-computer interfaces. Pan et al. (2023) employed a light gradient boosting machine (LightGBM) for feature classification in Chinese character speech imagery BCI systems, and the results showed that LightGBM was more accurate and applicable than traditional classifiers. Tsukahara et al. (2019) explored EEG frequency components related to speech imagery and found significant event-related synchronization (ERS) in the left hemisphere’s alpha band under speech imagery conditions, emphasizing the importance of selecting appropriate electrode positions for EEG frequency component recognition. Kaongoen, Choi & Jo (2021) were the first to explore BCI systems based on ear EEG (ear-EEG), providing initial empirical support for the application of ear-EEG in the BCI field.

These studies not only explored various EEG signal processing and feature extraction methods but also experimented with multiple classifiers to improve the decoding accuracy of EEG signals. Furthermore, they considered the complexity of speech imagery, such as semantics, rhythm, and brain activity under different stimulus conditions. These efforts are particularly significant in optimizing BCI system design, improving signal decoding accuracy, and achieving effective classification of Mandarin tones and word pairs.

SI-BCI neural encoding

SI-BCI neural encoding refers to the process of translating distinct user intentions into central nervous system signals within the speech imagery paradigm, characterized by identifiable brain signal features (Xu et al., 2021). EEG technology is employed to detect brain signals that encode user intentions, which are subsequently decoded by SI-BCI neural decoding algorithms to identify these intentions. Brain signals acquired through EEG technology exhibit multiple features in the time domain, frequency domain, and spatial domain. Therefore, SI-BCI neural encoding can utilize these three types of features for encoding, such as mean, variance, power spectral density (PSD), wavelet transform (WT), short-time Fourier transform (STFT), DWT, and CSP. Figure 6 illustrates the process of SI-BCI neural encoding.

Figure 6. SI-BCI neural encoding process.


The SI-BCI neural encoding process involves three main steps: EEG signal acquisition, preprocessing, and feature extraction. Depending on the research context, EEG signal acquisition may involve electrode caps with configurations ranging from eight channels to more than 64 channels. Common acquisition devices include Neuroelectrics (NE), SynAmps 2, TMSi SAGA, and EMOTIV, among others. As the signal acquisition process predominantly relies on hardware, the quality of the acquired signals can be significantly influenced by the experimenter’s proficiency. Thus, we present only common electrode cap configurations and device types to offer essential references for researchers.

Signal preprocessing

EEG signal preprocessing involves filtering out irrelevant or interfering signals from the raw EEG data acquired by the device and selecting the signal channels most relevant to speech imagery from specific brain regions. Due to the low signal-to-noise ratio (SNR) of EEG signals, artifacts often occur during acquisition. Therefore, current research focuses on using techniques such as filtering, wavelet transforms, adaptive filters, and statistical methods to remove noise and artifacts, thereby improving signal quality. Jahangiri & Sepulveda (2017) found that the alpha band (8–13 Hz) and beta band (14–30 Hz) yield higher classification performance in speech imagery classification tasks. Consequently, during time-domain and time-frequency analysis, signals are typically filtered within the 8–30 Hz range, with different bandpass filters selected based on specific requirements. Moattari, Parnianpour & Moradi (2017) proposed using independent component analysis (ICA) based on higher-order non-Gaussianity for the source separation stage of preprocessing. Nitta et al. (2023) extracted linguistic representations of Japanese vowels using principal component analysis (PCA) and structural modeling (SM), and demonstrated the feasibility of extracting linguistic features from EEG signals by classifying them with a CNN. Sree, Kavitha & Divya (2023) evaluated various preprocessing methods for EEG signals based on speech imagery. Through a comprehensive comparison of different preprocessing techniques, they identified the optimal approach based on mean squared error (MSE) and peak signal-to-noise ratio (PSNR).
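A typical preprocessing chain of this kind (8–30 Hz band-pass filtering followed by ICA-based artifact removal) can be sketched with MNE-Python as follows; the data, channel count, and the excluded component index are placeholders for illustration, not settings from the cited studies.

```python
import numpy as np
import mne

# Hypothetical 32-channel recording, 60 s at 250 Hz (synthetic placeholder data)
sfreq, n_ch = 250.0, 32
info = mne.create_info([f"EEG{i:02d}" for i in range(n_ch)], sfreq, ch_types="eeg")
raw = mne.io.RawArray(np.random.randn(n_ch, int(sfreq * 60)), info)

# Band-pass to the 8-30 Hz range commonly used for speech imagery (alpha + beta)
raw.filter(l_freq=8.0, h_freq=30.0, fir_design="firwin")

# ICA-based artifact removal: fit, mark suspect components, reconstruct the signal
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
ica.exclude = [0]               # e.g., an ocular component identified by inspection
raw_clean = ica.apply(raw.copy())
```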

Feature extraction

The feature extraction stage is the core of SI-BCI neural encoding. This process involves extracting useful information from the acquired EEG signals, which is then used in subsequent neural decoding to analyze brain states. Feature extraction algorithms can be categorized into spatial domain methods, time domain methods, frequency domain methods, time-frequency domain methods, and nonlinear feature extraction methods.

In the spatial domain, the main research trend has shifted from traditional CSP and its variants to methods such as Mel-frequency cepstral coefficients (MFCCs). Overall, new methods in the spatial domain are continuously emerging. Wang et al. (2021) proposed a feature extraction method combining causal networks and CSP, achieving higher classification accuracy using a binary quantum particle swarm optimization-based extreme learning machine. Huang et al. (2022) introduced a novel algorithm called sub-time window filter bank common spatial pattern (STWFBCSP), aimed at improving sequence encoding-based active BCIs. By subdividing time windows and applying multi-frequency filtering, STWFBCSP can extract more refined features, thereby enhancing classification accuracy. The final average classification accuracy of STWFBCSP reached 84.87%, compared to 70.49% for the traditional CSP algorithm, demonstrating its superior performance and highlighting the significant potential of time window subdivision and multi-frequency analysis in improving BCI performance. Alizadeh & Omranpour (2023) proposed a novel multi-class CSP feature extraction method, feeding the extracted features into an ensemble learning classifier (including logistic regression, K-nearest neighbors, decision trees, SVM, and Gaussian naive Bayes). After multiple rounds of cross-validation, the effectiveness of the proposed algorithm was confirmed. This combined validation approach effectively inspires algorithmic fusion and innovation. Table 1 summarizes studies combining CSP with other feature classification algorithms.

Table 1. CSP combined with other feature classification algorithms.

A | M | P | D | S | DE | E
DaSalla et al. (2009) | CSP, SVM | /a/, /u/ | Private data | 3 (2 m and 1 f, age 26–29) | BioSemi ActiveTwo | 68–78
Agarwal et al. (2020) | CSP, SVM | /a/, /u/, /rest/ | DaSalla et al. (2009) | 3 (2 m and 1 f, age 26–29) | BioSemi ActiveTwo | 87.56
Rostami & Moradi (2015) | EMBCSP, SVM | /a/, /u:/ | Private data | 5 (3 m and 2 f, age 23–33) | N/A | Average: 89.42
Wang et al. (2016) | CSP, SVM | Chinese characters “左(Left), 壹(One)” | Private data | 10 (7 m and 3 f, age 22–28) | SynAmps 2 | Average: 82.3
Wang et al. (2019) | CSP, CCF, PLV, SVM | Chinese character “移(Move)” | Private data | 10 (8 m and 2 f, age 22–28) | SynAmps 2 | Average: 74.3
Zhang, Li & Chen (2020) | CSP, SVM | /ba/ | Private data | 14 (6 m and 8 f, age 19–22) | N/A | Under combined stimuli: 80.1
Wang et al. (2021) | CSP, SVM | Chinese character “移(Move)” | Private data | 10 (8 m and 2 f, age 22–28) | SynAmps 2 | Average: 73.9
Zhao, Liu & Gao (2021) | CSP, DWT, SVM, ELM | Chinese initial consonant | Private data | 8 (6 m and 2 f, average age 23.67) | N/A | Highest: 73.04
Wang et al. (2022) | CSP, SVM | Chinese character “壹(One)” | Private data | 12 (9 m and 3 f, age 22–26) | Neusen W2 | Average: 68.94

Note:

A, authors; M, methods; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, evaluation indicators (accuracy: %); m, males; f, females; CSP, common spatial pattern; SVM, support vector machine; EMBCSP, evidential multi-band common spatial pattern; CCF, cross correlation function; PLV, phase locking value; DWT, discrete wavelet transform; ELM, Extreme Learning Machine; SI, speech imagery; MI, motor imagery.
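To make the sub-band idea behind filter-bank variants such as STWFBCSP concrete, the sketch below band-pass filters each trial into several frequency ranges, applies CSP per band, and concatenates the log-variance features before classification. It is a simplified, single-time-window illustration using MNE-Python and scikit-learn, not the published STWFBCSP algorithm, and the bands and data are illustrative assumptions.

```python
import numpy as np
from mne.filter import filter_data
from mne.decoding import CSP
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def filter_bank_csp_features(X, y, sfreq, bands=((8, 12), (12, 16), (16, 20), (20, 30)),
                             n_components=2):
    """Concatenate CSP log-variance features from several sub-bands.

    X has shape (n_trials, n_channels, n_samples). Note: for an unbiased
    accuracy estimate, CSP should be fit inside each cross-validation fold.
    """
    feats = []
    for lo, hi in bands:
        Xb = filter_data(X.astype(np.float64), sfreq, lo, hi, verbose=False)
        csp = CSP(n_components=n_components, log=True)
        feats.append(csp.fit_transform(Xb, y))
    return np.concatenate(feats, axis=1)

# Usage on synthetic placeholder data
X = np.random.randn(60, 32, 500)
y = np.random.randint(0, 2, 60)
F = filter_bank_csp_features(X, y, sfreq=250.0)
clf = make_pipeline(StandardScaler(), SVC()).fit(F, y)
```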

Cooney, Folli & Coyle (2018) evaluated the impact of three feature sets—linear, nonlinear, and MFCCs—on classification performance. The study found that MFCC features performed best in both decision tree and SVM classifiers, with SVM achieving an average accuracy of 20.80% on MFCC features, significantly outperforming linear (15.91%) and nonlinear (14.67%) features. This validates the effectiveness of MFCCs in capturing acoustic differences in speech. Martin et al. (2016) introduced the dynamic time warping (DTW) algorithm to address temporal irregularities in speech production, improving the classification performance of SVM. Notably, this marked the first successful classification of individual words during speech imagery tasks. Wu & Chen (2020) proposed a method for extracting temporal envelope features from EEG signals during feature extraction and directly classified these features using SVM. Although the final performance was somewhat lacking compared to other methods, this work explored the relationship between temporal envelope features and speech. Table 2 illustrates other feature extraction algorithms in the spatial domain.

Table 2. Other feature extraction algorithms in spatial domain.

A | M | P | D | S | DE | E
Wang et al. (2021) | CN, CSP, BQPSO, ELM | Chinese character “壹(one)” | Private data | 10 (7 m and 3 f, age 22–28) | SynAmps 2 | Average: 85.4
Huang et al. (2022) | STWFBCSP | Chinese character “右(right)” | Private data | 12 (8 m and 4 f, age 22–26) | N/A | Average: 84.87
Alizadeh & Omranpour (2023) | EM-CSP | Words, English phonemes | 1. Kara One (Zhao & Rudzicz, 2015); 2. Private data | 12 (8 m and 4 f, mean age 27.4) | SynAmps RT | Highest: 98.47
Panachakel & Ramakrishnan (2022) | CSP, LSTM | “in”, “cooperate” | ASU (Nguyen, Karavas & Artemiadis, 2017) | 15 (11 m and 4 f, age 22–32) | BrainProducts ActiCHamp amplifier | Highest: 85.2

Note:

A, authors; M, methods; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, evaluation indicators (accuracy: %); m, males; f, females; CN, causal network; CSP, common spatial pattern; BQPSO, binary quantum particle swarm optimization; ELM, Extreme Learning Machine; STWFBCSP, sub-time window filter bank common spatial pattern; EM-CSP, efficient-multiclass CSP; LSTM, long short-term memory; SI, speech imagery; MI, motor imagery.

Time-domain methods generally involve selecting features such as the mean, variance, kurtosis, and skewness of signals from each channel. Research in this area primarily focuses on extracting features using wavelet transforms and wavelet decomposition for signals of different frequencies. Idrees & Farooq (2016a, 2016b) proposed a dual-pronged strategy: classifying using simple features in the time domain while extracting EEG signal features from the beta, delta, and theta rhythms using wavelet decomposition. This approach demonstrated the effectiveness of time-domain and frequency-domain features in vowel imagery classification, further expanding the potential of EEG signal processing techniques in speech imagery tasks. Saji et al. (2020) advanced this research by classifying vowel signals using wavelet transforms and multiple classifiers, extracting features such as mean, standard deviation, and band power. This study not only confirmed the effectiveness of EEG signal processing techniques in improving vowel imagery classification accuracy but also provided new empirical support for speech imagery-based BCI technology.
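The time-domain statistics and wavelet sub-band features described above can be computed per channel as in the following sketch (using SciPy and PyWavelets); the choice of the db4 wavelet and four decomposition levels is an illustrative assumption rather than a setting from the cited studies.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis, skew

def time_and_wavelet_features(trial, wavelet="db4", level=4):
    """Per-channel time-domain statistics plus wavelet sub-band energies.

    trial: array of shape (n_channels, n_samples); returns a 1-D feature vector.
    """
    feats = []
    for ch in trial:
        stats = [ch.mean(), ch.var(), kurtosis(ch), skew(ch)]
        coeffs = pywt.wavedec(ch, wavelet, level=level)       # [cA_n, cD_n, ..., cD1]
        energies = [np.sum(c ** 2) / c.size for c in coeffs]  # mean energy per sub-band
        feats.extend(stats + energies)
    return np.asarray(feats)
```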

Frequency-domain methods have been explored in greater depth, particularly PSD, DWT, and MFCCs. As noted above, Cooney, Folli & Coyle (2018) analyzed imagined-speech EEG from 14 participants in the Kara One dataset and found that MFCC features outperformed linear and nonlinear feature sets in both decision tree and SVM classifiers, demonstrating that MFCC-based feature extraction can significantly improve the accuracy of imagined speech decoding and providing important methodological support for the development of SI-BCIs. Time-frequency domain methods combine information from both the time and frequency domains, commonly using techniques such as STFT and WT. Pan et al. (2022) investigated the impact of imagined syllable rhythms on EEG amplitude spectra, offering new insights for the development of speech imagery-based BCIs. Kamble, Ghare & Kumar (2022) preprocessed EEG signals through bandpass filtering and downsampling in the time domain, combined with time-frequency conversion using the smoothed pseudo-Wigner-Ville distribution (SPWVD). This mapped 1D signals into 2D time-frequency images, enabling optimized CNNs to automatically extract multi-scale time-frequency features. This approach achieved efficient speech imagery classification with a maximum accuracy of 94.82%, validating the advantages of SPWVD in time-frequency resolution and cross-term suppression and providing a time-frequency analysis-based method for SI-BCI systems.
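As a simple illustration of frequency-domain features, the sketch below computes Welch band power and, treating the EEG channel as a generic 1-D signal, MFCCs via librosa; using librosa for EEG is an assumption for demonstration purposes and not necessarily the implementation used in the cited studies.

```python
import numpy as np
from scipy.signal import welch
import librosa

def band_power(x, fs, band):
    """Mean Welch PSD power of a single channel within a frequency band."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), 256))
    mask = (f >= band[0]) & (f <= band[1])
    return pxx[mask].mean()

def mfcc_features(x, fs, n_mfcc=13):
    """MFCCs of one EEG channel, treated as a generic 1-D signal (illustrative)."""
    mfcc = librosa.feature.mfcc(
        y=x.astype(np.float32), sr=int(fs), n_mfcc=n_mfcc,
        n_fft=512, hop_length=256, n_mels=32,
    )
    return mfcc.mean(axis=1)  # average over frames -> one value per coefficient

# Example on a synthetic 5-s channel sampled at 250 Hz
fs = 250.0
x = np.random.randn(int(fs * 5))
alpha_power = band_power(x, fs, (8, 13))
mfccs = mfcc_features(x, fs)
```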

Nonlinear feature extraction methods refer to the use of deep learning to automatically extract features for subsequent classification tasks. Unlike traditional machine learning, deep learning-based feature extraction methods are often not separable from feature classification methods. This is primarily because, in contrast to deep learning, traditional machine learning requires manual intervention during feature extraction to transform raw data into meaningful features. The core focus of deep learning research lies in developing more efficient and convenient algorithms for feature extraction while maximizing final classification performance. Wang et al. (2022) proposed a parallel CNN model based on a multi-band brain network, utilizing correlation coefficients and phase-locking values to describe inter-channel synchrony and correlation. This approach effectively captures EEG signal synchronization and correlation features, leading to improved classification accuracy. Ramirez-Quintana et al. (2023) introduced a novel deep capsule neural network, CapsVI, designed to recognize vowel imagery patterns from EEG signals. Their model achieved an average accuracy of 93.32%, setting a new benchmark for English vowel recognition. Retnapandian & Anandan (2023) focused on identifying phonemes corresponding to vowels from EEG signals. They trained an RNN using multi-trial data and compared the classification performance between single-trial and multi-trial datasets. By incorporating refined feature extraction techniques such as wavelet decomposition and multifractal analysis, their study further enhanced the classification capabilities of the model, laying the foundation for the development of efficient and precise speech imagery-based BCI systems. Macías-Macías et al. (2020) developed sCNN and sCapsNet models based on spectro-temporal receptive fields (STRFs). Notably, sCapsNet demonstrated superior performance in vowel classification, highlighting the significant potential of deep learning in vowel imagery classification.
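The following is a minimal PyTorch sketch of a 1-D CNN that learns features directly from raw epochs and outputs class logits; it is a generic illustration of deep feature extraction, not a reproduction of any architecture cited above, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EEGConvNet(nn.Module):
    """Compact 1-D CNN: input (batch, n_channels, n_samples) -> class logits."""
    def __init__(self, n_channels=32, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=25, padding=12),
            nn.BatchNorm1d(16),
            nn.ELU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=11, padding=5),
            nn.BatchNorm1d(32),
            nn.ELU(),
            nn.AdaptiveAvgPool1d(1),   # global average pooling over time
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        z = self.features(x).squeeze(-1)
        return self.classifier(z)

# Forward pass on a synthetic batch of 8 epochs (32 channels x 500 samples)
model = EEGConvNet()
logits = model(torch.randn(8, 32, 500))   # shape: (8, 2)
```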

In addition to the commonly used feature extraction algorithms mentioned above, several novel methods have also been applied to SI-BCI feature extraction. Sikdar et al. (2018), Sikdar, Roy & Mahadevappa (2018) explored the roles of multifractal and chaotic parameters in different imagery tasks for International Phonetic Alphabet (IPA) vowel recognition, providing new analytical methods for identifying brain regions activated during different vowel imagery processes. Martin et al. (2016) utilized high-gamma band (70–150 Hz) temporal features and an SVM model, introducing nonlinear temporal alignment via an SVM kernel, achieving 88% classification accuracy in a binary classification framework. Nguyen, Karavas & Artemiadis (2017) proposed a novel approach based on covariance matrix descriptors, which reside on the Riemannian manifold, and employed a relevance vector machine (RVM) for classification. This method was tested on EEG signals from multiple subjects and performed well across various categories, including imagined vowels, short words, and long words, with classification accuracy significantly exceeding chance levels in all cases. Kalaganis et al. (2023) leveraged approximate joint diagonalization (AJD) for covariance estimation, further advancing research on decoding human speech imagery from EEG signals.
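The covariance-descriptor approach can be sketched with the pyRiemann library: trial covariance matrices are projected to the tangent space and classified with a standard linear model. Logistic regression here stands in for the relevance vector machine used by Nguyen, Karavas & Artemiadis (2017), which is not available in scikit-learn, and the data are synthetic placeholders.

```python
import numpy as np
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic placeholder epochs: 80 trials x 32 channels x 500 samples
rng = np.random.default_rng(1)
X = rng.standard_normal((80, 32, 500))
y = rng.integers(0, 2, size=80)

# Covariance descriptors live on the manifold of SPD matrices; TangentSpace
# projects them into a vector space where a standard classifier applies.
clf = make_pipeline(
    Covariances(estimator="oas"),
    TangentSpace(),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(clf, X, y, cv=5).mean())
```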

In summary, during SI-BCI neural decoding, optimizing the performance of traditional algorithms (e.g., CSP, RF) while further exploring deep learning-based approaches (e.g., CNN, RNN) has become a key trend in EEG signal preprocessing and feature extraction. Meanwhile, unconventional algorithms (e.g., covariance matrix-based methods) have also garnered increasing attention.

SI-BCI neural decoding

SI-BCI neural decoding refers to the process of identifying a user’s intent by employing efficient decoding algorithms on neural-encoded brain signals, which are then translated into syllables, words, Chinese characters, or other linguistic units. These decoded outputs are subsequently presented through visual or auditory modalities (Saha & Fels, 2019). Figure 7 illustrates the SI-BCI neural decoding process.

Figure 7. SI-BCI neural decoding process.


Based on the decoding algorithms employed, current SI-BCI decoding methods can be divided into two categories: machine learning and deep learning. In machine learning algorithms, the feature extraction and feature classification steps are separate, so innovative combinations of different types of feature extraction algorithms and feature classification algorithms have led to numerous high-performance decoding algorithms. It is worth noting that CSP (Cai et al., 2024) for feature extraction combined with SVM (Rezvani et al., 2024) for feature classification remains a widely adopted baseline. Additionally, methods such as wavelet decomposition (Ji et al., 2024), covariance matrices (Luo et al., 2024), relevance vector machines (Khatti & Grover, 2024), and extreme learning machines (Wei et al., 2024) have also demonstrated distinctive value. Deep learning algorithms, on the other hand, use one or more neural network models to complete both feature extraction and feature classification. Among these, convolutional neural networks (Liang et al., 2023) have achieved excellent performance in SI-BCI neural decoding. In addition, capsule networks (Aboussaleh et al., 2024) and gated mechanisms (Li et al., 2024) have also shown good potential.

However, in the study of SI-BCI decoding methods, the performance of decoding algorithms depends not only on the characteristics of the algorithms themselves but also closely on the quality and diversity of the data. Both machine learning and deep learning approaches require large amounts of high-quality data for model training and validation. Therefore, before systematically reviewing neural encoding, understanding the current usage of existing datasets is of significant importance for further advancing SI-BCI technology.

Datasets

We have compiled statistics on existing public and private SI-BCI datasets, and Fig. 8 presents their usage as percentages. Current research predominantly relies on private datasets from individual laboratories, with public dataset usage being relatively limited. This phenomenon is influenced by several factors, including the complexity of data collection, the need for data quality assurance, and limitations in participant numbers. Among these, the most frequently used public datasets are the Kara One, DaSalla, and ASU datasets. Thus, we provide an introduction to these three primary datasets, along with a recently released high-channel EEG dataset based on Chinese language stimuli.

Figure 8. Proportion of usage between public and private datasets.


Kara One (Zhao & Rudzicz, 2015): The Kara One dataset was collected and organized by the Department of Computer Science at the University of Toronto in collaboration with the Toronto Rehabilitation Institute. It integrates three modalities (EEG, facial tracking, and audio) during imagined and vocalized phonemes and single-word prompts, providing access to the brain’s language and speech production centers. Each participant was seated in a chair in front of a computer monitor. A Microsoft Kinect (v.1.8; Microsoft Corp., Redmond, WA, USA) camera was placed beside the screen to record the participant’s facial information and speech. Each trial consisted of four consecutive states:

1. A 5-s rest state, during which participants were instructed to relax and clear their minds.

2. A stimulus state, where a prompt text appeared on the screen, and its associated auditory utterance was played through computer speakers. This was followed by a 2-s period during which participants prepared their articulatory posture to begin vocalizing the prompt.

3. A 5-s imagined speech state, during which participants imagined speaking the prompt without actual movement.

4. A speaking state, where participants vocalized the prompt aloud. The Kinect sensor recorded audio and facial features during this phase.

ASU (Nguyen, Karavas & Artemiadis, 2017): The ASU dataset, provided by Arizona State University, is widely used for research and development in EEG-based BCI systems, particularly in identifying and decoding brain activity patterns related to imagined speech. A key feature of this dataset is that it includes EEG signals from multiple subjects performing specific imagined speech tasks. Participants were instructed to imagine hearing words or phrases rather than vocalizing them. The dataset comprises multiple EEG channels to capture brain activity. The task design incorporates various conditions, such as imagining different words or phrases, which aids researchers in exploring the patterns of brain activity during the process of speech imagination.

DaSalla (DaSalla et al., 2009): The DaSalla dataset was compiled and contributed by DaSalla and their team. The dataset captures speech-related potentials (SRPs) in EEG signals for the imagined vowels /a/ and /u/. Compared with the imagined speech paradigms of other datasets, the SRP paradigm exhibits better variance. Each trial lasts approximately 1 min. It begins with a beep at 0 s to prepare the subject for the first step. The trial then proceeds through three stages:

1. A cross appears on the display for a random duration of either 2 or 3 s to prevent erroneous event-related potentials (ERPs) in subsequent steps.

2. Vowel imagination is performed in this step. It is represented by an open mouth outline, a rounded mouth, or a cross, indicating the /a/, /u/, or control state imagination tasks, respectively. This step lasts for 2 s. In this study, the control state is referred to as /no vowel/ since it involves no vowel. Each subject performed 150 trials, with 50 trials per task. The presentation of /a/, /u/, and /no vowel/ imagination tasks was randomized to train subjects to develop speech imagination tasks solely through endogenous stimuli.

3. A blank image is displayed on the screen for 3 s to indicate a rest period and the end of the trial, allowing the subject to wait for the next trial.

ChineseEEG (Mou et al., 2024): The ChineseEEG dataset is a high-channel EEG dataset based on Chinese language stimuli, providing a rich resource for semantic alignment and neural decoding. It includes high-density EEG data and synchronized eye-tracking data from 10 participants during approximately 13 h of silent reading of Chinese texts. The data is derived from the Chinese versions of two well-known novels, The Little Prince and Dream of the Wolf King. Each participant was exposed to diverse Chinese language stimuli, which is crucial for studying the long-term neural dynamics of language processing in the brain. Additionally, the 128-channel high-density EEG data offers superior spatial resolution, enabling precise localization of brain regions involved in language processing. With a sampling rate of 1 kHz, the dataset effectively captures the dynamic changes in neural representations during reading.

Machine learning based SI-BCI neural decoding

Most machine learning studies employ CSP for feature extraction and SVM for classification. CSP finds a set of spatial filters that maximize the variance difference between classes of EEG signals (Cai et al., 2024), while SVM uses mathematical optimization to find a hyperplane that maximally separates the two classes in the resulting feature space (Rezvani et al., 2024); together they form the most common baseline approach. Using CSP combined with SVM, DaSalla et al. (2009) achieved classification accuracies of 68% to 78% for imagined pronunciation of the English vowels /a/ and /u/, laying the foundation for this field. Agarwal et al. (2020) combined CSP, statistical features (standard deviation, root mean square (RMS), energy), and wavelet transforms, and employed a Random Forest classifier for multi-class classification of the imagined vowels /a/ and /u/ and the rest state, achieving a maximum accuracy of 89% on data from three subjects and demonstrating the effectiveness of integrating CSP with statistical features for silent speech decoding. Rostami & Moradi (2015) combined CSP with multiple bandpass filters and evidence theory, achieving classification accuracy superior to traditional CSP methods. Table 3 lists studies using CSP, SVM, and their derived algorithms.
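To make this baseline concrete, the sketch below chains CSP spatial filtering with an SVM classifier in a scikit-learn pipeline built on MNE-Python. The data, channel count, trial count, and hyperparameters are synthetic placeholders for illustration only and do not reproduce any specific study reviewed here.

```python
# Minimal sketch of the common CSP + SVM baseline for binary speech-imagery
# classification. X is an EEG epoch array (n_trials, n_channels, n_samples)
# and y holds binary labels, e.g., imagined /a/ vs. imagined /u/.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from mne.decoding import CSP

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32, 500))   # 100 trials, 32 channels, 500 samples (placeholder data)
y = np.repeat([0, 1], 50)                 # two imagined-vowel classes

clf = Pipeline([
    ("csp", CSP(n_components=4, log=True)),   # spatial filters maximizing the between-class variance ratio
    ("svm", SVC(kernel="rbf", C=1.0)),        # nonlinear decision boundary on the CSP log-power features
])

scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")
```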

Table 3. Adopting CSP, SVM and their derived algorithms.

A M P D S DE E
Cooney, Folli & Coyle (2018) MFCCs,
SVM,
/iy/, /uw/, /piy/, /tiy/, /diy/, /m/, /n/ Kara One (Zhao & Rudzicz, 2015) 12 (8 m and 4 f/age mean 27.4) SynAmps RT 35
“pat”, “pot”, “knew”, “gnaw”
Martin et al. (2016) TDC
SVM+DTW
Words Private data 5 g.USBamp Average: 57.7
Highest: 88.3
Wu & Chen (2020) TE /iy/, /uw/, /piy/, /tiy/, /diy/, /m/, /n/ Kara One (Zhao & Rudzicz, 2015) 12 (8 m and 4 f/age mean 27.4) SynAmps RT Normalized covariance: 0.57
SVM “pat”, “pot”, “knew”, “gnaw”
Chengaiyan & Anandan (2022) MSVM /a/, /e/, /i/, /o/, /u/ Private data 5 (5 m / age from 19–21) Emotiv Epoc wireless EEG Highest: 80
RF

Note:

A, authors; M, methods; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, evaluation indicators (accuracy: %); m, males; f, females; MFCCs, Mel-frequency cepstral coefficients; AC, autoregressive coefficient; SVM, support vector machines; TDC, time domain characteristics; DTW, dynamic time warping; TE, temporal envelope; MSVM, multi-class support vector machine; RF, random forest.

In addition to the aforementioned methods, several other feature extraction and classification approaches have also attracted attention. Idrees & Farooq (2016a, 2016b) designed vowel classification methods based on time-domain characteristics and on wavelet decomposition with statistical features, achieving strong results with three subjects. Nguyen, Karavas & Artemiadis (2017) proposed a method based on Riemannian manifold features and RVM that achieved high accuracy in binary classification, although accuracy dropped markedly for three-class classification. To improve three-class accuracy, they later proposed a method combining spatial covariance features with an adaptive weighting model; experimental results showed significant improvement, and, notably, subjects in the BCI-illiterate group also achieved good results. Table 4 summarizes other machine learning feature extraction and classification algorithms.
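As a rough illustration of covariance-based Riemannian features, the sketch below estimates per-trial spatial covariance matrices and projects them to the tangent space before classification. Because RVM is not available in scikit-learn, a logistic-regression stand-in is used purely for illustration; the data and parameters are placeholders, not the settings of Nguyen, Karavas & Artemiadis (2017).

```python
# Hedged sketch of Riemannian-manifold features for multi-class speech imagery.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace

rng = np.random.default_rng(0)
X = rng.standard_normal((90, 32, 500))    # trials x channels x samples (placeholder data)
y = np.repeat([0, 1, 2], 30)              # e.g., imagined /a/, /i/, /u/

clf = Pipeline([
    ("cov", Covariances(estimator="oas")),      # per-trial spatial covariance matrices (SPD)
    ("ts", TangentSpace(metric="riemann")),     # project SPD matrices to a Euclidean tangent space
    ("lr", LogisticRegression(max_iter=1000)),  # stand-in for the RVM used in the cited work
])

print(cross_val_score(clf, X, y, cv=3).mean())
```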

Table 4. Summary of other machine learning feature extraction and feature classification algorithms.

A M P D S DE E
Idrees & Farooq (2016a) TDC /a/, /u/, other Private data 3 (2 m and 1 f) BioSemi ActiveTwo Combination average: 85–100
LC
Idrees & Farooq (2016b) WD /a/, /u/, “no” Private data 3 (2 m and 1 f) BioSemi ActiveTwo Combination average: 81.25–98.75
LC
Moattari, Parnianpour & Moradi (2017) HON-ICA /a/, /u:/ Rostami & Moradi (2015) 5 (3 m and 2 f/age from 23–30) N/A 66.67–93.33
Nguyen, Karavas & Artemiadis (2017) CM “/a/,/i/,/u/”
“in, out, up”
Private data 15 (11 m and 4 f/age from 22–32) BrainProducts ActiCHamp amplifier Highest: 95 (Binary classification), 70 (Three categories)
RVM “cooperate, independent”
Nguyen, Karavas & Artemiadis (2019) SCM Long word: “concentrate” Private data 8 (6 m and 2 f/age from 22–32) BrainProducts ActiCHamp amplifier Average: 52.5
RVM Short word: “split”
Kim, Lee & Lee (2020) ERP “Ah”, Specific nouns Private data 2 (2 f/age from 22–27) ActiCap EEG amplifier Highest combination: 88.1
RLDA
Wang et al. (2020) PSD, SampEn
ELM
Chinese character “移(move)” Private data 12 (8 m and 4 f/age from 20–26) SynAmps 2 Average: 83
Pan et al. (2023) WPD Chinese character “左(Left), 壹(One)” Wang et al. (2021) 8 (6 m and 2 f/age from 22–27) SynAmps 2 Average: 90
LightGBM

Note:

A, authors; M, methods; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, the evaluation indicators (accuracy: %); m, males; f, females; TDC, time domain characteristics; LC, linear classifier; WD, wavelet decomposition; HON-ICA, higher orders of Non-Gaussianity independent component analysis; CM, covariance matrix; SCM, spatial covariance matrix; RVM, relevance vector machines; ERP, event-related potential; RLDA, regularized linear discriminant analysis; PSD, power spectral density; SampEn, sample entropy; ELM, Extreme Learning Machine; WPD, wavelet packet decomposition; LightGBM, light gradient boosting machine.

Deep learning based SI-BCI neural decoding

There are numerous algorithms in deep learning, and within the field of SI-BCI, CNNs, RNNs, and capsule neural networks have garnered significant attention. CNNs are deep feedforward artificial neural networks that have shown strong performance in image processing. A CNN consists of an input layer, several hidden layers, and an output layer built from trainable weights and biases. Each neuron computes a weighted sum (dot product) of its inputs and passes the result through a nonlinear activation function. The hidden layers comprise a series of convolutional and pooling layers that apply learned filters across the input and progressively reduce its spatial resolution. Figure 9 is a typical CNN model generated using our visualization tool.

Figure 9. Typical convolutional neural network model.

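The sketch below illustrates the structure just described with a compact temporal-plus-spatial convolution model for EEG trials in PyTorch. The layer sizes and kernel shapes are illustrative assumptions and do not correspond to any specific architecture cited in Table 5.

```python
# Minimal sketch of a compact CNN for EEG windows of shape (batch, 1, channels, samples).
import torch
import torch.nn as nn

class SpeechImageryCNN(nn.Module):
    def __init__(self, n_channels=32, n_samples=500, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=(1, 25), padding=(0, 12)),  # temporal convolution along time
            nn.Conv2d(8, 16, kernel_size=(n_channels, 1)),          # spatial convolution across electrodes
            nn.BatchNorm2d(16),
            nn.ELU(),
            nn.AvgPool2d(kernel_size=(1, 4)),                       # pooling along the time axis
            nn.Dropout(0.5),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy pass
            n_feats = self.features(torch.zeros(1, 1, n_channels, n_samples)).numel()
        self.classifier = nn.Linear(n_feats, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

model = SpeechImageryCNN()
logits = model(torch.randn(4, 1, 32, 500))   # 4 EEG trials of 32 channels x 500 samples
print(logits.shape)                          # torch.Size([4, 2])
```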

Research on CNNs tends to combine them with existing feature extraction methods, feeding the extracted features into the CNN to capture finer details. Saha & Fels (2019) proposed a novel hybrid deep neural network architecture for EEG speech imagery classification; notably, this was the first introduction of CNNs into the field of speech imagery brain-computer interfaces and spurred substantial follow-up research. Macías-Macías et al. (2020) first used CSP to extract signal features and then employed CNNs for classification, a combination that mitigates the loss of salient features during CNN feature extraction. Lee et al. (2021) used a CNN model to predict the word length corresponding to the input EEG signal and explored the impact of word length on speech imagery classification performance, concluding that training strategies constrained by word length can effectively improve overall classification accuracy. Park, Yeom & Sim (2021) used CNNs to classify brain spatial features formed by EEG channels and predict user states; however, this method was not compared against others, so its relative performance is difficult to assess. Ahn et al. (2022) modified the convolutional layers of CNNs by replacing single convolution kernels with multi-scale kernels, allowing features to be extracted from different frequency bands and yielding high, stable classification performance across datasets. Kwon et al. (2023) proposed an improved deep learning model that incorporates multi-layer perceptrons (MLP) into ShallowConvNet and EEGNet to enhance cross-paradigm subject identification in BCI systems. Table 5 presents applications of CNNs in SI-BCI neural decoding.

Table 5. Application of convolutional neural networks in SI-BCI neural decoding.

A M P D S DE E
Saha & Fels (2019) HDNN “/a/, /i/, /u/” Nguyen, Karavas & Artemiadis (2017) 15 (11 m and 4 f/age from 22–32) BrainProducts ActiCHamp amplifier 71.1–90.7
“in, out, up”
“cooperate, independent”
Saha, Fels & Abdul-Mageed (2019) CNN+LSTM Vowels, words Kara One (Zhao & Rudzicz, 2015) 12 (8 m and 4 f/age mean 27.4) SynAmps RT Highest: 85.23
Macías-Macías et al. (2020) CSP /a/,/u/ DaSalla et al. (2009) 3 (2 m and 1 f/age from 26–29) BioSemi ActiveTwo sCapsNet 71.9
sCNN/sCapsNet sCNN 67.63
Lee et al. (2021) LFSICF “hello, help me, thank you” BCI Competition V3 (BCI Competition Committee, 2022) 15 (age from 26–29) N/A Average: 59.47
“stop, yes”
Park, Yeom & Sim (2021) MI+CNN “hello” Private data 10 (5 m and 5 f/age from 20–30) Compumedics 64-channel EEG and STIM2 Average: 88.25 ± 2.34
Wang & Wang (2022) MBBNPCNN Chinese characters “左(Left), 壹(One)” Wang et al. (2021) 10 (7 m and 3 f/age from 22–28) SynAmps 2 Average: 83.72
Ahn et al. (2022) MSCT “go, stop, in, cooperate” 1. Private data 1.40 (40 m /age from 22–28) BrainAmp Private dataset 62
2. ASU (Nguyen, Karavas & Artemiadis, 2017) 2.15(11 m and 4 f/age from 22–32) BrainProducts ActiCHamp amplifier ASU 72
Cui et al. (2023) TDSC, STFT, DWT, CSP
Deep ConvNet
“/a/, /u/, /i/, /ü/” Li, Pun & Chen (2021) 11(7 m and 4 f/age from 20–30) 64-channel electrode cap Highest: 68.7
“/b_/, /f_/, /j_/, /l_/, /m_/”
Jeong et al. (2022) CNN “I, partner, move, have, drink, box, cup phone” Private data 11 (6 m and 5 f/age from 20–34) BrainVision recorder Percent Valid Correct: 81
GRU
Kwon et al. (2023) ShallowConvNet Word Private data 5 (2 m and 3 f/age from 23–26) BrainProducts’ actiCham 34–50
EEGNet
MLP
Nitta et al. (2023) LPA Japanese syllables Private data 1 (1 f/age 23) g.HIAMP Average: 72.6
PCA
CNN

Note:

A, authors; M, model; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, evaluation indicators (accuracy: %); m, males; f, females; HDNN, hybrid deep neural network; CNN, convolutional neural network; LSTM, long short-term memory; LFSICF, length first speech imagination classification framework; MI, mutual information; MBBNPCNN, parallel convolutional neural network based on multi-band brain networks; MSCT, multiscale convolutional transformer; TDSC, time domain statistical characteristics; STFT, short-time Fourier transform; DWT, discrete wavelet transform; GRU, gated recurrent unit; MLP, multi-layer perceptron; LPA, linear predictive analysis; PCA, principal component analysis.

RNNs are used to process variable-length sequential data, such as time series and sound. An RNN can be viewed as a chain of connected feedforward networks unrolled over time; the network uses temporal correlations to represent the input history and predict outputs. Figure 10 shows a typical recurrent neural network structure. Research on this type of network focuses primarily on classifying and predicting signals that vary continuously over time. Chengaiyan, Retnapandian & Anandan (2020) first used brain connectivity estimators and RNNs to identify vowels, achieving a classification accuracy of 72% and demonstrating the effectiveness of this approach. Retnapandian & Anandan (2023) proposed a method based on EEG subband features and an RNN model: they decomposed the EEG signal into five subbands and extracted energy coefficients (RMS, mean absolute value (MAV), integrated EEG (IEEG), simple square integral (SSI), variance of EEG (VAR), average amplitude change (AAC)) and relative power from each subband as features for multi-class vowel classification, with accuracy approaching 90% over multiple experimental rounds. However, RNNs are trained with the backpropagation through time (BPTT) algorithm, and because of exploding and vanishing gradients, backpropagating over long time intervals is difficult (Gomez et al., 2017; Vorontsov et al., 2017). Long short-term memory (LSTM) networks and gated recurrent units (GRU) have therefore become popular alternatives to plain RNNs. Jeong et al. (2022) used GRUs to decode spatio-temporal frequency features at the sentence level. Hernandez-Galvan, Ramirez-Alonso & Ramirez-Quintana (2023) employed 1D convolution layers and two layers of bidirectional GRUs to extract time-frequency features from EEG signals, using prototypical networks for classification; the average accuracy across two public datasets exceeded 91%. Table 6 presents the applications of RNNs and their variants.

Figure 10. Typical recurrent neural network model.

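As a minimal illustration of the GRU-based alternatives discussed above, the sketch below classifies EEG trials shaped (batch, time, channels) with a bidirectional GRU. All dimensions and class counts are assumed placeholders rather than the configurations of the cited studies.

```python
# Minimal sketch of a GRU-based sequence classifier for EEG speech-imagery trials.
import torch
import torch.nn as nn

class SpeechImageryGRU(nn.Module):
    def __init__(self, n_channels=32, hidden=64, n_classes=5):
        super().__init__()
        self.gru = nn.GRU(input_size=n_channels, hidden_size=hidden,
                          num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        _, h = self.gru(x)                       # h: (num_layers * 2, batch, hidden)
        h = torch.cat([h[-2], h[-1]], dim=1)     # final layer's forward and backward states
        return self.classifier(h)

model = SpeechImageryGRU()
out = model(torch.randn(8, 500, 32))             # 8 trials, 500 time steps, 32 channels
print(out.shape)                                 # torch.Size([8, 5]), e.g., five imagined vowels
```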

Table 6. Application of recurrent neural networks in SI-BCI neural decoding.

A M P D S DE E
Chengaiyan, Retnapandian & Anandan (2020) BCE, RNN, DBN CVC words Private data 6 ( 3 m and 3 f/age mean 20) Placing Ag/AgCl electrodes 80
Hernandez-Galvan, Ramirez-Alonso & Ramirez-Quintana (2023) Proto-Speech Vowels, short words, long words 1. Kara One (Zhao & Rudzicz, 2015) 1. 3 (2 m and 1 f/age from 26–29) SynAmps RT Kara One: 99.89–99.92 (Binary classification)
2. ASU (Nguyen, Karavas & Artemiadis, 2017) 2. 15(11 m and 4 f/age from 22–32) BrainProducts ActiCHamp amplifier 91.51(Multi classification)
ASU: 93.70 (Multi classification)
Jeong et al. (2022) CNN, GRU “I, partner, move, have, drink, box, cup phone” Private data 11 (6 m and 5 f/age from 20–34) BrainVision Recorder Percent valid correct: 81
Retnapandian & Anandan (2023) RNN /a/, /e/, /i/, /o/, /u/ Private data 5 (5 m) Wireless Emotiv EPOC+ Neuro-technology 84.5–88.9

Note:

A, authors; M, model; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, evaluation indicators (accuracy: %); m, males; f, females; BCE, brain connectivity estimators; DBN, deep belief networks; CVC, consonant–vowel–consonant; RNN, recurrent neural network; CNN, convolutional neural network; GRU, gated recurrent unit.

CapsNets aim to address the difficulties CNNs face in handling spatial hierarchies and pose variations in images. Their core advantage is the ability to capture the hierarchical structure and pose variations of objects, providing a more refined and robust feature representation than CNNs. Figure 11 shows a simple capsule network model (Aboussaleh et al., 2024). Macías-Macías et al. (2023) proposed a capsule network-based classification model, CapsK-SI, which recognized bilabial sounds, nasal sounds, consonant-vowel combinations, and the /iy/ and /uw/ vowels, and generated brain activity maps that offer clues for further understanding the neural mechanisms of speech imagery. Ramirez-Quintana et al. (2023) proposed a deep capsule network model called CapsVI, which achieved an average accuracy of 93.32% in pairwise classification. Table 7 presents applications of CapsNets and their variants.

Figure 11. Typical capsule neural network model.

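The core building block of a capsule network is the "squash" nonlinearity, which rescales each capsule's output vector so that its length lies below 1 while preserving its orientation. The sketch below shows only this building block as an assumed, simplified illustration, not the full CapsK-SI or CapsVI models cited above.

```python
# Minimal sketch of the capsule-network squash nonlinearity.
import torch

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Squash capsule vectors: output length = ||s||^2 / (1 + ||s||^2), same direction as s."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)

caps = torch.randn(4, 10, 16)        # 4 trials, 10 capsules, 16-dimensional pose vectors
v = squash(caps)
print(v.norm(dim=-1).max())          # all capsule lengths are bounded below 1
```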

Table 7. Application of capsule neural network in SI-BCI neural decoding.

A M P D S DE E
Macías-Macías et al. (2023) CapsK-SI Phonemes, words Kara One (Zhao & Rudzicz, 2015) 12 (8 m and 4 f/age mean 27.4) SynAmps RT 89.70–94.33
Ramirez-Quintana et al. (2023) CSP, CapsVI /a/, /u/, /no/ DaSalla et al. (2009) 3 (2 m and 1 f/age from 26–29) BioSemi ActiveTwo Average: 93.32
Highest: 94.68

Note:

A, authors; M, model; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, evaluation indicators (accuracy: %); m, males; f, females; CSP, common spatial pattern.

Research on other deep learning methods has also demonstrated their effectiveness for SI-BCI decoding. Saji et al. (2020) and Watanabe et al. (2020) employed DWT and speech amplitude envelope rhythm extraction, respectively, to extract EEG features and used different classifiers for speech imagery classification, demonstrating the potential of such models in different linguistic contexts. Islam & Shuvo (2019) proposed a deep learning-based method combining DenseNet with Gramian Angular Field representations and improved beta-band selection, significantly improving EEG signal classification accuracy. Panachakel & Ganesan (2021) used a transfer learning model based on ResNet50 to classify the ASU imagined speech EEG dataset, showcasing the effectiveness of transfer learning for EEG signal classification. Table 8 presents the applications of some other deep learning algorithms.
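As an illustration of the wavelet-based feature extraction used in several of these studies, the sketch below decomposes each EEG channel into DWT sub-bands with PyWavelets and summarizes them with simple statistics. The wavelet family, decomposition level, and statistics are illustrative assumptions rather than the settings of any one paper.

```python
# Hedged sketch of DWT-based feature extraction prior to classification.
import numpy as np
import pywt

def dwt_features(trial: np.ndarray, wavelet: str = "db4", level: int = 4) -> np.ndarray:
    """trial: (n_channels, n_samples) -> flat feature vector of per-sub-band statistics."""
    feats = []
    for channel in trial:
        coeffs = pywt.wavedec(channel, wavelet, level=level)   # [cA_n, cD_n, ..., cD_1]
        for c in coeffs:
            feats.extend([np.sqrt(np.mean(c ** 2)),            # RMS (energy)
                          np.mean(np.abs(c)),                  # mean absolute value
                          np.var(c)])                          # variance
    return np.asarray(feats)

trial = np.random.default_rng(0).standard_normal((32, 500))    # placeholder trial
print(dwt_features(trial).shape)     # 32 channels x 5 sub-bands x 3 statistics = (480,)
```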

Table 8. Application of other deep learning algorithms in SI-BCI neural decoding.

A M P D S DE E
Islam & Shuvo (2019) DenseNet, GAF Syllables Kara One (Zhao & Rudzicz, 2015) 12 (8 m and 4 f/age mean 27.4) SynAmps RT DenseNet: 90.68
Words GASF: 90.54
Saji et al. (2020) DWT, LDA/SVM/KNN /a/, /u/ Public dataset (ATR brainliner, 2015) 3 (8 m and 4 f/age from 26–29) N/A 72–80
Watanabe et al. (2020) EEGOSESM Japanese syllables [ba], [ba:] Private data 18 (12 m and 6 f/age mean 23.8) BrainAmp 54.7 (Perceived speech)
38.5 (Imagining speech)
Panachakel & Ganesan (2021) 3DCE, RTL Long words ASU (Nguyen, Karavas & Artemiadis, 2017) 15 (11 m and 4 f/age from 22–32) BrainProducts ActiCHamp amplifier H1 93.4
Short words H2 93.1
Vowels H3 79.7
Short long H4 95.5
Words
Panachakel & Ramakrishnan (2021) MPC, Shallow neural network Syllables Kara One (Zhao & Rudzicz, 2015) 12 (8 m and 4 f/age mean 27.4) SynAmps RT Average: 75
Naebi & Feng (2023) LSIM M, A Private data 3 (age from 30–40) SynAmps 55–98
NCS
LBGC
DFD
Pan et al. (2024) WST, KPCA, XGBoost /ka:m/,/kwest/, /piəriə/, /əpas/,/spei/ Private data 7 (age from 22–46) Emotiv EPOC Flex 78.73
Wu et al. (2024) PSD, LDA /fɔ/, /gi/ 1. Bhadra, Giraud & Marchesotti (2025) 1. 15 (10 m and 5 f/age from 19–29) ANT Neuro system 62.2
2. Private data 2. 20 (7 m and 13 f/age from 20–30)

Note:

A, authors; M, model; P, pronunciation materials; D, datasets; S, subjects (number); DE, device; E, evaluation indicators (accuracy: %); m, males; f, females; GAF, Gramian Angular Field; DWT, discrete wavelet transformation; EEGOSESM, EEG oscillation and speech envelope synchronization model; 3DCE, 3D compression encoding; RTL, ResNet50 transfer learning; MPC, mean phase coherence; LSIM, lip-sync imagery model; NCS, new combinations of signals; LBGC, linear bond graph classifier; DFD, deep formula detection; WST, wavelet scattering transform; KPCA, kernel principal component analysis; XGBoost, Extreme Gradient Boosting; PSD, power spectral density; LDA, linear discriminant analysis; H1, highest long words average accuracy; H2, highest short words average accuracy; H3, highest vowels average accuracy; H4, highest short long words average accuracy.

Algorithm comparison and evaluation

The compilation and analysis of existing SI-BCI neural decoding algorithms are summarized in Fig. 12. The following trends emerge from a quantitative analysis: machine learning methods predominantly utilize SVM, which demonstrates stable performance and high classification accuracy, making it the baseline method in this field and accounting for the majority of applications; among deep learning algorithms, CNNs and their variants have demonstrated exceptional performance.

Figure 12. Comparison of the number of machine learning algorithms and deep learning algorithms.


Model performance evaluation involves various metrics, such as accuracy, precision, recall, F1-score, confusion matrix, receiver operating characteristic (ROC) curves, and area under the curve (AUC). Each metric emphasizes different aspects of model performance, making the selection of appropriate evaluation criteria critical. In the field of SI-BCI, average accuracy is the primary performance metric for evaluating SI-BCI algorithms. Additionally, metrics such as recall, precision, and F1-score are often included to offer a more comprehensive assessment of model performance. Thus, accuracy is chosen as the main evaluation metric in this study to facilitate clear comparison of the relative performance of various models. For both binary and multi-class classification tasks, accuracy is defined as the proportion of correctly classified samples to the total number of samples. However, in multi-class classification, accuracy calculations must account for multiple categories, which makes comparisons between binary and multi-class tasks more nuanced. Table 9 presents the highest accuracy achieved by different models and methods.
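For reference, the sketch below computes the metrics mentioned above for a multi-class prediction using scikit-learn; the label vectors are toy placeholders (e.g., three imagined vowels), not results from any study in Table 9.

```python
# Minimal sketch of the evaluation metrics discussed above for a multi-class classifier.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])   # placeholder ground-truth labels
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1, 0, 2])   # placeholder model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))              # rows: true classes, columns: predicted classes
```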

Table 9. The optimal accuracy among various models.

Binary classification
B M E
CSP EMBCSP 89.42
SVM
CSP STWFBCSP 84.87
ELM
MFCCs SVM 88.3
SVM DTW
CSP WD 81.25–98.75
RVM LC
CNN MBBNPCNN 83.72
RNN Proto-Speech Kara One 99.89–99.92
CNN CapsNet-VI 93.32
CNN H1 93.4
CSP 3DCE H2 93.1
SVM RTL H3 79.7
H4 95.5

Note:

EMBCSP, evidential multi-band common spatial pattern; STWFBCSP, sub-time window filter bank common spatial pattern; DTW, dynamic time warping; WD, wavelet decomposition; LC, linear classifier; MBBNPCNN, parallel convolutional neural network based on multi-band brain networks; 3DCE, 3D compression encoding; RTL, ResNet50 transfer learning; CCF, cross-correlation function; B, baseline (model); M, methods (model); E, Evaluation indicators (accuracy: %); H1, highest long words average accuracy; H2, highest short words average accuracy; H3, highest vowels average accuracy; H4, highest short long words average accuracy; LFSICF, length first speech imagination classification framework; CM, covariance matrix; CNN, convolutional neural network; SVM, support vector machine; CSP, common spatial pattern, ELM, Extreme Learning Machine; MFCCs, Mel-frequency cepstral coefficients; RVM, relevance vector machine; RNN, recurrent neural network.

Challenges and prospects

A comprehensive literature review reveals that EEG-based encoding and decoding research for SI-BCI has made significant advancements in recent decades. Ongoing studies on brain region physiology, paradigm design, and decoding algorithm innovations are expected to gradually overcome challenges, such as unnatural SI-BCI brain-machine interactions. However, several unresolved issues remain and require further exploration. To guide future research on SI-BCI systems, we summarize the current challenges and propose potential research directions.

Difficulty in acquiring high-quality signals and the principles of brain region connectivity in speech processing

Challenges: Our analysis of Tables 1–8 indicates that most studies use EEG acquisition devices with 32 or more channels. However, signal quality is often influenced by participant characteristics, including age, gender, and handedness. Furthermore, prolonged experimental sessions may lead to participant fatigue, impairing attention and reaction speed, which increases EEG variability and noise artifacts (e.g., ocular and electromyographic artifacts) and ultimately reduces signal quality (Zhang, Li & Chen, 2020; Borirakarawin & Punsawad, 2023; Kim, Lee & Lee, 2020; Lee et al., 2021). Additionally, the experimental environment plays a crucial role: prolonged recording sessions increase the risk of environmental interference during signal acquisition. For instance, Jafferson et al. (2021) reported an 18% error rate in classification results due to electrical interference during data collection. In “SI-BCI Paradigm”, we reviewed recent advancements in brain connectivity studies. Broca’s and Wernicke’s areas are known to be essential in speech processing: Broca’s area is responsible for speech production and articulation, whereas Wernicke’s area supports speech comprehension. However, many existing studies collect EEG signals from the entire brain without focusing electrode placement on these critical speech-related regions (Wang et al., 2021; Sikdar, Roy & Mahadevappa, 2018; Zhao, Liu & Gao, 2021; Pan et al., 2024; Wang et al., 2020).

Future Directions: To advance the understanding of speech-related brain mechanisms, the following research directions are proposed: Interdisciplinary research–Strengthen collaboration with fields such as anatomy and neuroscience to conduct more precise structural and functional analyses of speech-related brain regions, particularly Broca’s and Wernicke’s areas. This will enable deeper investigation into the neural signal transmission and cortical activation mechanisms underlying speech intention, potentially uncovering functional micro-units within these key linguistic centers (Zhang, Guo & Chen, 2023). Optimization of EEG acquisition equipment–Design and refine more efficient and sensitive EEG acquisition devices capable of capturing a broader range of neural activity frequencies. Enhancing the precision of EEG recordings will improve the detection of subtle neural activities involved in speech generation. These research directions hold promise for revealing the high-level cognitive functions underlying speech generation and comprehension and advancing the development of high-sensitivity EEG systems for applications in neurolinguistics.

Innovation in SI-BCI paradigms: limited focus on words and sentences

Challenges: As summarized in “SI-BCI Neural Encoding” and Tables 1–8, vowel imagery has been extensively studied (Chengaiyan & Anandan, 2022; DaSalla et al., 2009; Cooney, Folli & Coyle, 2018; Wu & Chen, 2020; Agarwal et al., 2020), whereas research on short words, long words, and phrases, such as English words and Chinese characters, is relatively scarce (Alizadeh & Omranpour, 2023; Kamble, Ghare & Kumar, 2022; Sereshkeh et al., 2017; Panachakel & Ramakrishnan, 2022). This limitation may stem from the design of existing paradigms, as a single paradigm may not be sufficient to fully address complex speech imagery tasks.

Future Directions: To innovate SI-BCI paradigms, future research should focus on designing paradigms that incorporate word-, character-, phrase-, and sentence-level speech intention tasks. Additionally, integrating multimodal signals (e.g., electromyography (EMG), eye movement signals) and multimodal paradigms (e.g., motor imagery and external stimuli) could improve the naturalness and usability of speech intention paradigms for real-world applications. Wang et al. (2019, 2022) proposed a hybrid BCI concept by designing a three-class task system that combines motor imagery and speech imagery, leading to improved classification accuracy. This suggests that multimodal data fusion can significantly enhance the efficiency and precision of BCI systems. Tong et al. (2023) further explored multimodal speech imagery applications by integrating silent reading and handwriting imagery, which elicited stronger EEG features compared to traditional motor imagery paradigms, thereby improving classification accuracy. Naebi & Feng (2023) proposed a novel communication imagery model as an alternative to speech-dependent mental tasks. They introduced a lip imagery model, achieving classification results ranging from 55% to 98%, with the highest performance achieved through a novel signal combination approach. Notably, Silva et al. (2024) leveraged BCI and natural language processing (NLP) technologies to enable a patient to communicate in Spanish and English. This groundbreaking study may provide new insights into the development of bilingual BCIs.

Excessive signal preprocessing and lack of multimodal data fusion in SI-BCI neural encoding

Challenges: Some studies have applied autoencoder layers to handle incomplete and noisy EEG signals (Zhang et al., 2018; Ali, Mumtaz & Maqsood, 2023), while others have employed noise-assisted multivariate empirical mode decomposition (NA-MEMD) for signal processing (Park & Lee, 2023). However, these methods significantly increase preprocessing time, hindering real-time applications. He et al. (2023) highlighted that relying solely on EEG signals for unimodal data processing limits a model’s ability to capture the complexity of brain activity.

Future Directions: Beyond EEG, other physiological signals, such as fNIRS, electrooculography (EOG), and EMG, should be integrated into SI-BCI neural encoding for imagined speech decoding. Although most research focuses on single-modal signals, multimodal classification has emerged as a promising direction (Wang et al., 2023). Guo & Chen (2022) developed an fNIRS-based 4-class speech imagery BCI, achieving >70% mean subject-specific accuracy in binary-class settings (all pairwise comparisons between two vowels) across 0–2.5 s and 0–10 s time windows. Vorreuther et al. (2023) and Rezazadeh Sereshkeh et al. (2019) proposed multi-class hybrid fNIRS-EEG BCIs for imagined speech, demonstrating significantly higher accuracy than unimodal approaches. These findings highlight the complementary nature of diverse neural signals in enhancing BCI performance. Future studies should adopt mobile, wearable sensors to collect multimodal physiological data (facial expressions, EEG, etc.) and fuse them within a unified framework to improve online real-time feature extraction and imagined speech recognition accuracy.

Limited SI-BCI datasets and overreliance on binary classification tasks

Challenges: In Tables 1–8 and “Difficulty in Acquiring High-Quality Signals and the Principles of Brain Region Connectivity in Speech Processing”, we analyzed available datasets, revealing that most studies rely on private datasets with fewer than ten participants. This poses a significant challenge for neural decoding, as the lack of large datasets prevents deep learning models from leveraging large-scale data analysis, thus affecting model performance. Additionally, most studies still focus on binary classification tasks, limiting their applicability to real-world BCI applications. Furthermore, the absence of diverse evaluation metrics hinders comprehensive performance assessment.

Future Directions: In SI-BCI neural decoding, it is crucial to continue advancing deep learning algorithms, particularly CNN-based models, to enhance the capability of handling complex, nonlinear transformations. Expanding research from binary classification to multiclass classification is essential. For example, Hernandez-Galvan, Ramirez-Alonso & Ramirez-Quintana (2023) achieved high accuracy across multiple classes using two public datasets, demonstrating the feasibility of multiclass SI-BCI tasks. Current public datasets are predominantly vowel-based, such as Kara One and ASU. Therefore, developing high-quality word- and sentence-level public datasets is a critical direction for future SI-BCI advancements. A recent contribution in this area is ChineseEEG, a high-density EEG dataset compiled by Mou et al. (2024), containing approximately 13 h of Chinese text reading EEG recordings from 10 participants. This dataset fills a significant gap in Chinese language EEG datasets and represents a major step toward improving linguistic diversity in SI-BCI research.

Conclusions

In recent years, with the rapid advancement of brain-computer interface (BCI) technology, research on speech imagery decoding based on electroencephalography (EEG) has garnered significant attention. This article systematically reviews the latest progress in EEG-based speech imagery BCI decoding, focusing on key research findings and challenges in brain region connectivity, SI-BCI experimental paradigms, encoding techniques, and decoding algorithms. Through an analysis of existing studies, we suggest that future research should further emphasize brain region mechanisms, paradigm innovation, encoding principles, and decoding algorithms to advance the practical application of speech imagery BCI. With ongoing technological advancements, personalized and miniaturized speech intention BCI systems are expected to emerge. In the medical field, speech imagery BCI can assist aphasia patients in restoring communication abilities and improving their quality of life. In non-medical domains, speech imagery BCI can be applied to human-computer interaction and smart home systems, providing a more natural and efficient interaction modality. We hope this article will inspire further innovation and development in BCI paradigms and neural encoding-decoding research.

Supplemental Information

Supplemental Information 1. PRISMA_2020_checklist of Review.
peerj-cs-11-2938-s001.docx (271.3KB, docx)
DOI: 10.7717/peerj-cs.2938/supp-1
Supplemental Information 2. Summary of references.
DOI: 10.7717/peerj-cs.2938/supp-2

Funding Statement

This research was funded by the National Natural Science Foundation of China (No. 51405252), the China Postdoctoral Science Foundation of China, grant number (No. 2020M672101), the International Cooperation Fund of Qilu University of Technology of China, grant number (No. QLUTGJHZ2018022) and the Innovation Team Fund of Qilu University of Technology, and the Cooperative Innovation Fund of Qilu University of Technology of China (No. 2021CXY-04). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Ke Su conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Liang Tian conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

This is a literature review.

References

  • Aboussaleh et al. (2024).Aboussaleh I, Riffi J, Fazazy KE, Mahraz AM, Tairi H. STCPU-Net: advanced U-shaped deep learning architecture based on Swin transformers and capsule neural network for brain tumor segmentation. Neural Computing and Applications. 2024;36(30):18549–18565. doi: 10.1007/s00521-024-10144-y. [DOI] [Google Scholar]
  • Agarwal et al. (2020).Agarwal P, Kale RK, Kumar M, Kumar S. Silent speech classification based upon various feature extraction methods. 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN); Piscataway: IEEE; 2020. pp. 16–20. [DOI] [Google Scholar]
  • Ahn et al. (2022).Ahn HJ, Lee DH, Jeong JH, Lee SW. Multiscale convolutional transformer for EEG classification of mental imagery in different modalities. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2022;31:646–656. doi: 10.1109/TNSRE.2022.3229330. [DOI] [PubMed] [Google Scholar]
  • Ali, Mumtaz & Maqsood (2023).Ali S, Mumtaz W, Maqsood A. EEG based thought-to-text translation via deep learning. 2023 7th International Multi-Topic ICT Conference (IMTIC); Piscataway: IEEE; 2023. pp. 1–8. [DOI] [Google Scholar]
  • Alizadeh & Omranpour (2023).Alizadeh D, Omranpour H. EM-CSP: an efficient multiclass common spatial pattern feature method for speech imagery EEG signals recognition. Biomedical Signal Processing and Control. 2023;84(4):104933. doi: 10.1016/j.bspc.2023.104933. [DOI] [Google Scholar]
  • Alzahrani, Banjar & Mirza (2024).Alzahrani S, Banjar H, Mirza R. Systematic review of EEG-based imagined speech classification methods. Sensors. 2024;24(24):8168. doi: 10.3390/s24248168. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • ATR brainliner (2015).ATR brainliner A platform for sharing time-aligned brain activity and behavioral data. 2015. http://www.brainliner.jp/restrictedProject.atr [11 October 2015].
  • BCI Competition Committee (2022).BCI Competition Committee 2020 International BCI Competition. 2022. https://doi.org/10.17605/OSF.IO/PQ7VB
  • Belaoucha & Papadopoulo (2020).Belaoucha B, Papadopoulo T. Structural connectivity to reconstruct brain activation and effective connectivity between brain regions. Journal of Neural Engineering. 2020;17(3):035006. doi: 10.1088/1741-2552/ab8b2b. [DOI] [PubMed] [Google Scholar]
  • Beyeler et al. (2016).Beyeler A, Namburi P, Glober GF, Simonnet C, Calhoon GG, Conyers GF, Luck R, Wildes CP, Tye KM. Divergent routing of positive and negative information from the amygdala during memory retrieval. Neuron. 2016;90(2):348–361. doi: 10.1016/j.neuron.2016.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Bhadra, Giraud & Marchesotti (2025).Bhadra K, Giraud AL, Marchesotti S. Learning to operate an imagined speech Brain-Computer Interface involves the spatial and frequency tuning of neural activity. Communications Biology. 2025;8:271. doi: 10.1038/s42003-025-07464-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Bisla & Anand (2023).Bisla M, Anand RS. Analysis of imagined speech characteristics using phase-based connectivity measures. 2023 IEEE 13th International Conference on Control System, Computing and Engineering (ICCSCE); Piscataway: IEEE; 2023. pp. 68–73. [DOI] [Google Scholar]
  • Borirakarawin & Punsawad (2023).Borirakarawin M, Punsawad Y. Hybrid brain-computer interface system using auditory stimulation and speech imagination paradigms for assistive technology. IEEE Access. 2023;11:53079–53090. doi: 10.1109/ACCESS.2023.3280636. [DOI] [Google Scholar]
  • Cai et al. (2024).Cai G, Zhang F, Yang B, Huang S, Ma T. Manifold learning-based common spatial pattern for EEG signal classification. IEEE Journal of Biomedical and Health Informatics. 2024;28(4):1971–1981. doi: 10.1109/JBHI.2024.3357995. [DOI] [PubMed] [Google Scholar]
  • Chen & Pan (2020).Chen F, Pan C. A review on brain-computer interface technologies for speech imagery. Journal of Signal Processing. 2020;36(6):816–830. doi: 10.16798/j.issn.1003-0530.2020.06.003. [DOI] [Google Scholar]
  • Chengaiyan & Anandan (2022).Chengaiyan S, Anandan K. Effect of functional and effective brain connectivity in identifying vowels from articulation imagery procedures. Cognitive Processing. 2022;23(4):593–618. doi: 10.1007/s10339-022-01103-3. [DOI] [PubMed] [Google Scholar]
  • Chengaiyan & Anandhan (2015).Chengaiyan S, Anandhan K. Analysis of speech imagery using functional and effective EEG based brain connectivity parameters. International Journal of Cognitive Informatics and Natural Intelligence. 2015;9(4):33–48. doi: 10.4018/IJCINI.2015100103. [DOI] [Google Scholar]
  • Chengaiyan et al. (2018).Chengaiyan S, Balathayil D, Anandan K, Thomas CB. Effect of power and phase synchronization in multi-trial speech imagery. International Journal of Software Science and Computational Intelligence. 2018;10(4):44–61. doi: 10.4018/IJSSCI.2018100104. [DOI] [Google Scholar]
  • Chengaiyan, Retnapandian & Anandan (2020).Chengaiyan S, Retnapandian AS, Anandan K. Identification of vowels in consonant-vowel–consonant words from speech imagery based EEG signals. Cognitive Neurodynamics. 2020;14:1–19. doi: 10.1007/s11571-019-09558-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Cooney, Folli & Coyle (2018).Cooney C, Folli R, Coyle D. Neurolinguistics research advancing development of a direct-speech brain-computer interface. iScience. 2018;8:103–125. doi: 10.1016/j.isci.2018.09.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Cooney, Folli & Coyle (2018).Cooney C, Folli R, Coyle D. Mel frequency cepstral coefficients enhance imagined speech decoding accuracy from EEG. 2018 29th Irish Signals and Systems Conference (ISSC); Piscataway: IEEE; 2018. pp. 1–7. [DOI] [Google Scholar]
  • Cui et al. (2023).Cui W, Wang X, Li M, Pun SH, Chen F. A study of deep learning based classification of mandarin vowels using spoken speech EEG signals. 2023 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC); Piscataway: IEEE; 2023. pp. 1–5. [DOI] [Google Scholar]
  • DaSalla et al. (2009).DaSalla CS, Kambara H, Sato M, Koike Y. Single-trial classification of vowel speech imagery using common spatial patterns. Neural Networks. 2009;22(9):1334–1339. doi: 10.1016/j.neunet.2009.05.008. [DOI] [PubMed] [Google Scholar]
  • Gomez et al. (2017).Gomez AN, Ren M, Urtasun R, Grosse RB. The reversible residual network: backpropagation without storing activations. Advances in neural information processing systems, 30; Red Hook: Curran Associates, Inc; 2017. [Google Scholar]
  • Guo & Chen (2022).Guo Z, Chen F. Idle-state detection in motor imagery of articulation using early information: a functional Near-infrared spectroscopy study. Biomedical Signal Processing and Control. 2022;72(6):103369. doi: 10.1016/j.bspc.2021.103369. [DOI] [Google Scholar]
  • He et al. (2019).He B, Astolfi L, Valdés-Sosa PA, Marinazzo D, Palva SO, Bénar CG, Michel CM, Koenig T. Electrophysiological brain connectivity: theory and implementation. IEEE Transactions on Biomedical Engineering. 2019;66(7):2115–2137. doi: 10.1109/TBME.2019.2913928. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • He et al. (2023).He M, Wu Y, Li Z, Wang S, Li W, Zhou W, Rong H, Wang J. HMT: an EEG signal classification method based on CNN architecture. 2023 5th International Conference on Intelligent Control, Measurement and Signal Processing (ICMSP); Piscataway: IEEE; 2023. pp. 1015–1018. [DOI] [Google Scholar]
  • Hernandez-Galvan, Ramirez-Alonso & Ramirez-Quintana (2023).Hernandez-Galvan A, Ramirez-Alonso G, Ramirez-Quintana J. A prototypical network for few-shot recognition of speech imagery data. Biomedical Signal Processing and Control. 2023;86:105154. doi: 10.1016/j.bspc.2023.105154. [DOI] [Google Scholar]
  • Huang et al. (2022).Huang X, Wang L, Ren L, Zhan Q. EEG signals recognition of active brain-computer interface based on sequential encoding. 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP); Piscataway: IEEE; 2022. pp. 1821–1826. [DOI] [Google Scholar]
  • Hurlburt et al. (2016).Hurlburt RT, Alderson-Day B, Kühn S, Fernyhough C. Exploring the ecological validity of thinking on demand: neural correlates of elicited vs. spontaneously occurring inner speech. PLOS ONE. 2016;11(2):e0147932. doi: 10.1371/journal.pone.0147932. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Idrees & Farooq (2016a).Idrees BM, Farooq O. EEG based vowel classification during speech imagery. 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom); Piscataway: IEEE; 2016a. pp. 1130–1134. [Google Scholar]
  • Idrees & Farooq (2016b).Idrees BM, Farooq O. Vowel classification using wavelet decomposition during speech imagery. 2016 3rd International Conference on Signal Processing and Integrated Networks (SPIN); Piscataway: IEEE; 2016b. pp. 636–640. [DOI] [Google Scholar]
  • Islam & Shuvo (2019).Islam MM, Shuvo MMH. DenseNet based speech imagery EEG signal classification using Gramian Angular Field. 2019 5th International Conference on Advances in Electrical Engineering (ICAEE); Piscataway: IEEE; 2019. pp. 149–154. [DOI] [Google Scholar]
  • Jafferson et al. (2021).Jafferson AJ, Ponnusamy V, Jovic J, Trajanovic M. An IoT based cloud EEG signal analytic framework for thought to text mapping. IEIE Transactions on Smart Processing & Computing. 2021;10(3):183–188. doi: 10.5573/IEIESPC.2021.10.3.183. [DOI] [Google Scholar]
  • Jahangiri & Sepulveda (2017).Jahangiri A, Sepulveda F. The contribution of different frequency bands in class separability of covert speech tasks for BCIs. 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); IEEE; 2017. pp. 2093–2096. [DOI] [PubMed] [Google Scholar]
  • Jeong et al. (2022).Jeong JH, Cho JH, Lee BH, Lee SW. Real-time deep neurolinguistic learning enhances noninvasive neural language decoding for brain-machine interaction. IEEE Transactions on Cybernetics. 2022;53(12):7469–7482. doi: 10.1109/TCYB.2022.3211694. [DOI] [PubMed] [Google Scholar]
  • Ji et al. (2024).Ji H, Liu X, Zhang J, Liu L. Spatial localization of a transformer robot based on ultrasonic signal wavelet decomposition and PHAT-β-γ generalized cross correlation. Sensors. 2024;24(5):1440. doi: 10.3390/s24051440. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Jäncke, Liem & Merillat (2021).Jäncke L, Liem F, Merillat S. Are language skills related to structural features in Broca’s and Wernicke’s area? European Journal of Neuroscience. 2021;53(4):1124–1135. doi: 10.1111/ejn.15038. [DOI] [PubMed] [Google Scholar]
  • Kalaganis et al. (2023).Kalaganis FP, Georgiadis K, Oikonomou VP, Nikolopoulos S, Laskaris NA, Kompatsiaris I. Exploiting approximate joint diagonalization for covariance estimation in imagined speech decoding. International Conference on Brain Informatics; Cham: Springer Nature Switzerland; 2023. pp. 409–419. [DOI] [Google Scholar]
  • Kamble, Ghare & Kumar (2022).Kamble A, Ghare PH, Kumar V. Deep-learning-based BCI for automatic imagined speech recognition using SPWVD. IEEE Transactions on Instrumentation and Measurement. 2022;72(2):4001110. doi: 10.1109/TIM.2022.3216673. [DOI] [Google Scholar]
  • Kaongoen et al. (2023).Kaongoen N, Choi J, Choi JW, Kwon H, Hwang C, Hwang G, Kim BH, Jo S. The future of wearable EEG: a review of ear-EEG technology and its applications. Journal of Neural Engineering. 2023;20(5):051002. doi: 10.1088/1741-2552/acfcda. [DOI] [PubMed] [Google Scholar]
  • Kaongoen, Choi & Jo (2021).Kaongoen N, Choi J, Jo S. Speech-imagery-based brain-computer interface system using ear-EEG. Journal of Neural Engineering. 2021;18(1):016023. doi: 10.1088/1741-2552/abd10e. [DOI] [PubMed] [Google Scholar]
  • Khatti & Grover (2024).Khatti J, Grover KS. Prediction of uniaxial strength of rocks using relevance vector machine improved with dual kernels and metaheuristic algorithms. Rock Mechanics and Rock Engineering. 2024;57:6227–6258. doi: 10.1007/s00603-024-03849-y. [DOI] [Google Scholar]
  • Kim, Lee & Lee (2020).Kim HJ, Lee MH, Lee M. A BCI based smart home system combined with event-related potentials and speech imagery task. 2020 8th International Winter Conference on Brain-Computer Interface (BCI); Piscataway: IEEE; 2020. pp. 1–6. [DOI] [Google Scholar]
  • Kristensen, Subhi & Puthusserypady (2020).Kristensen AB, Subhi Y, Puthusserypady S. Vocal imagery vs intention: viability of vocal-based EEG-BCI paradigms. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2020;28(8):1750–1759. doi: 10.1109/TNSRE.2020.3004924. [DOI] [PubMed] [Google Scholar]
  • Kwon et al. (2023).Kwon H, Choi J, Choi JW, Jo S. Subject-aware user state classification with deep learning models: an exploratory study. 2023 11th International Winter Conference on Brain-Computer Interface (BCI); Piscataway: IEEE; 2023. pp. 1–4. [DOI] [Google Scholar]
  • Lee et al. (2021).Lee DH, Jeong JH, Ahn HJ, Lee SW. Design of an EEG-based drone swarm control system using endogenous BCI paradigms. 2021 9th International Winter Conference on Brain-Computer Interface (BCI); Piscataway: IEEE; 2021. pp. 1–5. [DOI] [Google Scholar]
  • Lee et al. (2021).Lee BH, Kwon BH, Lee DY, Jeong JH. Speech imagery classification using length-wise training based on deep learning. 2021 9th International Winter Conference on Brain-Computer Interface (BCI); Piscataway: IEEE; 2021. pp. 1–5. [DOI] [Google Scholar]
  • Lee, Lee & Lee (2020).Lee SH, Lee M, Lee SW. Neural decoding of imagined speech and visual imagery as intuitive paradigms for BCI communication. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2020;28(12):2647–2659. doi: 10.1109/TNSRE.2020.3040289. [DOI] [PubMed] [Google Scholar]
  • Li & Chen (2020).Li H, Chen F. Classify imaginary mandarin tones with cortical EEG signals. INTERSPEECH; 2020. pp. 4896–4900. [DOI] [Google Scholar]
  • Li, Pun & Chen (2021).Li M, Pun SH, Chen F. A preliminary study of classifying spoken vowels with EEG signals. 2021 10th International IEEE/EMBS Conference on Neural Engineering (NER); Piscataway: IEEE; 2021. pp. 13–16. [DOI] [Google Scholar]
  • Li et al. (2024).Li H, Qiao R, Yu P, Li H, Tan M. CTHD-Net: CNN-transformer hybrid dehazing network via residual global attention and gated boosting strategy. Journal of Visual Communication and Image Representation. 2024;99:104066. doi: 10.1016/j.jvcir.2024.104066. [DOI] [Google Scholar]
  • Liang et al. (2023).Liang Y, Tan J, Xie Z, Chen Z, Lin D, Yang Z. Research on convolutional neural network inference acceleration and performance optimization for edge intelligence. Sensors. 2023;24(1):240. doi: 10.3390/s24010240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Liu et al. (2022).Liu Y, Gong A, Ding P, Zhao L, Qian Q, Zhou J, Su L, Fu Y. Key technology of brain-computer interaction based on speech imagery. Journal of Biomedical Engineering. 2022;39(3):596–611. doi: 10.7507/1001-5515.202107018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Lopez-Bernal et al. (2022).Lopez-Bernal D, Balderas D, Ponce P, Molina A. A state-of-the-art review of EEG-based imagined speech decoding. Frontiers in Human Neuroscience. 2022;16:867281. doi: 10.3389/fnhum.2022.867281. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Lu et al. (2023).Lu L, Han M, Zou G, Zheng L, Gao JH. Common and distinct neural representations of imagined and perceived speech. Cerebral Cortex. 2023;33(10):6486–6493. doi: 10.1093/cercor/bhac519. [DOI] [PubMed] [Google Scholar]
  • Luo et al. (2024).Luo H, Guo Z, Chen W, Gao X, Fan J. Structural health assessment of historical buildings with nonuniform FE analysis and modified covariance matrix adaptation evaluation strategy (CMAES) updating. Automation in Construction. 2024;166:105618. doi: 10.1016/j.autcon.2024.105618. [DOI] [Google Scholar]
  • Macías-Macías et al. (2023).Macías-Macías JM, Ramírez-Quintana JA, Chacón-Murguía MI, Torres-García AA, Corral-Martínez LF. Interpretation of a deep analysis of speech imagery features extracted by a capsule neural network. Computers in Biology and Medicine. 2023;159(17):106909. doi: 10.1016/j.compbiomed.2023.106909. [DOI] [PubMed] [Google Scholar]
  • Macías-Macías et al. (2020).Macías-Macías JM, Ramírez-Quintana JA, Ramírez-Alonso G, Chacón-Murguía MI. Deep learning networks for vowel speech imagery. 2020 17th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE); Piscataway: IEEE; 2020. pp. 1–6. [DOI] [Google Scholar]
  • Mahapatra & Bhuyan (2022).Mahapatra NC, Bhuyan P. Multiclass classification of imagined speech vowels and words of electroencephalography signals using deep learning. Advances in Human-Computer Interaction. 2022;2022:1374880. doi: 10.1155/2022/1374880. [DOI] [Google Scholar]
  • Martin et al. (2016).Martin S, Brunner P, Iturrate I, Millán JDR, Schalk G, Knight RT, Pasley BN. Word pair classification during imagined speech using direct brain recordings. Scientific Reports. 2016;6:25803. doi: 10.1038/srep25803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Min et al. (2016).Min B, Kim J, Park HJ, Lee B. Vowel imagery decoding toward silent speech BCI using extreme learning machine with electroencephalogram. BioMed Research International. 2016;2016:2618265. doi: 10.1155/2016/2618265. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Moattari, Parnianpour & Moradi (2017).Moattari M, Parnianpour P, Moradi MH. Independent component analysis approach using higher orders of Non-Gaussianity. 2017 Iranian Conference on Electrical Engineering (ICEE); Piscataway: IEEE; 2017. pp. 49–54. [DOI] [Google Scholar]
  • Mou et al. (2024).Mou X, He C, Tan L, Yu J, Liang H, Zhang J, Tian Y, Yang Y, Xu T, Wang Q, Cao M, Chen Z, Hu C, Wang X, Liu Q, Wu H. ChineseEEG: a Chinese linguistic corpora EEG dataset for semantic alignment and neural decoding. Scientific Data. 2024;11:550. doi: 10.1038/s41597-024-03398-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Naebi & Feng (2023).Naebi A, Feng Z. The performance of a Lip-Sync imagery model, new combinations of signals, a supplemental bond graph classifier, and deep formula detection as an extraction and root classifier for electroencephalograms and brain-computer interfaces. Applied Sciences. 2023;13(21):11787. doi: 10.3390/app132111787. [DOI] [Google Scholar]
  • Nguyen, Karavas & Artemiadis (2017).Nguyen CH, Karavas GK, Artemiadis P. Inferring imagined speech using EEG signals: a new approach using Riemannian manifold features. Journal of Neural Engineering. 2017;15(1):016002. doi: 10.1088/1741-2552/aa8235. [DOI] [PubMed] [Google Scholar]


Supplementary Materials

Supplemental Information 1. PRISMA 2020 checklist of the review.
peerj-cs-11-2938-s001.docx (271.3KB, docx)
DOI: 10.7717/peerj-cs.2938/supp-1
Supplemental Information 2. Summary of references.
DOI: 10.7717/peerj-cs.2938/supp-2

Data Availability Statement

The following information was supplied regarding data availability:

This is a literature review.

