Abstract
In Human-Robot Interaction, speech is one of the most intuitive and effective communication channels. In Industry 4.0, speech-based communication can significantly enhance productivity and efficiency on production lines. However, deploying a Speech Command Recognition Module in real-world industrial settings poses challenges, as the system must balance two conflicting objectives: accurately recognizing commands while rejecting noise and irrelevant speech. To address this, we propose a modular framework designed to optimize recognition accuracy and rejection robustness while minimizing the need for extensive industrial dataset collection. The framework features an efficient Command Recognition module trained on laboratory-collected data augmented with synthetic samples. Advanced context-aware data augmentation techniques and dynamic noise injection further enhance the model’s robustness. To improve reliability in noisy environments, a Keyword Spotting module is introduced, activating the recognition system only when a predefined keyword is detected. The proposed system was evaluated using real-world samples collected in a noisy industrial setting. The results demonstrated a high recall rate for both command recognition and noise rejection, confirming the system’s effectiveness in meeting the demands of industrial applications.
Subject terms: Computer science, Information technology
Introduction
The Industry 4.01 revolution has transformed the interaction between robots and human operators in industrial environments. Previously, robots were employed in isolated workspaces, performing repetitive tasks without interacting with humans. Today, robots are expected to work alongside human operators, engaging in cooperation and collaboration. A crucial aspect of this collaboration is the communication between the entities involved. Indeed, through speech-based communication, the human operator can ask the robot to bring a specific tool or perform a specific task in a natural way, improving the effectiveness of the task.
In this context, the robot has to understand the human intent (e.g., whether the operator is asking the robot to bring a tool or to take a tool) and identify the specific entities involved (e.g., which tool is being referred to), starting from a high-level input, namely the human voice, which is an audio signal. This challenge is known in the literature as Spoken Language Understanding (SLU)2. To solve this problem, it is possible to adopt either a conventional SLU system based on a Dual-Stage Architecture3 or an End-to-End (E2E) approach4. The Dual-Stage Architecture is characterized by the concatenation of two well-known modules: (a) the Automatic Speech Recognition (ASR) module, which analyzes the audio sample to generate the corresponding textual transcript5,6; (b) the Natural Language Understanding (NLU) module7, which takes the textual transcript as input and identifies the intent of the user8. Although these systems power well-known products such as Alexa, Google Assistant, and Siri, allowing them to handle scenarios where sentences are not known a priori, in industrial (and thus noisy) environments they exhibit an error rate ranging from 20% to 30%9. Furthermore, such systems are cloud-based and require an internet connection, which can introduce delays at run time and is not always available in industrial settings10, also due to internal cybersecurity policies. In contrast, a speech command classifier for industrial environments needs to operate in real time on embedded devices mounted on board the robot, with low computational resources and a short response time to enhance human-robot interaction. Furthermore, to achieve a good level of generalization, the Dual-Stage Architecture requires a considerable amount of data and time for training11. Specifically, the ASR module is trained to recognize most of the words in the human vocabulary. The effort to train these kinds of systems may be wasteful in contexts where the set of commands is fixed and predefined. In general, command sets for this kind of problem comprise approximately 30 commands, as seen in datasets such as the Google Speech Commands dataset12 and the MIVIA-ISC dataset13, which consists of 31 commands. In these cases, it is preferable to adopt approaches that perform direct classification over the command set using the raw audio signal as input, without relying on intermediate data representations such as text transcriptions, as Dual-Stage methods do. Methods that operate in this manner are known as E2E approaches. E2E systems map high-level inputs, such as raw audio signals14 or time-frequency representations (e.g., Mel-Spectrograms15), directly to the corresponding intents. These systems offer several key advantages. First, they require less data for training and consequently fewer model parameters. Additionally, their single-module architecture simplifies the learning process and eliminates error propagation14, where errors in early stages adversely affect downstream components. Moreover, E2E systems impose a low computational burden, making them suitable for real-time command recognition on edge devices used with robots. Given these benefits, E2E methods are particularly well-suited for industrial applications, where their efficiency enables simple integration with robots equipped with resource-constrained embedded devices16.
In this paper we propose an E2E system able to work in real time, in the wild, in a real industrial scenario, with the possibility to run directly on board the robot. In the aforementioned works, tests were mainly conducted with the user close to and facing the robot, ensuring clear speech acquisition by the microphone. However, during work operations, the operator might be far from the robot or talking with other colleagues, not directly facing it. These factors can cause the system to either reject a valid command if the agent is too far from the robot or misclassify a command due to environmental noise. Consequently, the system is likely to confuse noise or normal speech with commands, potentially causing the robot to perform unexpected actions. This can significantly impact assembly-line productivity and introduce safety problems for both human operators and robots. Therefore, we address these context-related problems by introducing methodological design choices specifically intended to face the following challenges that may impact system performance: (i) ambient noise; (ii) dynamic distance between the human operator and the robot; (iii) false positives due to the fact that the system is always active. To address (i) and (ii), our effort has focused on improving the accuracy of command recognition; basically, our aim is to reduce the number of false positives generated by the misclassification of one command as another. To achieve this objective, we performed the following incremental steps:
We collected a command-speech dataset in a controlled laboratory environment, where a large amount of data can easily be collected; then, we augmented these noise-free samples with samples containing industrial noise (such as screwdrivers) and generic noise (such as people talking in the background), leveraging a Curriculum Learning17 procedure to dynamically adapt the Signal-to-Noise Ratio (SNR) during the learning procedure. This protocol differs from traditional training methods by introducing a progressive learning approach. Initially, the model is trained on “easy” samples, such as noise-free audio signals. As training progresses, the model is gradually exposed to more challenging samples, with increasing levels of noise in the audio signals.
We explored the use of novel synthetic data generation methods to increase the diversity of training samples without requiring costly and time-consuming dataset collection campaigns. To generate synthetic samples, we used both well-defined state-of-the-art methods such as spectrogram generators18, which generate spectrograms from text, and vocoders19, which generate audio tracks from spectrograms. We also used a novel approach based on voice cloning20.
We added context-inspired data augmentation procedures to simulate phenomena observed during on-the-field tests: gain adjustment, to simulate different distances of the human operator when giving a command, and time masking, to simulate commands given with pauses.
At the conclusion of this iteration, the model was evaluated on a dataset of commands collected in real-world conditions within a noisy industrial environment, specifically at the Centro Ricerche Fiat (CRF) in Melfi, Italy. The results demonstrate that the proposed system achieves a command recall exceeding 90%. However, we observed that, while the system’s ability to recognize speech commands improved, its effectiveness in rejecting non-command samples declined, resulting in instances where non-command utterances were mistakenly classified as valid commands. To solve this problem, we made the second design choice: the addition of a KeyWord Spotting (KWS) module on top of the E2E speech command classification module. KWS is a speech recognition technology that enables the identification of specific keywords within an audio stream. This technology is commonly used in applications such as virtual assistants, voice commands, or voice search systems, where the system is designed to activate or respond to a particular word (e.g., “Ok Robot”). KWS is characterized by its ability to analyze audio in real time and recognize predefined phrases or words without the need for full language comprehension. In our application we use KWS as an initial filter, rejecting both noise and speech that is not of interest, and allowing the command classifier to be activated only when explicitly required by users. The KWS satisfies real-time constraints and is used to detect a specific set of keywords. By combining the KWS classifier with the rejection capabilities of the speech command module, we achieve a substantial reduction in false alarms and an improvement in the overall performance of the system, obtaining a system that can be effectively used in a real-world industrial environment.
In conclusion, the contributions of this paper are as follows:
Speech-Command Recognition Architecture: We propose an E2E architecture designed for real-world dynamic industrial environments, satisfying real-time constraints and suitable for deployment on embedded systems integrated into robots.
Speech-Command Dataset: We introduce a two-part dataset. The first part comprises noise-free samples collected in a controlled environment for training, while the second part consists of an industrially collected dataset that reflects real-world usage scenarios, used as the test set.
Comprehensive Design Analysis: We perform an extensive analysis of the design choices proposed to improve the overall performance of our system. These include context-aware data augmentation techniques, synthetically generated audio samples, and the application of the curriculum learning technique at training time. The results showed that the proposed design choices led to a classifier capable of achieving an overall F1-score of 95.8% on command samples; however, the system classified a relatively high number of rejection samples as commands. To address this issue, we added a keyword spotting module, which reduced these false positives from 1,959 to 176. The final system is now reliable enough for real-world applications, as it can effectively recognize relevant commands while filtering out noise and chit-chat.
Related works
In this section, we will review prior research in the domain of Speech Command Recognition. We will begin by outlining the most widely adopted architectures within the broader scope of Spoken Language Understanding (SLU). Subsequently, we will focus on the literature specifically addressing the problem tackled in this work. Finally, we will discuss the various synthetic data generation techniques that have been presented in the literature.
Spoken language understanding and recognition
When we solve the Spoken Language Understanding problem, we are addressing the challenge of developing a parameterized function $f_\theta$, which takes as input a digital audio signal $x \in \mathbb{R}^{T \times C}$, where T is the length of the audio sample and C is the number of channels, and provides as output a vector of binary labels $y = (y_1, \dots, y_N)$, where $y_i$ indicates the probability that the intent $i$ is encoded in the input signal8:

$$y = f_\theta(x) \tag{1}$$
An example of the input might represent the utterance “I like to watch action movies”, with the corresponding intent WatchMovie. The most common way to obtain the function $f_\theta$ is through the use of learning techniques that, starting from a dataset composed of M samples, solve a supervised learning problem. From the literature analysis, we identified two main high-level approaches to solve this problem: Dual-Stage and End-to-End. These approaches are described in the next subsections.
In this context, the Speech Command Recognition is a specific instance of the Speech Language Understanding problem, where the intent expressed by a human corresponds to a command, i.e., a short phrase describing an action that an agent must perform (e.g., “Bring me the elbow screwdriver”). This problem is particularly relevant in industrial environments, where the goal is to enable robotic platforms to execute requested tasks based on spoken instructions. However, several challenges arise in such scenarios:
Rejection: The system must be capable of distinguishing between valid commands and non-command speech, such as casual conversations or chit-chat.
Environmental Noise: High levels of background noise in industrial settings can make commands difficult to understand.
Limited Data: The scarcity of labeled data makes it challenging to build representative datasets, complicating the learning process.
Dual-stage approach
The Dual-Stage Approach represents one of the first attempts to solve the SLU problem. The core of this approach relies on the combination of two well-known and reliable solutions, the ASR and the NLU. The ASR is responsible for converting the audio signal into a textual representation, which is fed to the NLU module that performs intent classification. This kind of approach has been widely used in different prior works21,22. However, several drawbacks make it unsuitable in our context of interest. The two modules of these architectures must be trained separately using data from different domains, as they address different problems. The ASR module requires audio and corresponding text transcriptions for training, while the NLU module requires text transcriptions and associated intents. Therefore, the overall system needs a substantial amount of data for training. Furthermore, since the two modules are used in sequence, errors generated by the ASR module can negatively impact the performance of the NLU module. To develop an effective system, significant effort is needed to train a high-performance ASR module that can operate in noisy environments, along with a robust NLU module capable of accurately analyzing transcriptions even in the presence of typos.
End-to-end approach
In End-to-End SLU, a single trainable model maps the speech audio directly to the speaker intent without explicitly producing a text transcript. In the literature, this architecture can be implemented in several ways, with the most popular approaches based on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). In particular, Lugosch et al.14 use a SincNet module23 to process raw audio, followed by an RNN module to perform intent classification. Other common approaches in the literature employ CNNs15 to process speech inputs such as spectrograms or Mel-Frequency Cepstral Coefficients (MFCC). In this context, the End-to-End architecture has several benefits. First, the use of audio samples avoids downstream errors due to incorrect transcripts. Since this architecture consists of a single module that maps the audio input directly to intent, it does not suffer from the problem of error propagation. Moreover, the direct use of audio samples allows the model to extract additional information present in the speech signal but not in the transcript, such as prosody. However, a limitation of this model concerns its application domain. Since it cannot learn from text data, new audio data must be recorded to train the model for every new SLU domain or application. Additionally, the dataset must contain all the speech commands that the model needs to learn, and samples must be collected from different speakers, making it more challenging to create datasets for each new domain compared to previous solutions. Nevertheless, this configuration is able to achieve high performance when the command set is well-defined.
Large language models approach
Recently, Transformers24 have demonstrated their effectiveness across various domains, tackling tasks such as machine translation, document generation, and syntactic parsing. Among the most notable architectures based on this technology, we can mention the GPT model25, the first large-scale transformer network. GPT is pretrained in an unsupervised manner and fine-tuned for specific problems, achieving state-of-the-art results on NLP benchmarks. Building on similar principles, BERT26 was subsequently introduced, becoming a milestone in NLP research alongside GPT.
BERT’s success can be attributed to two key innovations. First, it leverages the idea of pretraining a transformer model on a massive corpus of text, followed by task-specific fine-tuning. Second, it employs a bidirectional transformer architecture by stacking encoders, enabling it to capture context from both directions. While BERT and GPT were designed for Natural Language Processing tasks, speech data contains additional layers of information, such as speaker identity, emotion, hesitation, and interruptions.
To address the challenge of modeling this rich lexical and non-lexical information, HuBERT27 was introduced. HuBERT builds on BERT pretraining methodology but is adapted for self-supervised speech representation learning. Like BERT, HuBERT can be pretrained on large quantities of unlabeled audio data and fine-tuned on domain-specific labeled datasets. According to the SUPERB28 benchmark, models like HuBERT excel in various spoken language understanding (SLU) tasks. Notably, HuBERT achieves an impressive 98% accuracy in intent classification tasks.
However, these models come with a significant challenge: they need an ASR module to obtain the textual representation of the audio signal, thus inheriting all the problems described before.
Synthetic data generation
A speech-command recognition model is an application-specific system that needs to be trained with a particular set of commands. To achieve this, training, validation, and test samples must be collected. However, gathering a representative dataset that encompasses various situations (e.g., different speaker tones, varying background noises, and diverse rejection samples) is both time-consuming and expensive.
In the early days of deep learning, researchers introduced the concept of fine-tuning to improve sample efficiency in application-specific training29. This approach involves starting with a model that has been trained on a large dataset representing a similar scenario. However, when the application domains differ significantly, and application-specific data are limited, the resulting model may face challenges with generalization and robustness, particularly in handling noise and rejection.
To add variability to the data and improve robustness, data augmentation techniques can be used, as explained in previous sections. However, even with these techniques, the added variability is limited and does not fully address generalization issues related to different vocal tones, such as the variations between male and female voices. Therefore, it is necessary to generate completely novel samples in the most efficient and effective way.
In recent years, advancements in deep generative models have enabled the use of these systems to generate novel samples for datasets, including images, text, and audio30. Leveraging these techniques to increase dataset size and add meaningful variability is a reasonable approach. Indeed, we can find in the literature different works that use synthetic data to improve the performance of their methods. Antoniou et al.31 utilized a generative approach based on generative adversarial networks (GANs)32 to address the limitations of classic data augmentation techniques on images, generating synthetic data for training deep neural networks to solve various tasks. Lugosch et al.33 employed Text-to-Speech (TTS) engines to create synthetic samples of the command set from the Fluent Speech Command Dataset14. They used both real and synthetic data to train a deep learning model for intent recognition and slot filling tasks, achieving performance improvements compared to experiments without synthetic data. Considering these findings, TTS services can be an optimal solution for generating a large amount of data with minimal effort, as only a text transcription is required to generate samples. State-of-the-art approaches typically use two-stage architectures: a spectrogram generator and a vocoder. The spectrogram generator takes text input and produces a Mel-spectrogram representation, while the vocoder converts the spectrogram into a waveform. Specifically, the Tacotron 2 model34 employs convolutional and LSTM layers to generate Mel-spectrogram frames from the transcription of a sample. The waveform is then generated using Wavenet, a generative raw audio waveform network trained to capture the characteristics of various speakers, demonstrating impressive results. In recent years, with the advent of transformers, many TTS methods have been developed based on this approach. Wang et al.35 employed a zero-shot approach, training a transformer-based neural network to address various speech synthesis tasks, such as zero-shot TTS, speech editing, and content creation. The network takes as input the target text for generating the audio waveform and requires only 3 seconds of an audio sample of the target speaker. Notably, this approach can generate any audio starting from an audio sample of an unseen speaker and the corresponding text. Consequently, it is possible to generate a vast amount of high-quality synthetic data without the need to retrain any models.
Proposed method
In this section, we describe the methods proposed in this paper to address the problem of classifying commands belonging to one of the N classes and the approaches used to achieve robustness in real-world industrial scenarios, including rejection strategies and false positive reduction.
Overall architecture
An overview of the proposed architecture is depicted in Figure 1. As we can see from the figure, it includes a microphone for collecting signals from the environment (namely a Respeaker Mic Array V2) and a filtering component that uses the microphone’s built-in Voice Activity Detection (VAD) algorithm (Signal filtering) to discard audio frames without user voice activity. Additionally, it exploits a KeyWord Spotting (KWS) module that determines whether the user utterance is a keyword, and a command recognition module that classifies speech commands. Basically, in the operating phase, the human operator has to first activate the command recognition system by saying the keyword. The speech signal is initially filtered by the VAD module, which discards non-speech chunks. Once the keyword is detected, the user can issue a command, which is again filtered by the VAD module before being processed by the command recognition module. The user has 15 seconds to provide the command and, if necessary, repeat the command in case of rejection. If the command is not recognized within this time frame, the command recognition module is set to idle again, and the user needs to pronounce the keyword once more. This countdown mechanism has been added to minimize delays and prevent the user from having to repeatedly say the keyword in case of rejection, thus ensuring a smoother interaction. If the command is correctly detected, the system sends it to the actuators, and the command recognition module returns to idle.
Fig. 1.
Overview of the proposed system. The user says the keyword, the microphone’s VAD discards non-speech frames, and the filtered audio signal is given as input to the KWS. If a keyword is detected, the user can then issue a command. The signal is filtered once more by the VAD and then given as input to the command recognition module, which will either accept or reject the command.
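The interaction flow described above can be summarized as a small state machine. The following Python sketch is purely illustrative of that logic and is not the deployed ROS implementation; the `vad`, `kws`, `recognizer`, and `actuators` interfaces are hypothetical placeholders, while the 15-second command window is taken from the description above.

```python
import time


class SpeechCommandPipeline:
    """Minimal sketch of the interaction logic: the keyword arms the
    command recognizer, which stays active for a 15-second window."""

    COMMAND_WINDOW_S = 15.0  # time allowed to issue (or repeat) a command

    def __init__(self, vad, kws, recognizer, actuators):
        # vad, kws, recognizer and actuators are placeholder interfaces
        self.vad = vad
        self.kws = kws
        self.recognizer = recognizer
        self.actuators = actuators

    def run_once(self, audio_stream):
        # 1) Idle: wait for a speech chunk that contains the keyword.
        chunk = self.vad.next_speech_chunk(audio_stream)
        if not self.kws.contains_keyword(chunk):
            return  # stay idle

        # 2) Armed: accept commands until the window expires.
        deadline = time.monotonic() + self.COMMAND_WINDOW_S
        while time.monotonic() < deadline:
            chunk = self.vad.next_speech_chunk(audio_stream)
            command = self.recognizer.classify(chunk)  # None if rejected
            if command is not None:
                self.actuators.execute(command)
                return  # back to idle after a recognized command
        # Timeout: back to idle, the keyword must be pronounced again.
```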
The entire system was deployed on a low power and low energy embedded device, namely an NVIDIA Xavier NX; furthermore, the command recognition network was optimized for inference using the ONNX Runtime framework36. The software architecture was developed using the Robotic Operating System (ROS)37, which enables the creation of a flexible and easily maintainable system, fostering modularity.
Speech command recognition module
The speech command recognition has been formulated as a classification problem, with N being the number of command classes to be recognized. As already discussed, the literature offers a variety of approaches categorized into Dual-Stage, Large Language Model, and End-to-End methods. Dual-Stage approaches are prone to error propagation issues and are typically designed for more complex tasks, often incorporating an ASR module, which requires larger datasets for training and exhibits higher inference times. Large Language Models have shown impressive results in intent recognition tasks, but they demand extensive data, and their computational complexity restricts their deployment on embedded systems. In light of these considerations, and recognizing that our command set remains fixed and limited in size, E2E approaches provide a favorable balance among performance, inference time, and the amount of data required for training.
Within this context, in this paper we propose to employ a Conformer architecture38, which integrates convolutional neural networks with transformers, harnessing the respective strengths of both architectures to improve speech processing tasks. The Conformer architecture was specifically chosen due to its demonstrated state-of-the-art performance in the Automatic Speech Recognition (ASR) task. Previous research13 has shown that the Conformer outperforms other approaches, such as ResNet and MobileNet, in command classification. The authors reported an average accuracy of 91.40% for the Conformer, compared to 90.32% for ResNet and 89.52% for MobileNet. Notably, the Conformer exhibited superior robustness in handling high-noise samples. For instance, at an SNR of 0 dB, the Conformer achieved an accuracy of 83%, while ResNet and MobileNet achieved only 67% and 72%, respectively.
Initially, the audio waveform is preprocessed and represented with a mel-spectrogram, which is then fed into the Conformer encoder. The encoder performs preliminary signal processing using convolutional and linear layers before the data is processed by stacked Conformer blocks. Each Conformer block contains two feed-forward networks (FFNs) positioned at the top and bottom of the block. Sandwiched between these FFNs, there is a module for multi-head self-attention and a convolutional block. The entire block is characterized by the presence of skip connections, which facilitate the flow of gradients during training, enhancing the model’s performance and convergence. In our implementation, the Conformer model processes mel-spectrograms extracted from audio signals. These spectrograms are generated by applying a Fast Fourier Transform (FFT) to the audio data using a sliding window with a duration of 32 ms and a hop size of 16 ms.
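As a concrete illustration of this preprocessing step, the snippet below computes log mel-spectrograms with the stated 32 ms window and 16 ms hop. The 16 kHz sampling rate and the 80 mel bands are assumptions, since these values are not reported above; the sketch uses standard torchaudio transforms rather than our exact preprocessing code.

```python
import torch
import torchaudio

# Assumed parameters: the 16 kHz sampling rate and 80 mel bands are not
# stated in the paper; the 32 ms window and 16 ms hop are.
SAMPLE_RATE = 16_000
WIN_LENGTH = int(0.032 * SAMPLE_RATE)   # 32 ms -> 512 samples
HOP_LENGTH = int(0.016 * SAMPLE_RATE)   # 16 ms -> 256 samples

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=WIN_LENGTH,
    win_length=WIN_LENGTH,
    hop_length=HOP_LENGTH,
    n_mels=80,
)


def preprocess(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a mono waveform (1, T) into a log mel-spectrogram (1, 80, frames)."""
    mel = mel_transform(waveform)
    return torch.log(mel + 1e-6)  # log compression for numerical stability
```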
Dataset
The dataset used to train the speech command recognition system was built upon the MIVIA-ISC dataset16. In that dataset, audio samples were collected using a Telegram bot via the microphone of users’ devices. In our upgrade, data were collected using the Respeaker Mic Array V2, following a specific data acquisition protocol. Furthermore, synthetic audio samples were generated using state-of-the-art TTS services to increase the amount of data, and these were added exclusively to the training set. While the training set was acquired in a controlled environment, the test samples were collected in real-world scenarios within the manufacturing environment, specifically during working days to ensure realistic conditions. The data were acquired while users were actively engaged in operations at their workstations. Consequently, speech commands were recorded at varying distances from the microphone, often while the user was not oriented towards the microphone, thereby affecting the signal-to-noise ratio (SNR) levels. Moreover, since manufacturing machines were operational during working hours, the recorded samples naturally incorporated real industrial noise, including tool sounds (e.g., screwdrivers, conveyor belts) and background speech. This setup enabled a comprehensive evaluation of the system’s performance in a real-world industrial environment, assessing its robustness, generalization capabilities, and ability to work effectively in industrial conditions.
In more detail, in our proposed dataset, the audio samples were collected in a noise-free environment using the aforementioned microphone, capturing speech at various distances and user head orientations. Initially, users were asked to say the commands from the command set at distances of 1 and 2 meters, with their heads oriented toward the microphone. Subsequently, they were asked to repeat the commands with their heads oriented in the opposite direction (with their backs to the microphone). Varying the head orientation is crucial to avoid overfitting to the Direction Of Arrival (DOA) of the voice.
In addition, to increase the amount of data, synthetic samples were generated using open-source TTS services as presented in the literature. We utilized the NVIDIA NeMo framework39, employing both dual-stage architectures composed of spectrogram generators and vocoders such as FastPitch40 and HiFi-GAN19, as well as E2E approaches like VITS41. These approaches are mostly multi-speaker, enabling the generation of various audio samples using the same text input and model architecture. However, the quality of samples generated is lower compared to premium TTS services, such as Amazon Polly42. It is crucial to consider that low-quality data can negatively impact the system’s performance.
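As an illustrative sketch of this two-stage generation pipeline, the snippet below follows the publicly documented NeMo TTS workflow (FastPitch spectrogram generator plus HiFi-GAN vocoder). The checkpoint names, method signatures, and the 22,050 Hz output rate come from the NeMo examples and may differ across framework versions, so this should be read as a sketch rather than the exact script used in this work.

```python
# Two-stage synthetic sample generation with NVIDIA NeMo:
# FastPitch (text -> mel-spectrogram) followed by HiFi-GAN (spectrogram -> waveform).
import soundfile as sf
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("tts_hifigan")
spec_generator.eval()
vocoder.eval()

command = "Bring me the gun screwdriver"
with torch.no_grad():
    tokens = spec_generator.parse(command)                      # text -> token ids
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# 22,050 Hz is the sampling rate of these public checkpoints (assumption).
sf.write("synthetic_command.wav", audio.squeeze().cpu().numpy(), 22050)
```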
Furthermore, we adopted an innovative Zero-Shot TTS approach based on the VoiceCraft framework20. This technique extracts the key voice characteristics from an input audio sample and its transcript, then generates the desired high-quality audio using the target text. To achieve this, we used speakers and their associated speech transcriptions from the LibriSpeech dataset43. Thus, the model receives as input the target command to be generated as text, an audio sample from one of the speakers retrieved from the aforementioned dataset, and its corresponding transcript. Employing this technique, we generated a substantial amount of data from over 200 different speakers.
In light of these considerations, our system has been trained to recognize a subset of the commands collected by our research group and available in the MIVIA-ISC dataset, specifically comprising 11 command classes plus a rejection class, totaling 12 classes for prediction. Note that the choice of this specific set of commands has been determined by the specific application scenario where we deployed the proposed system, namely Centro Ricerche Fiat (CRF) in Melfi, within the EU project FELICE. The distribution of commands in the dataset, including synthetic samples, is depicted in Figure 2, where commands are labeled from 0 to 10 and the rejection class is labeled as 11. The mapping between IDs and commands is presented in Table 1. As shown in the table, the list of commands poses significant challenges. It includes both concise commands such as “Back Home” or “Open/Close the Gripper”, and lengthy commands like “Bring me the gun-screwdriver”. Additionally, some commands differ by only a single word, such as those referring to the elbow-screwdriver.
Fig. 2.
Commands distribution in the dataset.
Table 1.
The 11 speech commands defined for the robot in both English and Italian languages, plus the reject class (ID 11), which is not strictly a speech command.
| ID | English |
|---|---|
| 0 | Bring me the gun screwdriver |
| 1 | Take the gun screwdriver |
| 2 | Bring me the window control panel |
| 3 | Open the gripper |
| 4 | Close the gripper |
| 5 | Go to the line side |
| 6 | Back home |
| 7 | Bring me the first elbow screwdriver |
| 8 | Take the first elbow screwdriver |
| 9 | Bring me the second elbow screwdriver |
| 10 | Take the second elbow screwdriver |
| 11 | Reject |
We also present the distribution of data points, providing both a global view that includes all commands and a pairwise comparison across individual commands (Figure 3). Notably, real samples (blue crosses) appear more dispersed than synthetic samples (red dots), which can be attributed to intrinsic noise and the variability of real-world recording conditions. More importantly, while real and synthetic samples do not completely overlap, they remain relatively close in the feature space. This indicates that the inclusion of synthetic data introduces meaningful variability into the dataset, which deep learning models can exploit to enhance generalization, as reflected in our reported results.
Fig. 3.
t-SNE visualizations for the global view and for the 11 commands using Euclidean distance. t-SNE is computed with perplexity of 50. Real samples (blue crosses) and synthetic samples (red dots) are shown for each case.
Training
During the training of our model, we have used different context-dependent data augmentation techniques, directly inspired by the problems that we observed in the real-world industrial scenario. Specifically, these techniques include gain adjustment, to simulate speech at different distances from the microphone, and time masking, to mitigate the impact of breaks in the utterance.
In more detail, regarding the gain adjustment, one key consideration is the dynamic relative distance between the robot and the human operator, which can vary over time. As a result, commands may be issued at different distances. To simulate this variability during training, in addition to acquiring data at different distances from the microphone, a random gain adjustment is applied to the audio signal. Specifically, at each training step, a gain value in decibels, $g$, is randomly sampled from a closed continuous interval $[g_{min}, g_{max}]$, resulting in either attenuation or amplification of the signal according to:

$$x_{aug} = 10^{g/20} \cdot x \tag{2}$$
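A minimal sketch of this augmentation is shown below; the gain interval of [-10, +10] dB is an illustrative placeholder, since the exact bounds used in our experiments are not reported here.

```python
import random

import torch


def random_gain(waveform: torch.Tensor,
                gain_db_range=(-10.0, 10.0)) -> torch.Tensor:
    """Apply a random gain (in dB) to simulate different speaker-microphone
    distances, following Eq. (2). The interval bounds are illustrative."""
    gain_db = random.uniform(*gain_db_range)
    return waveform * (10.0 ** (gain_db / 20.0))
```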
Another challenge arises when operators introduce pauses or breaks while delivering commands. To address this issue, a time masking technique is applied to the spectrogram (Figure 4). Specifically, during training, a portion of the spectrogram is set to a value of 0 across all frequencies. The length of the mask is randomly sampled from the interval $[0, T_{max}]$, where $T_{max}$ defines the maximum window size for masking.
Fig. 4.
Audio signal before and after time masking.
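The sketch below illustrates this masking step using the standard torchaudio TimeMasking transform, which zeroes a random span of time frames across all frequency bins; the maximum mask width of 30 frames is an illustrative placeholder for $T_{max}$.

```python
import torchaudio

# T_max (maximum mask width, in spectrogram frames) is an illustrative value;
# the exact figure used in our experiments is not reported here.
time_masking = torchaudio.transforms.TimeMasking(time_mask_param=30)


def mask_pauses(mel_spectrogram):
    """Zero out a random time span across all frequencies to simulate
    pauses inside an utterance (mask width sampled in [0, T_max])."""
    return time_masking(mel_spectrogram)
```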
Furthermore, to ensure proper functionality, the system must be robust to the noise present in the audio samples. Since the collected samples are typically recorded in non-noisy environments, a dedicated protocol is required to achieve this robustness. Specifically, we built a Noise-Dataset containing different kinds of noise samples, with a total of 2684 audio samples:
General-Purpose Noise, that includes background speech and mechanical tool sounds, sourced from the Mozilla Common Voice dataset44 and the FreeSound Database45.
Industrial Noise, that includes specific noise samples recorded in the wild by CRF in Melfi.
During training, noise samples are randomly selected from this database and applied to the speech waveform at various Signal-to-Noise Ratios (SNR). The selection of the SNR values is guided by the Curriculum Learning procedure. Initially, the network receives input samples with a high SNR (i.e., the impact of the noise is reduced). As training progresses through the epochs, the SNR is systematically reduced, thereby increasing the presence of noise in the training data. The specific SNR values used in the Curriculum Learning strategies are chosen within predefined minimum and maximum ranges. Thus, the effective SNR value applied to the audio signal depends on the specific Curriculum Learning strategy employed, such as Per-Epoch Noise Mixing (PEM) and Gaussian-PEM46.
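The following sketch illustrates the general idea of such a schedule: a target SNR that decreases across epochs, Gaussian jitter around the scheduled value (in the spirit of Gaussian-PEM), and noise mixed into the waveform at that SNR. The SNR range, the linear decay, and the jitter value are illustrative assumptions and do not reproduce the exact PEM/Gaussian-PEM parameters used in our experiments.

```python
import random

import torch


def snr_for_epoch(epoch: int, total_epochs: int,
                  snr_max_db: float = 30.0, snr_min_db: float = 0.0,
                  jitter_std_db: float = 2.0) -> float:
    """Curriculum schedule (illustrative): the target SNR decreases linearly
    from snr_max_db to snr_min_db across epochs, with Gaussian jitter
    around the scheduled value."""
    progress = epoch / max(total_epochs - 1, 1)
    target = snr_max_db + progress * (snr_min_db - snr_max_db)
    return random.gauss(target, jitter_std_db)


def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor,
               snr_db: float) -> torch.Tensor:
    """Scale the noise so that the speech/noise power ratio matches snr_db
    (assumes the noise clip is at least as long as the speech)."""
    noise = noise[..., : speech.shape[-1]]                 # trim to speech length
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```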
Specifically, the variation of the Signal-to-Noise Ratio (SNR) following the Gaussian-PEM strategy during training is depicted in Figure 5a, while Figure 5b illustrates the trend of the training and validation losses during the training process of the best-performing Conformer model (see Section “Results and discussion”). The observed peaks in validation loss, particularly in the initial training steps, can be attributed to the progressive variation in SNR values. The training process progresses from simpler high-SNR samples to more challenging low-SNR samples, introducing fluctuations in model performance. Nevertheless, the overall downward trend of the loss curves indicates that the model successfully converged. The final low values of both training and validation loss confirm the model’s ability to generalize effectively.
Fig. 5.
Plots illustrating the variation of the Signal-to-Noise Ratio (SNR) across training epochs and its impact on both training and validation losses. These plots provide insights about the influence of noise levels on model convergence and generalization capability. Specifically, they highlight how dynamic SNR adjustments affect the learning process and contribute to optimizing model robustness in noisy environments.
System calibration
After training, a calibration process was conducted in which a threshold th was set on the class confidence levels. An audio sample is classified as belonging to class C if the maximum probability in the class probability distribution exceeds the threshold th; otherwise, the sample is rejected. This thresholding mechanism ensures that only samples with sufficiently high confidence in their classification are accepted as commands, while the others are disregarded:

$$\hat{C} = \begin{cases} \arg\max_i \, p_i & \text{if } \max_i \, p_i > th \\ \text{reject} & \text{otherwise} \end{cases} \tag{3}$$

where $p = (p_1, \dots, p_N)$ is the output of the Conformer model. The threshold value was determined by computing the precision-recall curve on the validation set and selecting the threshold that maximizes the F1-score. During testing, although the model correctly classified the true command classes, it occasionally failed to meet the threshold due to noise in the samples. To address this issue, a temperature47 parameter T was introduced into the softmax function (Formula 4). This temperature parameter allows for adjusting the confidence distribution of the classes without altering their ranking.

$$p_i = \frac{e^{z_i / T}}{\sum_{j=1}^{N} e^{z_j / T}} \tag{4}$$

where $z_i$ denotes the logit produced by the Conformer model for class $i$.
Indeed, by introducing a temperature parameter into the softmax function, we can effectively boost the confidence of predicted classes in highly noisy conditions without fundamentally altering the predictions.
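A minimal sketch of the resulting decision rule is shown below; the threshold and temperature values are placeholders to be calibrated on the validation set, as described above.

```python
import torch


def classify_with_rejection(logits: torch.Tensor,
                            threshold: float,
                            temperature: float = 1.0):
    """Temperature-scaled softmax (Eq. 4) followed by confidence
    thresholding (Eq. 3): return the predicted class index, or None
    (rejection) when the top probability does not exceed the threshold."""
    probs = torch.softmax(logits / temperature, dim=-1)
    confidence, predicted = probs.max(dim=-1)
    if confidence.item() > threshold:
        return predicted.item()
    return None  # rejected
```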
Keyword spotting detection
All the design choices presented so far aim to ensure that the system can accurately classify commands even in noisy industrial environments while effectively rejecting non-command inputs. However, this is not sufficient for deployment in real industrial scenarios. Indeed, since the system is always active, the rate of false positives must be carefully controlled. In an industrial environment, a false positive could result in the robot unintentionally starting to move, leading to safety issues and production inefficiencies.
To address this issue, two solutions have been proposed. The first solution involves adding a filter to discard samples that do not contain speech activity, using a Voice Activity Detection (VAD) module48. This has been implemented with the built-in algorithm of the Respeaker Mic Array V2, with an appropriate VAD threshold set based on field tests.
Additionally, to further reduce false positives, a KWS system has been integrated. Specifically, we use the EfficientWord-Net architecture49, fine-tuned with a few-shot learning approach based on Siamese Networks50. This architecture consists of two identical sub-networks that share the same architecture and weights. These networks process two different inputs to compute their embeddings, and a similarity metric (such as Euclidean distance or cosine similarity) is used to determine whether the inputs are similar. This approach has demonstrated promising results in learning effective keyword representations with limited data. In our work, we used cosine similarity as the metric, since it remains robust against variations in vector magnitude that can be caused by noise.
Dataset
As mentioned earlier, the KWS module was trained using a few-shot approach, requiring only a few samples. To accomplish this, we collected samples for the following keywords: “Ok Robot”, “Hey Robot”, “Ok Felice”, and “Hey Felice”, and trained the system using only 10 samples per keyword. Note that the name Felice has been used since it is the name of the project partially financing the activities behind this paper. The test set consists of 1689 samples, including 439 positive samples containing one of the aforementioned keywords and 1250 negative samples that do not contain any keyword.
Training
The system was pre-trained on a dataset of synthetic samples generated using premium services such as Amazon Polly and Siri, employing the Triplet Loss function51. Subsequently, subsets of 10 samples per keyword were collected from our dataset. For each subset, embeddings were computed to serve as prototypes. When an audio sample is acquired, the same network is used to compute the embedding of the target sample. The cosine similarity between the target embedding and the 10 prototypes is then calculated. If the maximum similarity value exceeds a predefined threshold, a keyword is detected; otherwise, the sample is discarded. This approach allows the module to be trained easily without requiring a large dataset; furthermore, it simplifies the process of adding new keywords to the system as needed, thereby achieving high flexibility.
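The matching step can be sketched as follows; `embedder` stands in for the shared Siamese encoder, and the prototype tensor contains one embedding per enrollment sample. The snippet is illustrative of the prototype comparison only, not of the EfficientWord-Net implementation itself.

```python
import torch
import torch.nn.functional as F


def detect_keyword(embedder, audio_sample: torch.Tensor,
                   prototypes: torch.Tensor, threshold: float) -> bool:
    """Few-shot keyword detection: compare the embedding of the incoming
    sample against the stored keyword prototypes and fire when the best
    cosine similarity exceeds the predefined threshold."""
    with torch.no_grad():
        query = embedder(audio_sample.unsqueeze(0))            # (1, D)
    similarities = F.cosine_similarity(query, prototypes)      # (K,)
    return similarities.max().item() > threshold
```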
Experimental setting
In this section, we describe the experimental procedure and the results obtained. We trained the Conformer on the extended dataset described in the previous section, conducting a series of experiments aimed at addressing the challenges discussed in the “Spoken language understanding and recognition” section. In particular, the model was trained for a maximum of 200 epochs, with an initial learning rate of 0.001. The learning rate was adjusted during training using the Reduce On Plateau method, which had a patience of 5 epochs and a decrease ratio of 1e-2. Early stopping criteria were set with a patience of 12 epochs and a delta of 0.005, which represents the minimum increase in the score required to qualify as an improvement. The Adam algorithm was adopted as the optimizer. The model was trained on an A100 GPU using a total of 128 workers, enabling high-throughput data processing and minimizing delays associated with the aforementioned data augmentation techniques. This efficient parallelization resulted in a remarkably low training time of only 4 hours.
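The loop below is a minimal sketch of the reported training configuration (Adam, Reduce On Plateau with patience 5 and factor 1e-2, early stopping with patience 12 and delta 0.005); the `train_one_epoch` and `validate` callables are caller-supplied placeholders, and monitoring the validation loss for the scheduler is an assumption.

```python
import torch


def train_with_schedule(model, train_one_epoch, validate,
                        max_epochs=200, lr=1e-3,
                        plateau_patience=5, plateau_factor=1e-2,
                        early_stop_patience=12, min_delta=0.005):
    """Sketch of the reported setup; `validate` must return (loss, score)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=plateau_factor, patience=plateau_patience)

    best_score, stale_epochs = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer, epoch)
        val_loss, val_score = validate(model)
        scheduler.step(val_loss)                 # reduce LR on plateau
        if val_score > best_score + min_delta:   # improvement beyond delta
            best_score, stale_epochs = val_score, 0
        else:
            stale_epochs += 1
            if stale_epochs >= early_stop_patience:
                break                            # early stopping
    return model
```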
Evaluation Metrics Given the defined dataset and experimental setup, we need to establish the metrics used to evaluate the different methods. Our primary objective is to develop a method that optimizes both the correct classification of commands and the accurate identification of rejections. Furthermore, regarding command classification, we aim to separately evaluate cases where a command is misclassified as another command and cases where it is incorrectly rejected. For these reasons, we define the following metrics:
Command Recall ($R_C$): Measures the system’s ability to correctly recognize commands without misclassifying them as rejections.
Command Precision ($P_C$): Evaluates how accurately the system classifies commands, minimizing confusion between different (especially similar) commands.
Rejection Recall ($R_R$): Assesses the system’s ability to correctly classify rejection samples, ensuring they are not misclassified as valid commands.
Command F1-score ($F1_C$): Overall metric that considers both the recall and precision of the command class.
These metrics are fully defined in formula 5. Specifically, in this context: $TP_C$ represents a command sample correctly classified with the correct command; $TN_R$ denotes a rejection sample correctly classified as a rejection; $FP_C$ refers to a false positive for the command class, meaning a command sample is misclassified as another command; $FN_C$ is a false negative for the command class, where a command sample is incorrectly classified as a rejection; $FN_R$ denotes a false negative for the rejection class, where a rejection sample is misclassified as any of the commands.
With these metrics, we can effectively measure the system’s performance in correctly classifying commands while accurately identifying rejections.
$$R_C = \frac{TP_C}{TP_C + FN_C}, \qquad P_C = \frac{TP_C}{TP_C + FP_C}, \qquad R_R = \frac{TN_R}{TN_R + FN_R}, \qquad F1_C = \frac{2 \cdot P_C \cdot R_C}{P_C + R_C} \tag{5}$$
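For completeness, the metrics of formula 5 can be computed directly from the five counts defined above, as in the following sketch.

```python
def command_metrics(tp_c: int, fp_c: int, fn_c: int,
                    tn_r: int, fn_r: int) -> dict:
    """Compute the metrics of formula (5) from the raw counts."""
    recall_c = tp_c / (tp_c + fn_c)
    precision_c = tp_c / (tp_c + fp_c)
    recall_r = tn_r / (tn_r + fn_r)
    f1_c = 2 * precision_c * recall_c / (precision_c + recall_c)
    return {"R_C": recall_c, "P_C": precision_c,
            "R_R": recall_r, "F1_C": f1_c}
```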
Test Set As stated in the paper, although the model is trained on a large dataset collected in a controlled, noise-free environment, our goal is to test it in a setting as similar as possible to the real world. Therefore, we gathered a completely independent test set consisting of audio samples collected during real-use-case scenarios in an actual industrial environment.
These samples are collected from users outside the training distribution, who give commands naturally while working at their workstations. Generally, these users are far from the microphone mounted on the robot and not aligned with it. This approach allows us to effectively evaluate the results of applying context-inspired data augmentation. The dataset comprises a total of 195 command samples distributed as shown in Figure 6.
Fig. 6.
Distribution of commands in the proposed test set.
Inference time Given that the proposed system is intended for deployment in real-world environments, adherence to real-time constraints is imperative. Accordingly, we analyzed the processing time required by the system’s core components. All timing measurements were obtained on the target deployment platform, namely the NVIDIA Jetson Xavier NX. As previously outlined (see Figure 1), the system includes three primary modules: Keyword Spotting (KWS), Voice Activity Detection (VAD), and Speech Command Recognition. The KWS module, which leverages a ResNet50-based architecture in conjunction with a similarity-based matching strategy between prototypes and candidate features, requires approximately 20 milliseconds for feature extraction and keyword identification. The VAD module is triggered upon detecting speech activity and continues collecting audio data for 1.4 seconds before terminating the stream. This duration was empirically determined as a compromise: shorter time windows risk prematurely truncating the speech signal, thereby compromising the accuracy of subsequent processing stages, while longer durations unnecessarily increase overall system latency. Finally, the Speech Command Recognition module, which adopts a Conformer model, completes classification in 33 milliseconds. We can conclude that the method responds within a maximum delay of approximately 1.5 seconds, which is acceptable for the problem at hand.
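For clarity, the quoted 1.5-second bound follows from summing the reported per-module times:

$$\underbrace{20\,\text{ms}}_{\text{KWS}} + \underbrace{1400\,\text{ms}}_{\text{VAD window}} + \underbrace{33\,\text{ms}}_{\text{Conformer}} \approx 1.45\,\text{s} < 1.5\,\text{s}$$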
Results and discussion
Do synthetic samples add useful variability, improving generalization performance on the test set?
We first address the question of the effectiveness of synthetic data in the most challenging setting, which excludes any kind of data augmentation. With this setup, we can effectively evaluate whether the synthetic samples add informative variability or just noise.
Table 2 summarizes the obtained results, highlighting key trends. First, the use of the Curriculum Learning procedure generally enhances the system’s ability to correctly classify commands, as indicated by the increase in $R_C$ for both setups (with and without synthetic samples). This effect is particularly evident when using synthetic samples, where $R_C$ improves from 0.904 to 0.943.
Table 2.
The performance of the Conformer model is evaluated by comparing the results with and without synthetic samples, as well as conducting a preliminary evaluation of the Curriculum Learning procedure (CL). In this comparison, the recall, precision, and F1-score metrics are reported for command samples ($R_C$, $P_C$, $F1_C$) and recall for rejection samples ($R_R$).
| Configuration | $R_C$ | $P_C$ | $F1_C$ | $R_R$ |
|---|---|---|---|---|
| without synth | 0.922 | 0.823 | 0.869 | 0.818 |
| without synth+CL | 0.955 | 0.939 | 0.947 | 0.801 |
| Avg (without synth) | 0.939 | 0.881 | 0.908 | 0.810 |
| with synth | 0.904 | 0.936 | 0.920 | 0.841 |
| with synth+CL | 0.943 | 0.927 | 0.935 | 0.757 |
| Avg (with synth) | 0.924 | 0.932 | 0.928 | 0.799 |
However, this improvement comes at the cost of a reduced ability to recognize rejection samples. The most notable decline occurs in the synthetic sample experiment, where $R_R$ drops from 0.841 to 0.757. A possible explanation for this behavior is that the progressive introduction of noise through CL lowers the Signal-to-Noise Ratio (SNR), introducing a bias that leads the model to misclassify some rejection samples as commands.
Regarding the impact of synthetic samples, models trained with them generally perform better at command classification, as indicated by a higher average $P_C$ (0.932 vs. 0.881). However, models trained without synthetic samples exhibit superior rejection classification performance ($R_R$: 0.810 vs. 0.799) and a lower tendency to misclassify commands as rejections ($R_C$: 0.939 vs. 0.924).
Overall, while it is difficult to draw definitive conclusions about the impact of synthetic samples, the results suggest that they introduce beneficial variability in command signals rather than simply adding noise, thus enhancing command classification performance.
Given the differences between the training set (controlled, noise-free environment) and the test set (real-world, noisy environment), are domain-inspired data augmentation techniques effective in improving robustness on the test set?
As the second step of our experiments, we aim to verify whether the domain-inspired data augmentation techniques contribute to improvements over the baseline reported in Table 2.
Table 3 summarizes the results, highlighting key trends.
Table 3.
The performance of the ResNet and Conformer models is evaluated by analyzing the impact of incrementally incorporating domain-specific data augmentation techniques. Both models are trained under two conditions: with synthetic samples (with synth) and without synthetic samples (without synth). The evaluation considers recall, precision, and F1-score metrics for command samples ($R_C$, $P_C$, $F1_C$) and recall for rejection samples ($R_R$).
| Model | Configuration | $R_C$ | $P_C$ | $F1_C$ | $R_R$ |
|---|---|---|---|---|---|
| ResNet | Gain (with synth) | 0.805 | 0.877 | 0.840 | 0.908 |
| ResNet | Gain + Masking (with synth) | 0.775 | 0.705 | 0.738 | 0.874 |
| ResNet | Gain + Masking + CL (with synth) | 0.806 | 0.725 | 0.763 | 0.855 |
| ResNet | Avg (with synth) | 0.795 | 0.769 | 0.780 | 0.879 |
| Conformer | Gain (with synth) | 0.948 | 0.911 | 0.929 | 0.792 |
| Conformer | Gain + Masking (with synth) | 0.966 | 0.945 | 0.956 | 0.828 |
| Conformer | Gain + Masking + CL (with synth) | 0.972 | 0.945 | 0.958 | 0.728 |
| Conformer | Avg (with synth) | 0.962 | 0.934 | 0.947 | 0.783 |
| Conformer | Gain (without synth) | 0.954 | 0.928 | 0.941 | 0.749 |
| Conformer | Gain + Masking (without synth) | 0.878 | 0.857 | 0.867 | 0.835 |
| Conformer | Gain + Masking + CL (without synth) | 0.936 | 0.904 | 0.920 | 0.663 |
| Conformer | Avg (without synth) | 0.923 | 0.896 | 0.909 | 0.749 |
First, we observe that the Conformer architecture consistently outperforms the ResNet architecture, aligning with the findings in13. Specifically, the Conformer achieves a substantially higher average $F1_C$ than ResNet (0.947 vs. 0.780 when trained with synthetic samples). This highlights the superior architectural capabilities of the Conformer, which are particularly crucial in the highly challenging classification scenario examined in this paper. Notably, the test samples, real-world noisy recordings from an industrial production line, differ significantly from the training samples, which are noise-free and collected in a controlled laboratory environment.
Second, this series of experiments further supports the idea that high-quality synthetic data enhances the model’s generalization capabilities. On average, models trained with synthetic samples outperform those trained without them across all evaluated metrics. Specifically, models trained with synthetic data are better at avoiding misclassification of commands as rejections ($R_C$: 0.962 vs. 0.923), correctly classifying commands ($P_C$: 0.934 vs. 0.896), and accurately identifying rejection samples ($R_R$: 0.783 vs. 0.749). These results highlight the effectiveness of synthetic data in improving classification performance across diverse scenarios.
In conclusion, analyzing the performance of models with domain-inspired data augmentation reveals that the two best-performing methods in terms of $F1_C$ incorporate both Gain Modulation and Masking augmentations. This suggests that these techniques effectively address real-world challenges, such as variations in the distance between the human operator and the robot’s microphone.
Overall, the best-performing model achieves an $F1_C$ of 0.958, marking a significant improvement of +0.118 over the baseline, the Conformer trained without synthetic data and without data augmentation (first row of Table 2). However, this improvement comes at the cost of a reduction in $R_R$, which drops to 0.728, representing a decline of −0.11. This decrease is noteworthy, as it indicates a higher likelihood of misclassifying rejection samples (e.g., noise, tools, and background conversations) as valid commands, which could have practical implications in real-world deployments. To address this issue, we propose an additional module based on Keyword Spotting, which enables command recognition only when necessary. The results of this component are discussed in the following section.
Does the use of a KWS detection system help reduce the number of rejection samples misclassified?
With the previous steps, we developed a Command Speech Recognition System that can recognize commands in a robust and consistent manner, achieving an F1-score of 95.8% on a real-world test set, despite being trained on a dataset collected in a controlled laboratory environment. However, as previously mentioned, optimizing the system for command recognition led to a decrease in performance for rejection recognition. To address this issue while maintaining command recognition performance, we proposed a Keyword Spotting system. This system is designed to activate the recognition module only when explicitly prompted by the human operator.
To validate this approach, we tested the KWS system on the 1959 rejection samples that were mistakenly classified as commands, evaluating how often the system correctly detects the keyword “Hey robot” or “Ok robot” compared to instances where it does not. This simulates real-world scenarios in which background noise might cause the Command-Speech Recognition module to erroneously send a command to the underlying control systems.
Our evaluation shows an accuracy of 0.910, indicating that the system significantly reduces the number of false positives on rejection samples, lowering it from 1959 to 176. This represents a substantial improvement over the baseline. Combined with the enhanced command recognition accuracy, this solution results in a robust system that is not only easier to train but also highly effective and reliable in real-world applications. The system maintains its performance in command recognition, as the KWS accurately identifies keywords within normal speech flow. Furthermore, it is more robust to chit chat and general noise, as these samples are filtered out by the KWS, which does not detect any keywords and, therefore, does not activate the underlying classifier.
Conclusions
In this paper we propose a Speech Command Recognition Framework designed for deployment in a real-world industrial environment. This scenario poses significant challenges due to two conflicting requirements: the system must accurately recognize intended commands while rejecting noise and irrelevant speech. These challenges are further emphasized by the difficulty and the cost of collecting a large, representative dataset needed for training such systems. To address these issues, we adopted a set of design choices tailored to the application context. Specifically, we simplified the dataset creation process by recording speech command samples in a controlled, noise-free environment. To enhance the dataset’s size and diversity, we added synthetic samples generated through novel generative models. Additionally, we collected a small test set of real-world samples recorded directly on the production line at the CRF facility in Melfi.
We propose the use of a Conformer architecture, trained on the collected training set and incorporating context-inspired data augmentation techniques, namely gain adjustment, time masking, and dynamic noise addition, applied following a Curriculum Learning procedure.
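As an illustration, the following NumPy sketch reproduces the spirit of these augmentations; the gain range, masking fraction, and SNR schedule are illustrative assumptions, not the exact settings used in our training pipeline.

```python
# Minimal NumPy sketch of the augmentations named above; parameter ranges and
# the SNR schedule are illustrative assumptions, not the paper's settings.
import numpy as np

rng = np.random.default_rng(0)


def gain_adjustment(x, low_db=-6.0, high_db=6.0):
    """Scale the waveform by a random gain drawn in decibels."""
    gain_db = rng.uniform(low_db, high_db)
    return x * 10.0 ** (gain_db / 20.0)


def time_masking(x, max_fraction=0.1):
    """Zero out a random contiguous chunk of the waveform."""
    mask_len = int(rng.uniform(0, max_fraction) * len(x))
    start = rng.integers(0, max(1, len(x) - mask_len))
    y = x.copy()
    y[start:start + mask_len] = 0.0
    return y


def add_noise(x, noise, snr_db):
    """Mix in a noise clip scaled so the mixture has the requested SNR."""
    noise = np.resize(noise, len(x))
    p_signal = np.mean(x ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise


def curriculum_snr(epoch, start_db=20.0, end_db=0.0, total_epochs=50):
    """Curriculum schedule: start with clean-ish samples, end with heavy noise."""
    t = min(epoch / total_epochs, 1.0)
    return start_db + t * (end_db - start_db)
```

In this scheme, the scheduled SNR would be sampled per epoch (or per batch), so early epochs see nearly clean speech while later epochs are progressively noisier, reflecting the Curriculum Learning idea.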
During testing, we observed that the inclusion of synthetic samples and context-inspired data augmentation consistently improved command recognition, achieving an F1-score of 95.8% on command samples, an improvement of +11.8% over the baseline. Despite these significant gains in command recognition, the system produced a number of false positives that was unacceptable for the target application scenario. To address this, we integrated a Keyword Spotting module, which activates the command recognition system only when explicitly triggered by the human operator. This approach reduced the number of false positives from 1959 to 176, resulting in an overall accuracy of 91%.
With these design choices, we achieved robust performance in both command recognition in noisy scenarios and the rejection of irrelevant speech. The proposed Speech Command Recognition module is thus suitable for deployment in a real-world industrial environment, while minimizing the cost of collecting a large, representative dataset.
Acknowledgements
This work has received funding from the EU Horizon 2020 program under GA No. 101017151 FELICE52.
Author contributions
The authors contributed equally to this work.
Data availability
The datasets generated during and/or analyzed during the current study are available in the Zenodo repository, https://zenodo.org/records/14771083.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Giuseppe De Simone, Antonio Greco, Francesco Rosa, Alessia Saggese and Mario Vento contributed equally.
References
- 1.Lasi, H., Fettke, P., Kemper, H.-G., Feld, T. & Hoffmann, M. Industry 4.0. Business & Information Systems Engineering 6, 239–242 (2014).
- 2.Qin, L., Xie, T., Che, W. & Liu, T. A survey on spoken language understanding: Recent advances and new frontiers. In Zhou, Z.-H. (ed.) Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, 4577–4584, 10.24963/ijcai.2021/622 (International Joint Conferences on Artificial Intelligence Organization, 2021). Survey Track.
- 3.Coucke, A. et al. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190 (2018).
- 4.Wang, D., Wang, X. & Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 11, 1018 (2019).
- 5.Fendji, J. L. K. E., Tala, D. C. M., Yenke, B. O. & Atemkeng, M. Automatic Speech Recognition Using Limited Vocabulary: A Survey. Applied Artificial Intelligence 36, 2095039, 10.1080/08839514.2022.2095039 (2022).
- 6.Radford, A. et al. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022).
- 7.Wu, T., Wang, M., Xi, Y. & Zhao, Z. Intent recognition model based on sequential information and sentence features. Neurocomputing 566, 127054, 10.1016/j.neucom.2023.127054 (2024).
- 8.Chen, Q., Zhuo, Z. & Wang, W. BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909 (2019).
- 9.Li, C., Park, J., Kim, H. & Chrysostomou, D. How can I help you? An intelligent virtual assistant for industrial robots. In Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 220–224 (2021).
- 10.de Andrade, D. C., Leo, S., Viana, M. & Bernkopf, C. A neural attention model for speech command recognition. arXiv preprint arXiv:1808.08929 (2018).
- 11.Qian, Y. et al. Speech-language pre-training for end-to-end spoken language understanding. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7458–7462 (IEEE, 2021).
- 12.Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 (2018).
- 13.Bini, S., Carletti, V., Saggese, A. & Vento, M. Robust speech command recognition in challenging industrial environments. Computer Communications 228, 107938, 10.1016/j.comcom.2024.107938 (2024).
- 14.Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V. S. & Bengio, Y. Speech model pre-training for end-to-end spoken language understanding. Proceedings of Interspeech 2019 (2019).
- 15.Majumdar, S. & Ginsburg, B. MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. In Proc. Interspeech 2020, 3356–3360, 10.21437/Interspeech.2020-1058 (2020).
- 16.Bini, S., Percannella, G., Saggese, A. & Vento, M. A multi-task network for speaker and command recognition in industrial environments. Pattern Recognition Letters 176, 62–68 (2023).
- 17.Bengio, Y., Louradour, J., Collobert, R. & Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 41–48, 10.1145/1553374.1553380 (Association for Computing Machinery, New York, NY, USA, 2009).
- 18.Ren, Y. et al. FastSpeech: Fast, Robust and Controllable Text to Speech. In Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., 2019).
- 19.Kong, J., Kim, J. & Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems 33, 17022–17033 (2020).
- 20.Peng, P., Huang, P.-Y., Li, S.-W., Mohamed, A. & Harwath, D. VoiceCraft: Zero-shot speech editing and text-to-speech in the wild. arXiv preprint arXiv:2403.16973 (2024).
- 21.Mesnil, G. et al. Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 530–539, 10.1109/TASLP.2014.2383614 (2015).
- 22.Gorin, A. L., Riccardi, G. & Wright, J. H. How may I help you? Speech Communication 23, 113–127, 10.1016/S0167-6393(97)00040-X (1997).
- 23.Ravanelli, M. & Bengio, Y. Speaker Recognition from Raw Waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), 1021–1028, 10.1109/SLT.2018.8639585 (2018).
- 24.Vaswani, A. et al. Attention is all you need. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017).
- 25.Brown, T. et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, vol. 33, 1877–1901 (Curran Associates, Inc., 2020).
- 26.Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C. & Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186, 10.18653/v1/N19-1423 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).
- 27.Hsu, W.-N. et al. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing29, 3451–3460 (2021). [Google Scholar]
- 28.wen Yang, S. et al. SUPERB: Speech Processing Universal PERformance Benchmark. In Proc. Interspeech 2021, 1194–1198, 10.21437/Interspeech.2021-1775 (2021).
- 29.Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems25 (2012).
- 30.Eigenschink, P. et al. Deep generative models for synthetic data: A survey. IEEE Access11, 47304–47320 (2023). [Google Scholar]
- 31.Antoniou, A., Storkey, A. spsampsps Edwards, H. Data augmentation generative adversarial networks. arXiv preprintarXiv:1711.04340 (2017).
- 32.Goodfellow, I. J. et al. Generative adversarial nets. In Neural Information Processing Systems (2014).
- 33.Lugosch, L., Meyer, B. H., Nowrouzezahrai, D. & Ravanelli, M. Using speech synthesis to train end-to-end spoken language understanding models. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 8499–8503, 10.1109/ICASSP40776.2020.9053063 (2020).
- 34.Shen, J. et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), 4779–4783 (IEEE, 2018).
- 35.Wang, C. et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 (2023).
- 36.Onnx Runtime. https://onnxruntime.ai.
- 37.Robotic Operating System. https://www.ros.org.
- 38.Gulati, A. et al. Conformer: Convolution-augmented transformer for speech recognition. In Proc. Interspeech 2020, 10.21437/Interspeech.2020-3015 (2020).
- 39.Kuchaiev, O. et al. NeMo: a toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577 (2019).
- 40.Łańcucki, A. FastPitch: Parallel text-to-speech with pitch prediction. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6588–6592 (2021).
- 41.Kim, J., Kong, J. & Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, 5530–5540 (PMLR, 2021).
- 42.Amazon polly. https://aws.amazon.com/polly.
- 43.Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210, 10.1109/ICASSP.2015.7178964 (2015).
- 44.Ardila, R. et al. Common Voice: A Massively-Multilingual Speech Corpus. In Calzolari, N. et al. (eds.) Proceedings of the Twelfth Language Resources and Evaluation Conference, 4218–4222 (European Language Resources Association, Marseille, France, 2020).
- 45.Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 829–852, 10.1109/TASLP.2021.3133208 (2022).
- 46.Braun, S., Neil, D. & Liu, S.-C. A curriculum learning method for improved noise robustness in automatic speech recognition. In 2017 25th European Signal Processing Conference (EUSIPCO), 548–552, 10.23919/EUSIPCO.2017.8081267 (2017). ISSN: 2076-1465.
- 47.Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In International conference on machine learning, 1321–1330 (PMLR, 2017).
- 48.Yang, Q. et al. Svad: A robust, low-power, and light-weight voice activity detection with spiking neural networks. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 221–225, 10.1109/ICASSP48485.2024.10446945 (2024).
- 49.Chidhambararajan, R. et al. EfficientWord-Net: An Open Source Hotword Detection Engine Based on Few-Shot Learning. Journal of Information & Knowledge Management 21, 2250059, 10.1142/S0219649222500599 (2022).
- 50.Koch, G., Zemel, R., Salakhutdinov, R. et al. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, vol. 2, 1–30 (Lille, 2015).
- 51.Schroff, F., Kalenichenko, D. & Philbin, J. Facenet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 815–823 (2015).
- 52.Felice: Flexible assembly manufacturing with human-robot collaboration and digital twin models. https://www.felice-project.eu/ (2021). Horizon 2020 project.