Improving Robot Voice Recognition in Noisy Settings
Research focuses on helping robots better understand speech amidst background noise.
― 5 min read
In recent years, the interaction between humans and robots has become an important area of research. As robots are designed to communicate more naturally with people, it is essential to improve how they understand human speech in the presence of noise, such as the robot's own voice or the noise of its internal fan. This paper discusses methods to help robots better recognize and separate human voices during conversations.
The Problem of Overlapping Voices
When a robot speaks to a person, its own voice can interfere with its ability to understand what the person is saying. This is similar to being in a noisy room where you have trouble hearing someone talking to you. Current systems found in robots often require them to stop speaking to listen better, which creates an unnatural flow to the conversation. People cannot give feedback or respond naturally while the robot is talking.
To make interactions feel more natural, we need robots that can listen to humans while they speak at the same time. However, current automatic speech recognition systems struggle to separate overlapping voices. This paper explores how to improve this situation through specific techniques.
Target Speech Extraction
One of the main goals of this research is to develop a system that allows a robot to filter out its own speech and listen to the human voice more effectively. This involves a method called target speech extraction (TSE), which aims to isolate the human voice from overlapping sounds such as the robot's own voice and its fan noise.
To achieve this, we created a dataset of recordings that include both the robot's voice and human speech. The recordings took place in rooms with low and high reverberation, allowing us to test how well the robot could understand human speech under different acoustic conditions.
Methodology
Data Collection
To gather the necessary data for testing our methods, we recorded three types of audio:
- Robot Speech: We recorded the robot speaking using different voices at various volumes.
- Human Speech: We recorded clean speech played from a loudspeaker at different volumes, approximating how a human interlocutor would sound to the robot.
- Combined Recordings: Using a software tool, we mixed recordings of the robot's voice with the human voice to create overlapping audio for analysis (a mixing sketch follows below).
These recordings were made in quiet rooms, one with low and one with high reverberation, so that voice separation could be studied without extra background noise.
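To make the mixing step concrete, the following is a minimal sketch of how overlapping audio can be created at a chosen power ratio between the human and robot signals. The file names, the 0 dB ratio, and the use of numpy/soundfile are illustrative assumptions, not the exact tooling used in the study.

```python
# Illustrative sketch (not the study's exact tooling): mix a human-speech recording
# with a robot-speech-plus-fan recording at a chosen signal-to-noise ratio.
# File paths and the SNR value are placeholders; mono audio is assumed.
import numpy as np
import soundfile as sf

def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so that `target` sits `snr_db` dB above it, then sum."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2) + 1e-12
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    return target + gain * interferer

human, sr = sf.read("human_speech.wav")         # clean human speech (placeholder path)
robot, _ = sf.read("pepper_speech_fan.wav")     # robot voice + fan noise (placeholder path)
mixture = mix_at_snr(human, robot, snr_db=0.0)  # 0 dB: equal power, as an example
sf.write("mixture.wav", mixture, sr)
```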
Signal Processing Techniques
We used two main approaches to improve how robots handle overlapping speech:
Signal Processing-Based Method: In this method, we analyzed the audio signals in the time-frequency domain with the goal of estimating a mask that passes the human speech while suppressing the robot's voice and background noise.
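As an illustration of the general idea of time-frequency masking, the sketch below estimates a Wiener-style mask from the mixture and a reference of the robot's own speech. It is a hypothetical simplification, not the paper's algorithm: it assumes the ego-speech reference is time-aligned with the mixture, has the same length, and ignores the room's transfer function.

```python
# Hypothetical time-frequency masking sketch: since the robot knows the signal it is
# playing, a rough spectral estimate of its own speech can be subtracted from the
# mixture. This is a generic Wiener-style mask, not the paper's exact method.
import numpy as np
from scipy.signal import stft, istft

def mask_based_extraction(mixture, ego_reference, fs, nperseg=1024):
    # Assumes mixture and ego_reference are mono, equal length, and time-aligned.
    _, _, M = stft(mixture, fs=fs, nperseg=nperseg)        # mixture spectrogram
    _, _, E = stft(ego_reference, fs=fs, nperseg=nperseg)   # robot ego-speech spectrogram
    mix_pow = np.abs(M) ** 2
    ego_pow = np.abs(E) ** 2
    # Wiener-like mask: keep the fraction of power not explained by the robot's signal.
    target_pow = np.maximum(mix_pow - ego_pow, 0.0)
    mask = target_pow / (mix_pow + 1e-12)
    _, estimate = istft(mask * M, fs=fs, nperseg=nperseg)   # back to the time domain
    return estimate
```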
Neural Network-Based Method: We also tested a convolutional recurrent neural network (CRNN), a deep learning model trained on the data we collected to learn how to identify and separate the different sounds, so that it can recognize human speech even when the robot is speaking at the same time.
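For readers who want a concrete picture of what a CRNN mask estimator looks like, here is a minimal PyTorch sketch. The layer sizes, the GRU, and the overall layout are illustrative assumptions and do not reproduce the paper's architecture; the input is a mixture magnitude spectrogram and the output is a mask to apply to it.

```python
# Minimal CRNN mask estimator (illustrative only, not the paper's architecture).
# Input: mixture magnitude spectrogram of shape (batch, 1, freq, time).
import torch
import torch.nn as nn

class CRNNMaskEstimator(nn.Module):
    def __init__(self, n_freq=513, conv_channels=16, rnn_hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, conv_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(conv_channels),
            nn.ReLU(),
            nn.Conv2d(conv_channels, conv_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(conv_channels),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(conv_channels * n_freq, rnn_hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * rnn_hidden, n_freq)

    def forward(self, mag):                        # mag: (batch, 1, freq, time)
        x = self.conv(mag)                         # (batch, C, freq, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, time, C*freq)
        x, _ = self.rnn(x)                         # (batch, time, 2*hidden)
        mask = torch.sigmoid(self.fc(x))           # (batch, time, freq), values in [0, 1]
        return mask.permute(0, 2, 1).unsqueeze(1)  # (batch, 1, freq, time)
```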
Results
Voice Recognition Performance
The main measure of success for our methods is the accuracy of speech recognition, evaluated through tests on the recordings. We looked at two specific metrics:
- Word Error Rate (WER): The proportion of word-level mistakes (substitutions, insertions, and deletions) the recognition system makes when transcribing the human speech.
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR): This metric evaluates the quality of the separated speech compared to the original target speech; a simple computation of both metrics is sketched below.
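The following self-contained sketch shows how the two metrics can be computed. The SI-SDR formula is the standard one; the WER here uses a plain word-level edit distance, whereas in the study the hypothesis transcript would come from an automatic speech recognizer.

```python
# Reference implementations of the two evaluation metrics (standard definitions).
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = alpha * reference               # scaled projection onto the reference
    noise = estimate - target                # everything not explained by the target
    return 10 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + 1e-12))

def wer(reference_words, hypothesis_words):
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = reference_words, hypothesis_words
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(r), len(h)] / max(len(r), 1)

print(wer("the robot is speaking".split(), "the robot speaking".split()))  # 0.25
```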
Testing in rooms with low and high reverberation revealed a clear pattern: with low reverberation, our processing methods significantly improved recognition of the human voice, but performance dropped noticeably in the room with high reverberation.
Comparison of Methods
We found that the signal processing approach without post-filtering performed best under low reverberation. In contrast, the neural network method was more robust to reverberant environments, although it did not perform as well as we hoped in certain noisy situations.
Overall, while the signal processing method showed promise under specific conditions, the neural network method proved to handle variations in the environment better.
Challenges and Limitations
Despite the promising performance of our methods, we encountered several challenges:
Reverberation and Robot Noise: Reverberation of the robot's own voice negatively impacted performance, and the robot's speech often had more power than the human voice, complicating the recognition process.
Distortion Issues: Our signal processing approach sometimes resulted in distortion, making the output sound unnatural. This distortion occurs when speech signals are over-filtered, leading to missing or garbled sound segments.
Training Data Size: Although we used a specific dataset for training, it was significantly smaller than what other advanced methods utilize. A larger dataset may improve the learning process and overall system performance.
Future Directions
To improve the performance of robot speech recognition, we aim to take several steps:
Enhanced Noise Reduction: Developing improved methods to filter out echoes and background noise could help the system better isolate human speech.
Larger Training Datasets: By collecting more varied and extensive training data, we can enhance the machine learning model's understanding and accuracy.
Real-World Testing: Implementing our system in actual robot interactions will allow for more practical evaluations of its effectiveness and areas for improvement.
Conclusion
This study highlights the importance of improving communication between humans and robots. By working on methods to filter out a robot's own voice during conversation, we can create more natural and effective interactions. The results indicate that while we have made progress, more research and development are necessary to fully address the challenges faced in real-world scenarios. Through dedicated efforts, we hope to enhance the ability of robots to understand and respond to human speech effectively, even in noisy environments.
Title: Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction
Abstract: In this paper, we study how well human speech can automatically be filtered when this overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario where the microphone can remain open when the robot is speaking, enabling a more natural turn-taking scheme where the human can interrupt the robot. To respond appropriately, the robot would need to understand what the interlocutor said in the overlapping part of the speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE can be accomplished in the context of the popular social robot Pepper, we set out to manufacture a dataset composed of a mixture of recorded speech of Pepper itself, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in a room with low reverberation and high reverberation. Comparing a signal processing approach, with and without post-filtering, and a convolutional recurrent neural network (CRNN) approach to a state-of-the-art speaker identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best performance in terms of Word Error Rate on the overlapping speech signals with low reverberation, while the CRNN approach is more robust for reverberation. These results show that estimating the human voice in overlapping speech with a robot is possible in real-life application, provided that the room reverberation is low and the human speech has a high volume or high pitch.
Authors: Yue Li, Koen V Hindriks, Florian Kunneman
Last Update: 2024-03-05 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2403.02918
Source PDF: https://arxiv.org/pdf/2403.02918
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.