Simple Science

Cutting edge science explained simply

Electrical Engineering and Systems Science | Sound | Machine Learning | Audio and Speech Processing

Advancements in Target-Speaker Speech Recognition

New model improves speech recognition in noisy environments by focusing on a single speaker.

― 4 min read


[Figure: Target-Speaker ASR breakthrough. New model transforms speech recognition amid background noise.]

Target-speaker automatic speech recognition (TS-ASR) is a technology that can listen to a group of people talking and pick out just one specific person's voice. This is useful in situations where many people are speaking at once, like in meetings or crowded places. The idea is to focus on the desired speaker's words while ignoring others around them.

Methods of Speech Recognition

There are different ways to do speech recognition in crowded environments. One approach is called blind source separation (BSS), which tries to split a mixed recording into its individual voices; a standard speech recognition system then transcribes each separated stream. However, this cascade has a weakness: the separation step is optimized for signal quality rather than for the recognizer that follows, so the separated audio may not be well suited to speech recognition.
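
Conceptually, the two-stage cascade looks like the sketch below. `separate_sources` and `transcribe` are hypothetical placeholders standing in for a real separation model and a conventional ASR system, not a real library API.

```python
from typing import List

def separate_sources(mixture: List[float], n_speakers: int) -> List[List[float]]:
    # Placeholder: a real BSS model would return one estimated waveform
    # per speaker in the mixture.
    return [mixture] * n_speakers

def transcribe(waveform: List[float]) -> str:
    # Placeholder: a conventional single-speaker ASR system would decode
    # text from the (hopefully clean) waveform.
    return "<transcript>"

def bss_then_asr(mixture: List[float], n_speakers: int) -> List[str]:
    # Two independent stages: any artifacts the separator introduces are
    # passed straight to a recognizer that was never trained on them.
    return [transcribe(s) for s in separate_sources(mixture, n_speakers)]
```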

Another approach is multi-speaker ASR, which produces transcripts for all speakers at once. It can handle complex overlapping speech, but it typically must be set up for a fixed number of speakers in advance, which can be a downside, and its performance can suffer when that number changes.

TS-ASR stands out because it only requires information about one target speaker. It aims to transcribe that person's words without being confused by other voices. This design makes overlapping speech easier to handle, since the system never has to track all of the other speakers.

The Proposed Model

This article introduces a new model called CONF-TSASR, designed for single-channel target-speaker ASR. It has three main parts (a minimal code sketch follows the list):

  1. TitaNet: This part takes a sample of the target speaker's voice and turns it into a unique profile, or embedding, for that speaker.
  2. MaskNet: This component generates a mask that isolates the target speaker's voice within the mixture, suppressing the other sounds.
  3. ASR Module: This last part reads the filtered audio and transcribes just the words spoken by the target speaker.
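
To make the three-part design concrete, here is a minimal PyTorch sketch of the overall data flow. The simple layers below are stand-ins for the actual TitaNet and Conformer modules, and all dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TargetSpeakerASRSketch(nn.Module):
    def __init__(self, n_mels=80, emb_dim=192, hidden=256, vocab=29):
        super().__init__()
        # Stand-in for TitaNet: pool the enrollment spectrogram into one vector.
        self.speaker_encoder = nn.Sequential(
            nn.Conv1d(n_mels, emb_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # -> (batch, emb_dim, 1)
        )
        # Stand-in for the Conformer MaskNet: predicts a (0, 1) mask per
        # time-frequency bin, conditioned on the speaker embedding.
        self.mask_net = nn.Sequential(
            nn.Linear(n_mels + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_mels),
            nn.Sigmoid(),
        )
        # Stand-in for the Conformer ASR module: frame-wise token logits for CTC.
        self.asr = nn.Sequential(
            nn.Linear(n_mels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab),
        )

    def forward(self, mixture, enrollment):
        # mixture, enrollment: (batch, n_mels, time) log-mel spectrograms.
        emb = self.speaker_encoder(enrollment).squeeze(-1)        # (B, emb_dim)
        emb_t = emb.unsqueeze(1).expand(-1, mixture.size(2), -1)  # (B, T, emb_dim)
        mix_t = mixture.transpose(1, 2)                           # (B, T, n_mels)
        mask = self.mask_net(torch.cat([mix_t, emb_t], dim=-1))   # (B, T, n_mels)
        masked = mix_t * mask             # keep only the target speaker's energy
        logits = self.asr(masked)                                 # (B, T, vocab)
        return logits, masked

model = TargetSpeakerASRSketch()
mix = torch.randn(2, 80, 200)      # two mixtures, 200 frames each
enroll = torch.randn(2, 80, 150)   # enrollment clips for the target speakers
logits, masked_spec = model(mix, enroll)
print(logits.shape)  # torch.Size([2, 200, 29])
```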

The model is trained with two loss functions: Connectionist Temporal Classification (CTC) loss for the transcription, and a new scale-invariant spectrogram reconstruction loss that encourages the model to separate the target speaker's spectrogram cleanly from the mixture.
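
Below is a hedged sketch of what such a joint objective could look like. The scale-invariant form mirrors the SI-SDR idea (measure error only after optimally scaling the reference), but the paper's exact formulation and loss weighting may differ. In the architecture sketch above, `masked_spec` would be the MaskNet output and `clean_spec` the spectrogram of the target speaker's clean utterance.

```python
import torch
import torch.nn.functional as F

def si_spectrogram_loss(est, ref, eps=1e-8):
    """Scale-invariant reconstruction loss between estimated and clean
    target spectrograms, both shaped (batch, time, n_mels)."""
    est = est.flatten(1)
    ref = ref.flatten(1)
    # Optimal scaling of the reference removes sensitivity to overall gain.
    alpha = (est * ref).sum(1, keepdim=True) / (ref.pow(2).sum(1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    ratio = target.pow(2).sum(1) / (noise.pow(2).sum(1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()  # higher ratio -> lower loss

def joint_loss(logits, masked_spec, clean_spec, targets, in_lens, tgt_lens, w=0.1):
    # CTC expects (time, batch, vocab) log-probabilities.
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
    recon = si_spectrogram_loss(masked_spec, clean_spec)
    return ctc + w * recon  # w (assumed here) balances transcription vs. separation
```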

Training and Performance

In tests involving two and three overlapping speakers, the new model achieved lower error rates than existing models. Performance was evaluated on several datasets, and the model set new state-of-the-art results in some cases, for example a 4.2% target-speaker word error rate on the WSJ0-2mix-extr benchmark.
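
The error rates here are word error rates (WER): the number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal self-contained version for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```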

To prepare training data, mixed audio was created by combining different speakers' utterances while maintaining good sound quality. The data was further augmented by randomly perturbing the speed and volume of utterances, making the model robust to different listening conditions.
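
A rough sketch of how such mixtures and perturbations can be produced is below. The gain and speed ranges and the crude index-based resampling are illustrative assumptions, not the paper's exact recipe; a real pipeline would use a proper resampler (e.g. sox or torchaudio).

```python
import random

import numpy as np

def perturb(wave: np.ndarray) -> np.ndarray:
    # Volume perturbation: scale by a random gain.
    wave = wave * random.uniform(0.5, 2.0)
    # Crude speed perturbation via index resampling.
    rate = random.uniform(0.9, 1.1)
    idx = np.arange(0, len(wave) - 1, rate)
    return np.interp(idx, np.arange(len(wave)), wave)

def mix(target: np.ndarray, interferer: np.ndarray) -> np.ndarray:
    # Overlap the interfering speech onto the target, trimmed to equal length.
    n = min(len(target), len(interferer))
    return target[:n] + interferer[:n]

sr = 16000
target = np.random.randn(3 * sr).astype(np.float32)      # stand-in for a 3 s clip
interferer = np.random.randn(3 * sr).astype(np.float32)
mixture = mix(perturb(target), perturb(interferer))
```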

Results and Analysis

The results showed that CONF-TSASR outperformed several conventional speech recognition models, especially at handling overlapping voices. Even when the target speaker's voice was mingled with others, the system still accurately captured what that speaker was saying.

When looking at how background noise affected performance, the model was more sensitive with three overlapping speakers than with two. It maintained a good level of performance, but the task became harder as the number of interfering voices and the noise level increased.

Furthermore, the model achieved strong results on datasets for which target-speaker error rates had not been reported before, LibriSpeech2Mix and LibriSpeech3Mix. It transcribed the target speaker's speech well there too, showing its adaptability across different audio scenarios.

Conclusion

CONF-TSASR represents an important advance in target-speaker speech recognition. By focusing on one specific speaker, it can improve how speech is transcribed in noisy environments, and because it needs only a single sample of the target speaker's voice, it is comparatively simple to deploy in real-world scenarios.

The model has proven its capabilities through extensive testing and has established new benchmarks on several datasets. This can lead to better speech recognition in applications such as virtual assistants and transcription services. With its open-source release through the NVIDIA NeMo toolkit, further development and improvements can be expected from the community, promising exciting new possibilities in speech technology.

Original Source

Title: Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

Abstract: We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet based speaker embedding module, a Conformer based masking as well as ASR modules. These modules are jointly optimized to transcribe a target-speaker, while ignoring speech from other speakers. For training we use Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction loss to encourage the model better separate the target-speaker's spectrogram from mixture. We obtain state-of-the-art target-speaker word error rate (TS-WER) on WSJ0-2mix-extr (4.2%). Further, we report for the first time TS-WER on WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%) and LibriSpeech3Mix (7.6%) datasets, establishing new benchmarks for TS-ASR. The proposed model will be open-sourced through NVIDIA NeMo toolkit.

Authors: Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, Boris Ginsburg

Last Update: 2023-08-09

Language: English

Source URL: https://arxiv.org/abs/2308.05218

Source PDF: https://arxiv.org/pdf/2308.05218

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
