Advancements in Speech Recognition for People with Disabilities
New methods improve communication tools for individuals with speech difficulties.
Macarious Hui, Jinda Zhang, Aanchan Mohan
― 7 min read
People with conditions like cerebral palsy and ALS often have a tough time speaking clearly. This can make it hard for them to get their needs across, especially in healthcare settings where clear communication is key. When doctors and patients can't understand each other, the quality of care suffers. To help close this gap, we are working on a tool that uses technology to help these individuals communicate better.
However, many current speech recognition tools struggle with non-standard speech patterns, mainly because they have seen very little of this kind of speech during training. Systems built for typical speakers, like Whisper and Wav2vec2.0, often fail to pick up words when the speaker has a speech difficulty. This leaves a big gap when trying to support people with speech difficulties using these tools.
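To make that gap concrete, here is a minimal sketch of running two such off-the-shelf systems on a single recording with the Hugging Face transformers library. The model checkpoints and the audio file name are illustrative assumptions, not the exact setup from our experiments.

```python
# A minimal sketch of running off-the-shelf ASR models on one recording.
# The checkpoints and the audio path are illustrative assumptions.
from transformers import pipeline

audio_path = "torgo_sample.wav"  # hypothetical 16 kHz mono recording

# Whisper: encoder-decoder ASR trained largely on typical speech
whisper_asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
print("Whisper: ", whisper_asr(audio_path)["text"])

# Wav2vec2.0: CTC model fine-tuned on typical English speech
wav2vec_asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
print("Wav2vec2:", wav2vec_asr(audio_path)["text"])
```

Running models like these on atypical speech is exactly where transcription quality tends to break down.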
One common way to test how well speech recognition works for people with speech difficulties is a dataset called TORGO. But there's a catch: the dataset has overlapping phrases, meaning the same prompts are read by both the speakers used for training and the speakers held out for testing. When that happens, a system can look better than it really is, because it has effectively seen the test phrases before.
We found a way to deal with this overlap problem, and we’re excited to share our findings!
The Challenge of Speech Difficulties
For many people with conditions like ALS and cerebral palsy, speaking can be a major hurdle. This is due to weakness or paralysis affecting the muscles used for speech. As a result, they might have slurred speech or unusual speech patterns, which can lead to miscommunication.
In healthcare settings, where accurate information is vital, these issues can decrease the quality of care. The good news is that there are tools designed to help, known as augmentative and alternative communication (AAC) tools. These tools are built to assist individuals with speech difficulties to express themselves better.
Modern AAC tools like SpeakEase offer the ability to recognize the user’s speech and convert it into text. This gives everyone a better chance to communicate. But the challenge here is that speech recognition tools often have limitations when it comes to understanding atypical speech.
A lot of speech recognition technology is trained on typical speech, leaving those with speech difficulties in a tough spot.
Tackling Speech Recognition Problems
Speech recognition systems need enough data to learn effectively. Unfortunately, data for atypical speech is scarce. While there are many datasets for typical speech, tools often hit a wall when trying to recognize atypical speech because there are so few training examples. This makes it tough for speech recognition software to work well for people who have speech difficulties.
To build a better tool, one idea is to use a first-pass recognition system that produces an initial guess of what the person is saying, followed by a second, error-correction pass that fixes misrecognized words.
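The sketch below shows the shape of that two-pass idea: a first-pass recognizer produces a hypothesis, and a second pass rewrites it. Both components here are placeholders rather than the actual models used in our work.

```python
# A high-level sketch of two-pass recognition: first-pass ASR, then correction.
# The ASR checkpoint and the correction step are stand-ins, not our real system.
from transformers import pipeline

def first_pass(audio_path: str) -> str:
    """Produce an initial transcript with an off-the-shelf recognizer."""
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    return asr(audio_path)["text"]

def second_pass(hypothesis: str) -> str:
    """Rewrite likely recognition errors.

    In practice this would be a domain-specific error-correction model;
    here a trivial clean-up step only illustrates the interface.
    """
    return hypothesis.strip().lower()

transcript = second_pass(first_pass("clinic_request.wav"))  # hypothetical file
print(transcript)
```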
One part of our process involved checking whether we could build a better dataset that doesn't include overlapping phrases. This gives a more honest picture of how accurately speech recognition works for these individuals.
Evaluating Speech Recognition with TORGO
TORGO is commonly used to test how well speech recognition works for people with speech difficulties. It has recordings from eight speakers who have different levels of speech difficulties, as well as recordings from people with normal speech. The range of data includes single words and entire sentences, which helps to create a more balanced dataset.
However, there’s a significant amount of overlap in the phrases used across different speakers, which can distort the accuracy when testing new systems. If a phrase is already known because it’s been used before, it doesn’t truly test how well the tool can recognize the speech.
In our work, we paid close attention to this overlap issue because it can lead to inflated performance numbers. When reviewing the performance of speech recognition systems, it’s crucial to have a solid understanding of how the tool performs on its own without any advantages from memorized phrases.
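As a rough illustration, prompt overlap can be quantified by counting how many test-speaker phrases also appear verbatim among the training-speaker phrases. The tiny prompt lists below are made-up stand-ins for the real TORGO transcripts.

```python
# A minimal sketch of quantifying prompt overlap between training and test speakers.
# The prompt sets are invented examples, not actual TORGO prompts.
def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().split())

train_prompts = {"the quick brown fox", "say ah", "pa pa pa"}      # training speakers
test_prompts = {"The quick brown fox", "call the nurse please"}    # held-out speaker

shared = {normalize(p) for p in test_prompts} & {normalize(p) for p in train_prompts}
overlap_rate = len(shared) / len(test_prompts)
print(f"{len(shared)} overlapping prompt(s) ({overlap_rate:.0%} of the test set)")
```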
Creating a Better Dataset
To improve the situation, we created a new dataset called NP-TORGO. This dataset was generated by carefully selecting phrases so that there’s no overlap between what the training speakers used and what the testing speakers used. Essentially, we wanted to make sure each speaker was tested with phrases they hadn’t encountered during training.
To achieve this, we used an algorithm that partitions the prompts so that no phrase appears in both the training and test sets. This way, we can better evaluate how the speech recognition system performs on phrases it has never seen.
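The snippet below is a simplified stand-in for that idea, shown only to make it concrete: any phrase spoken by a held-out test speaker is reserved for the test side, and training utterances of those phrases are dropped. It is not the exact algorithm from our work, only an illustration of how the overlap can be broken.

```python
# An illustrative (not the paper's actual) way to remove prompt overlap.
def split_without_overlap(utterances, test_speakers):
    """utterances: iterable of (speaker_id, phrase, audio_path) tuples."""
    # Any phrase spoken by a held-out (test) speaker is reserved for the test set.
    test_phrases = {phrase for spk, phrase, _ in utterances if spk in test_speakers}

    train, test = [], []
    for spk, phrase, path in utterances:
        if spk in test_speakers:
            test.append((spk, phrase, path))
        elif phrase not in test_phrases:
            # Training utterances of a reserved phrase are dropped to break overlap.
            train.append((spk, phrase, path))
    return train, test
```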
After solving the overlap issue, we wanted to see how this improved the performance of different speech recognition systems.
Experimenting with Speech Recognition
In our experiments, we checked out how various versions of the Wav2vec2 architecture performed with the new NP-TORGO dataset. We also looked at how well other off-the-shelf systems, like Whisper, performed when confronted with atypical speech.
During the process, we discovered some key points. One major finding was that when a speech recognition system was tested on the original TORGO dataset, it appeared to perform well. But when we tested it on NP-TORGO, word error rates rose sharply. This suggested that much of the original success came from the overlapping phrases rather than from true recognition capability.
We also evaluated the role language models play in this process. Language models help predict what the next word should be based on what has already been said. In the context of NP-TORGO, we noticed that language models trained on text from outside the dataset seemed to help more once the overlap was removed.
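For readers curious what adding a language model looks like in practice, here is a hedged sketch of combining a Wav2vec2 CTC model with an external n-gram language model via the pyctcdecode package. The KenLM file name is a placeholder, and the exact configuration in our experiments may differ.

```python
# A hedged sketch of n-gram language-model fusion for Wav2vec2 CTC decoding.
# The KenLM path is a placeholder; the checkpoint is a standard public model.
import torch
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# pyctcdecode expects labels ordered by token id, with " " as the word delimiter.
vocab = processor.tokenizer.get_vocab()
labels = [tok.replace("|", " ") for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels, kenlm_model_path="out_of_domain_4gram.arpa")  # placeholder LM

def transcribe(waveform, sampling_rate=16000):
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0].numpy()
    # Beam search over the CTC logits, rescored by the external n-gram LM.
    return decoder.decode(logits)
```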
Results of Our Experiments
The results from our experiments shed light on how both the speech recognition and language models work together. We looked closely at the word error rates (WER) and other performance indicators to gauge the effectiveness of different approaches.
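Word error rate counts the substitutions, insertions, and deletions needed to turn the system's output into the reference transcript, divided by the number of reference words. Here is a tiny example using the jiwer package; the sentences are made up.

```python
# WER = (substitutions + insertions + deletions) / reference word count.
import jiwer

reference = "please call the nurse"
hypothesis = "please call the north"

print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")  # 1 substitution / 4 words = 0.25
```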
From our results, it was evident that simply using standard language models wasn’t enough in cases with atypical speech. Instead, we found that a cross-modal error-correction system called Whispering-LLaMA showed some promise.
This system takes audio input and uses that to improve the accuracy of the transcribed text generated by the speech recognition tool. While this was helpful in some ways, it also highlighted that there is still a long way to go before these systems can adequately support those with speech difficulties.
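To give a flavour of generative error correction, the sketch below prompts a general-purpose language model with an ASR system's candidate transcriptions and asks for a corrected sentence. This is a text-only simplification: Whispering-LLaMA additionally fuses Whisper's audio features into the language model, which is omitted here, and the checkpoint name is just a placeholder.

```python
# A simplified, text-only sketch of LLM-based error correction over N-best lists.
# The model checkpoint is a placeholder; the real system also uses audio features.
from transformers import pipeline

corrector = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")

def correct(nbest):
    prompt = (
        "These are candidate transcriptions of the same utterance:\n"
        + "\n".join(f"- {h}" for h in nbest)
        + "\nThe most likely intended sentence is:"
    )
    out = corrector(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()

print(correct(["please caul the nurse", "please call the north", "please call the nurse"]))
```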
Conclusion for a Better Tomorrow
In our quest to improve communication for individuals with speech difficulties, we’ve come a long way, but there’s still much to do. While we’ve made strides in addressing the prompt overlap issue and leveraging error-correction systems, the fact remains that many speech recognition tools are not yet ready to serve those who need them most.
We hope that our findings will spark further research and development in this important area. By improving the tools available for those with speech difficulties, we can help ensure that everyone has access to clear and effective communication, making healthcare more accessible for all.
As we continue to delve into this critical field, we are optimistic that with more attention and resources, we can create a future where communication barriers are a thing of the past. After all, everyone deserves to be heard, no matter how their speech sounds.
Title: Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO
Abstract: Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns. In healthcare settings, communication breakdowns reduce the quality of care. While building an augmentative and alternative communication (AAC) tool to enable fluid communication, we found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data. Our work looks to leverage SOTA ASR followed by domain-specific error-correction. English dysarthric ASR performance is often evaluated on the TORGO dataset. Prompt-overlap is a well-known issue with this dataset where phrases overlap between training and test speakers. Our work proposes an algorithm to break this prompt-overlap. After reducing prompt-overlap, results with SOTA ASR models produce extremely high word error rates for speakers with mild and severe dysarthria. Furthermore, to improve ASR, our work looks at the impact of n-gram language models and large-language model (LLM) based multi-modal generative error-correction algorithms like Whispering-LLaMA for a second-pass ASR. Our work highlights how much more needs to be done to improve ASR for atypical speakers to enable equitable healthcare access both in-person and in e-health settings.
Authors: Macarious Hui, Jinda Zhang, Aanchan Mohan
Last Update: 2024-11-07
Language: English
Source URL: https://arxiv.org/abs/2411.00980
Source PDF: https://arxiv.org/pdf/2411.00980
Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.