Addressing Speech Disfluencies in Indian English
New dataset aims to improve understanding of stuttering in voice assistants.
Priyanka Kommagouni, Vamshiraghusimha Narasinga, Purva Barche, Sai Akarsh C, Anil Vuppala
― 6 min read
Table of Contents
- The Importance of Differentiating Disfluencies
- Introducing IIITH-TISA: A New Dataset
- A Closer Look at Speech Patterns
- Challenges in Researching Stuttering
- Early Detection of Stuttering in Children
- Understanding Disfluency Types
- Building the Dataset
- What Makes a Good Feature?
- How Does Classification Work?
- The Role of Shifted Delta Cepstra (SDC)
- Breaking Down the Dataset Collection
- Evaluating the Models
- Results of the Research
- Conclusion and Future Directions
- Acknowledgments
- Original Source
- Reference Links
When people talk, things rarely go perfectly. You might hesitate, repeat a word, or pause briefly. These hiccups in speech are called disfluencies. Some, like saying "um" or "uh," are natural parts of everyday speech; these are called typical disfluencies. Others, especially those seen in people who stutter, can signal a speech disorder and are called atypical. Understanding the difference is important, especially for creating better voice assistants that can help those who stutter.
The Importance of Differentiating Disfluencies
Voice assistants often misjudge when someone has finished speaking. For people who stutter, this can lead to frustration and interruptions at awkward moments. It's a bit like trying to tell a joke while someone keeps cutting in before the punchline. Recognizing the difference between typical and atypical disfluencies can also support early diagnosis of stuttering in kids, making sure they get the right help before things get harder to treat.
Introducing IIITH-TISA: A New Dataset
To tackle the issue of speech disfluencies in Indian English, a new dataset called IIITH-TISA was created. Think of it as a treasure trove of speech samples covering different kinds of speech stumbles. It is the first Indian English stammer corpus and captures atypical disfluencies. This dataset matters because most prior research has focused on British and American English, leaving a gap when it comes to Indian speakers.
A Closer Look at Speech Patterns
While studying speech, researchers found that typical disfluencies occur in about 6% of speech. That means if you say 100 words, 6 of them might come out as "um" or "like." On the other hand, stuttering can be a whole different ballgame, affecting around 70 million people globally. It’s essential to recognize that not all disfluencies are the same; they stem from different causes.
Challenges in Researching Stuttering
Research into stuttering has mainly focused on finding ways to detect and fix speech errors. However, many individuals who stutter find it annoying when voice assistants interrupt them too soon. Imagine talking, and a robot decides you’re done before you’ve even finished your sentence. That’s just rude! Some researchers are trying to adjust systems to make them more mindful, but it’s a tricky balance because what works for one person might not work for another.
Early Detection of Stuttering in Children
It's also vital to catch disfluencies early in children, as stuttering is often mistaken for normal language development hiccups. Kids as young as two may start to realize they have a stutter, which can make them hesitant to speak. Early intervention can make a world of difference, so identifying patterns in speech is key.
Understanding Disfluency Types
Disfluencies come in several types, including filled pauses, prolongations, and repetitions. Typical repetitions are common in everyday speech and usually don't signal a problem. But for people who stutter, repetitions can be tied to physical tension in the voice. Studying how these variations manifest can help us build better tools for everyone.
Building the Dataset
The IIITH-TISA dataset was built to include various types of disfluencies. Using recordings from people who stutter, researchers collected diverse examples of speech. The team carefully selected recordings to capture the true nature of stuttering, focusing on natural speech without background noise, and annotated each clip to indicate when a disfluency occurred, amassing more than 3,000 audio clips. To provide matching examples of typical disfluencies, the researchers also extended the existing IIITH-IED dataset with detailed annotations.
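To give a feel for what such annotations might look like, here is a hypothetical record in Python. The field names and label strings are illustrative only, not the corpus's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DisfluencyAnnotation:
    """Hypothetical per-event annotation record; the schema here is
    illustrative, not the actual IIITH-TISA annotation format."""
    clip_id: str      # identifier of the audio clip
    start_sec: float  # onset of the disfluency within the clip
    end_sec: float    # offset of the disfluency within the clip
    label: str        # e.g. "repetition", "prolongation", "filled_pause"

# One made-up example: a repetition lasting 0.75 seconds.
example = DisfluencyAnnotation("clip_0001", 2.35, 3.10, "repetition")
```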
What Makes a Good Feature?
In speech analysis, "features" are the measurable characteristics we extract to understand speech patterns. The researchers proposed a feature called Perceptually Enhanced Zero-Time Windowed Cepstral Coefficients (PE-ZTWCC). It sounds fancy, but in simple terms, it captures the nuances of speech more faithfully, especially the differences in how typical and atypical disfluencies sound.
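As a rough illustration of what "cepstral coefficients" are, the sketch below computes a plain real cepstrum for one frame of audio. This is not PE-ZTWCC itself, which adds the zero-time windowing and perceptual enhancement described in the paper; it only shows the generic FFT, log, inverse-transform pipeline that cepstral features share.

```python
import numpy as np

def simple_cepstral_coefficients(frame, num_coeffs=13):
    """Plain real cepstrum of one speech frame: FFT -> log magnitude
    -> inverse real FFT. Keeps only the first num_coeffs values."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10  # avoid log(0)
    cepstrum = np.fft.irfft(np.log(spectrum))
    return cepstrum[:num_coeffs]

# Usage with a made-up 25 ms frame at 16 kHz (400 samples).
frame = np.random.randn(400)
print(simple_cepstral_coefficients(frame).shape)  # (13,)
```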
How Does Classification Work?
To classify the differences in speech, a shallow Time Delay Neural Network (TDNN) was used. The model looks at short stretches of audio to decide whether the speech is typical or stuttered. Keeping the network shallow matters because a deeper model analyzing longer snippets would be harder to train well on a smaller dataset.
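The summary does not give the exact network topology, so the following PyTorch sketch is only a plausible shape for a shallow TDNN: 1-D convolutions over time (the classic TDNN layer), statistics pooling, and a linear head. The feature dimension, layer sizes, kernel widths, and two-class output are all assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ShallowTDNN(nn.Module):
    """Minimal shallow TDNN sketch: two temporal convolutions,
    mean+std pooling over time, and a linear classifier."""

    def __init__(self, feat_dim=91, num_classes=2):  # both assumed
        super().__init__()
        # Each Conv1d acts as a TDNN layer: it mixes a short window of
        # frames (kernel_size, widened by dilation) at every time step.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, dilation=1),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, dilation=2),
            nn.ReLU(),
        )
        # Mean + std pooling collapses a variable-length clip to one vector.
        self.classifier = nn.Linear(128 * 2, num_classes)

    def forward(self, x):
        # x: (batch, feat_dim, num_frames)
        h = self.tdnn(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.classifier(stats)  # class logits

# Usage with a made-up batch of 4 clips, 100 frames each.
logits = ShallowTDNN()(torch.randn(4, 91, 100))
print(logits.shape)  # torch.Size([4, 2])
```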
The Role of Shifted Delta Cepstra (SDC)
To improve the model further, the researchers added Shifted Delta Cepstra (SDC) features, which capture how speech changes over time. Combined with PE-ZTWCC, they give the classifier both a local view and a wider temporal context for distinguishing different kinds of disfluencies. It is like widening a camera's field of view: the model sees how the speech evolves across several frames rather than one frame at a time.
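A minimal NumPy sketch of the standard SDC computation is below: for each frame, it stacks k delta vectors, each taken P frames further into the future. The parameter values (d, P, k) are common defaults in the SDC literature, not necessarily the paper's configuration.

```python
import numpy as np

def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Standard SDC: for frame t, stack k delta blocks
    delta(t + i*p) = c[t + i*p + d] - c[t + i*p - d], i = 0..k-1.

    cepstra: (num_frames, num_coeffs) array of cepstral coefficients.
    Returns an array of shape (num_frames, num_coeffs * k); indices
    near the edges are clamped to the valid frame range.
    """
    t_max, n = cepstra.shape
    out = np.zeros((t_max, n * k))
    for t in range(t_max):
        blocks = []
        for i in range(k):
            hi = min(t + i * p + d, t_max - 1)  # clamp to last frame
            lo = max(t + i * p - d, 0)          # clamp to first frame
            blocks.append(cepstra[hi] - cepstra[lo])
        out[t] = np.concatenate(blocks)
    return out

# Usage: 100 frames of made-up 13-dim cepstra -> 91-dim SDC frames.
print(shifted_delta_cepstra(np.random.randn(100, 13)).shape)  # (100, 91)
```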
Breaking Down the Dataset Collection
The dataset creation involved teamwork. A group of six students underwent training to learn how to spot and categorize different types of disfluencies. They paid attention to details like how long a stutter lasted and what kind of stutter it was. This collaborative effort made the dataset more accurate and useful for research.
Evaluating the Models
To see how well the model worked, the researchers compared their new features against traditional speech analysis features, measuring how often each model correctly identified typical and atypical disfluencies. The PE-ZTWCC features outperformed the others, reaching an average F1 score of 85.01% for disfluency classification.
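As a small sketch of how such a comparison might be scored, here is a macro-averaged F1 computation with scikit-learn. The labels are made up, and the choice of macro averaging is an assumption; the paper reports an average F1 without specifying the averaging scheme in this summary.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 0 = typical disfluency, 1 = atypical disfluency.
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

# Macro-averaged F1 weights both classes equally, which matters when
# one disfluency category is rarer than the other.
print(f"macro F1 = {f1_score(y_true, y_pred, average='macro'):.4f}")
```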
Results of the Research
When comparing the types of disfluencies, results indicated that repetitions were more easily identified than filled pauses or prolongations. It’s like recognizing someone’s laugh in a crowded room—there's something distinctive about it that stands out. This finding helps researchers understand how to better tailor their models to recognize different speech patterns.
Conclusion and Future Directions
The IIITH-TISA dataset represents a significant step forward in understanding speech disfluencies in the Indian context. It opens doors for future research aimed at improving voice assistants and speech therapy tools for those who stutter. By enhancing our understanding of speech patterns, we can create more inclusive technology that respects and accommodates different ways of communicating.
Acknowledgments
A big shoutout goes to all those who shared their stories and experiences. It’s a reminder that everyone has a voice, and sometimes, the best way to support one another is to listen—truly listen—before jumping in with solutions.
Original Source
Title: Typical vs. Atypical Disfluency Classification: Introducing the IIITH-TISA Corpus and Temporal Context-Based Feature Representations
Abstract: Speech disfluencies in spontaneous communication can be categorized as either typical or atypical. Typical disfluencies, such as hesitations and repetitions, are natural occurrences in everyday speech, while atypical disfluencies are indicative of pathological disorders like stuttering. Distinguishing between these categories is crucial for improving voice assistants (VAs) for Persons Who Stutter (PWS), who often face premature cutoffs due to misidentification of speech termination. Accurate classification also aids in detecting stuttering early in children, preventing misdiagnosis as language development disfluency. This research introduces the IIITH-TISA dataset, the first Indian English stammer corpus, capturing atypical disfluencies. Additionally, we extend the IIITH-IED dataset with detailed annotations for typical disfluencies. We propose Perceptually Enhanced Zero-Time Windowed Cepstral Coefficients (PE-ZTWCC) combined with Shifted Delta Cepstra (SDC) as input features to a shallow Time Delay Neural Network (TDNN) classifier, capturing both local and wider temporal contexts. Our method achieves an average F1 score of 85.01% for disfluency classification, outperforming traditional features.
Authors: Priyanka Kommagouni, Vamshiraghusimha Narasinga, Purva Barche, Sai Akarsh C, Anil Vuppala
Last Update: 2024-11-26 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2411.17149
Source PDF: https://arxiv.org/pdf/2411.17149
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.