Simple Science

Cutting edge science explained simply


Improving Emotional Recognition and Synthesis in Speech Models

New techniques enhance emotional understanding in speech processing tasks.




Speech Emotion Recognition (SER) and Emotional Text-to-Speech (TTS) are two important tasks in speech processing. SER identifies the emotion expressed in spoken audio, while Emotional TTS generates speech that conveys a target emotion from input text. Both tasks are attracting more attention as machine learning models become better at modeling human emotions.

A key factor for success in both tasks is how well speech emotions are represented. Good emotional representations make it easier to recognize emotions in speech and to generate more expressive speech. A common obstacle is that emotional speech data is scarce relative to neutral speech, so datasets are imbalanced, and most related work does not account for this.

Emotional speech data is harder and more expensive to collect than neutral speech, which leads to a focus on neutral data. This can cause models to favor the neutral emotion and struggle to recognize or generate emotional speech effectively. To tackle this, it is important to find ways to extract emotional representations that work well despite the lack of balanced data.

The Challenge of Imbalanced Datasets

One of the main challenges in training models for SER and Emotional TTS is the availability of data. Most datasets end up favoring neutral speech, which means emotional classes have fewer examples. This imbalance can cause models to perform poorly on recognizing or producing emotional speech.

Data Augmentation is a technique that helps in dealing with imbalanced datasets. By creating new, altered examples from the existing data, augmentation can reduce bias towards the more common neutral class. Some strategies include generating speech data using techniques like Generative Adversarial Networks (GANs).

Other methods have also been tried, such as creating new examples by mixing features from existing data. However, much of the focus has been on generating more speech data instead of improving emotional representation directly.

Importance of Emotional Representation

For effective Emotional TTS, having strong emotional representations is key. These representations can help produce speech that conveys the appropriate emotion. Some approaches have used style tokens that represent emotional features extracted from speech samples. These tokens are then applied to synthesized speech to enhance expressiveness.

More advanced methods, like RFTacotron, use sequences of vectors to capture emotional styles in detail. While these techniques are promising, they often struggle with imbalanced datasets. Models may easily overfit to the dominant neutral class, resulting in less expressive outputs.

To address this, a method called Mixup has gained popularity. This technique blends existing input samples to create new training examples. Studies have shown that Mixup improves performance in various tasks, including speech recognition.
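As a reference point, the standard Mixup formulation can be written as below. This is the general technique, not necessarily the exact variant used in the source paper; the symbols (two training samples with features x_i, x_j and one-hot labels y_i, y_j, and a hyperparameter alpha) are notation introduced here for illustration.

```latex
\tilde{x} = \lambda x_i + (1 - \lambda)\, x_j, \qquad
\tilde{y} = \lambda y_i + (1 - \lambda)\, y_j, \qquad
\lambda \sim \mathrm{Beta}(\alpha, \alpha), \quad \lambda \in [0, 1]
```

When lambda is close to 0 or 1, the mixed sample is nearly identical to one of the originals; values near 0.5 produce an even blend of both the inputs and their labels.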

A New Approach to Learning Emotional Representations

In this work, we propose a new method that combines different types of Mixup augmentation to learn emotional representations effectively from imbalanced data. By integrating both raw and latent-level Mixup, we can leverage the strengths of both methods.

In raw-level Mixup, two speech samples are blended at the input to create a new sample, which exposes the model to a wider variety of inputs. This helps the model capture more of the structure in the data, which is important for developing a robust representation. In latent-level Mixup, the blending happens on intermediate activations inside the network instead, so the emotional representations themselves are mixed, which can lead to deeper and more expressive features.
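The difference between the two levels can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_encoder` is a hypothetical stand-in for the lower layers of a network, the inputs are short lists of numbers rather than real audio features, and `alpha = 0.4` is an assumed value.

```python
import random

def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def toy_encoder(frames):
    """Hypothetical stand-in for lower network layers:
    ReLU applied to the differences between adjacent frames."""
    return [max(0.0, frames[i + 1] - frames[i]) for i in range(len(frames) - 1)]

def raw_level_mixup(x_a, x_b, lam):
    """Mix the inputs first, then encode the mixed input."""
    return toy_encoder(mix(x_a, x_b, lam))

def latent_level_mixup(x_a, x_b, lam):
    """Encode each input separately, then mix the intermediate activations."""
    return mix(toy_encoder(x_a), toy_encoder(x_b), lam)

rng = random.Random(0)
alpha = 0.4                          # assumed hyperparameter value
lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]

x_a, y_a = [0.0, 1.0, 0.0, 2.0], [1.0, 0.0]  # toy sample with one-hot label
x_b, y_b = [2.0, 0.0, 1.0, 0.0], [0.0, 1.0]

raw = raw_level_mixup(x_a, x_b, lam)
latent = latent_level_mixup(x_a, x_b, lam)
y_mix = mix(y_a, y_b, lam)  # labels are mixed with the same coefficient
```

Because the encoder is nonlinear, the two functions generally return different features for the same pair of inputs, which is one way to see why the two levels can complement each other.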

Combining both Mixup types encourages the learned emotional representations to be consistent and to generalize across different datasets. This consistency helps models avoid relying too heavily on specific features that may differ between training data and real-world use.

Training the Emotion Extractor

To learn effective emotional representations, we train a model called the Emotion Extractor. The training process involves using both raw and latent-level Mixup techniques to create new training samples and gain valuable emotional features from the speech data.

The Emotion Extractor processes speech samples to derive emotional representations. These representations can then be utilized in both SER and Emotional TTS tasks. During training, the model updates its understanding based on the emotional labels associated with the samples. This helps the model learn to differentiate between various emotions effectively.
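When a training sample is a Mixup blend of two utterances, the label-based update is commonly computed as a weighted sum of the losses for both original labels. The sketch below shows that standard Mixup objective with a softmax classifier; the source does not spell out the exact loss used for the Emotion Extractor, so the function names and the four-class setup are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the integer class label."""
    return -math.log(probs[label])

def mixup_loss(logits, label_a, label_b, lam):
    """Loss for a sample mixed from two utterances with labels
    label_a (weight lam) and label_b (weight 1 - lam)."""
    probs = softmax(logits)
    return (lam * cross_entropy(probs, label_a)
            + (1.0 - lam) * cross_entropy(probs, label_b))

# Hypothetical scores for four emotion classes on one mixed sample.
logits = [2.0, 0.5, -1.0, 0.0]
loss = mixup_loss(logits, label_a=0, label_b=1, lam=0.7)
```

With `lam = 1.0` the expression reduces to ordinary cross-entropy on the first label, so unmixed samples are a special case of the same formula.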

Using the Emotion Extractor for SER and TTS

For the SER task, the Emotion Extractor can be applied directly to detect emotions in speech. We modify a widely used deep learning model called VGG19 for this purpose. By adapting VGG19, we can extract features from the speech input while focusing on emotional content.

In the Emotional TTS task, we utilize a model called RFTacotron, which transforms text into speech using the emotional representations learned from the Emotion Extractor. The architecture of the Emotion Extractor aligns with the needs of the TTS model, allowing for a seamless integration of emotional features during speech synthesis.

Training Process and Data Used

Training involves using specific datasets for both the SER and TTS tasks. For SER, we work with datasets that contain emotional speech samples as well as neutral samples. By artificially reducing the number of emotional samples, we can simulate the common imbalances found in real-world data.
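One way to simulate this kind of imbalance is to keep every neutral sample and only a fixed number of samples from each emotional class. The sketch below assumes a dataset stored as `(utterance_id, label)` pairs; the label names, the helper `make_imbalanced`, and the value `keep_per_class=10` are illustrative choices, not settings taken from the paper.

```python
import random
from collections import Counter, defaultdict

def make_imbalanced(samples, keep_per_class, majority="neutral", seed=0):
    """Keep all majority-class samples and at most `keep_per_class`
    randomly chosen samples from every other class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    subset = []
    for label, items in by_label.items():
        if label == majority or len(items) <= keep_per_class:
            subset.extend(items)
        else:
            subset.extend(rng.sample(items, keep_per_class))
    return subset

# Toy balanced dataset: 100 utterances for each of four labels.
labels = ("neutral", "happy", "sad", "angry")
dataset = [(f"utt_{label}_{i}", label) for label in labels for i in range(100)]

imbalanced = make_imbalanced(dataset, keep_per_class=10)
counts = Counter(label for _, label in imbalanced)
```

Fixing the random seed keeps the reduced subset reproducible, which matters when several models are compared on the same imbalanced split.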

For the Emotional TTS task, we select a dataset specifically designed for generating emotional speech. Similar to the SER datasets, we only retain a limited number of emotional samples per class to replicate the data imbalance challenge.

Speech samples are preprocessed by resampling them to a consistent sampling rate. Acoustic features are then extracted from the resampled audio for use during training.
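To make the resampling step concrete, here is a minimal linear-interpolation resampler in plain Python. It is only a sketch: the source does not state the target sampling rate or the tool used, and real pipelines normally rely on an audio library that applies an anti-aliasing filter, which this example omits.

```python
def resample_linear(signal, orig_sr, target_sr):
    """Resample a 1-D signal from orig_sr to target_sr by linear
    interpolation between neighboring samples (no anti-aliasing)."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if not signal:
        return []
    n_out = max(1, round(len(signal) * target_sr / orig_sr))
    out = []
    for k in range(n_out):
        pos = k * orig_sr / target_sr  # position on the original sample grid
        i = int(pos)
        if i >= len(signal) - 1:
            out.append(float(signal[-1]))  # hold the last sample at the edge
        else:
            frac = pos - i
            out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return out

# Toy example: a four-sample ramp at 4 Hz upsampled to 8 Hz.
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], orig_sr=4, target_sr=8)
```

Bringing all recordings to one rate means that a fixed number of samples always covers the same duration, so features extracted from different corpora are comparable.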

Results and Findings

After training, we conduct experiments to evaluate the performance of our models on both the SER and TTS tasks. We use multiple datasets to ensure reliable results and validate the effectiveness of our proposed approach.

For the SER task, the proposed Emotion Extractor surpasses the state-of-the-art baseline on three imbalanced datasets. The emotional representations it extracts support accurate emotion detection even when emotional classes have few training examples.

In the TTS task, we find that our model synthesizes more expressive speech. The emotional representations contribute positively to the quality of the generated speech, making it sound more natural and emotionally rich when compared to traditional models.

Conclusion

In summary, we present a new method for extracting emotional representations from imbalanced speech data. By combining different augmentation techniques, we enhance the performance of both Speech Emotion Recognition and Emotional Text-to-Speech models. Our experimental results show that this approach leads to more robust and effective emotional representations, enabling models to perform better even when training data is limited.

Original Source

Title: Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech

Abstract: Effective speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks. However, emotional speech samples are more difficult and expensive to acquire compared with Neutral style speech, which causes one issue that most related works unfortunately neglect: imbalanced datasets. Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations. In this paper, we propose an Emotion Extractor to address this issue. We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets. Our empirical results show that (1) for the SER task, the proposed Emotion Extractor surpasses the state-of-the-art baseline on three imbalanced datasets; (2) the produced representations from our Emotion Extractor benefit the TTS model, and enable it to synthesize more expressive speech.

Authors: Shijun Wang, Jón Guðnason, Damian Borth

Last Update: 2023-06-09 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2306.05709

Source PDF: https://arxiv.org/pdf/2306.05709

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
