Advancements in Speech Emotion Conversion Technology
A new approach to changing the emotion of speech recorded in real-world, in-the-wild conditions.
Speech emotion conversion is the process of changing the emotion expressed in spoken words while keeping the original meaning and the speaker's identity intact. This technology is important for creating more natural interactions between humans and machines, especially in areas where emotional expression is crucial. Yet, achieving realistic emotional speech generation remains a challenge.
This article focuses on speech emotion conversion in real-world, in-the-wild conditions where no parallel data is available for reference. In simpler terms, we look at how to change the emotion in speech without a matching example of the same sentence spoken with the target emotion. This makes the problem harder, because the system must first separate the speech into distinct parts: what emotion is being expressed, who is speaking, and what is being said.
Methodology
In this approach, we use self-supervised networks to break speech down into its parts: the words, the speaker's voice, and the emotional tone. After separating these elements, we use a neural vocoder called HiFiGAN to recombine them into a new piece of speech that reflects the desired emotion.
To manage how strongly the new emotion comes through, we focus on a continuous dimension of emotion called arousal. This refers to how excited or calm a person sounds, rather than labeling the emotion as happy or sad. Working with a continuous value rather than categories lets us control more precisely how intense the emotion sounds in the final speech output.
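To make this concrete, here is a minimal sketch (not the authors' code) of how a single arousal value could serve as a continuous control knob. The 1-to-7 rating range matches the MSP-Podcast annotations mentioned later; the normalization and blending functions are illustrative assumptions.

```python
# Minimal sketch: treating arousal as a continuous control knob.
# The 1-7 scale follows the MSP-Podcast annotation range mentioned in the paper;
# the normalization and blending scheme here are illustrative assumptions.

def normalize_arousal(arousal: float, lo: float = 1.0, hi: float = 7.0) -> float:
    """Map an arousal rating on [lo, hi] to a conditioning value in [0, 1]."""
    return (arousal - lo) / (hi - lo)

def blend_arousal(source: float, target: float, strength: float = 1.0) -> float:
    """Interpolate between source and target arousal to control emotion intensity."""
    return (1.0 - strength) * source + strength * target

# Example: push a calm utterance (arousal 2) toward high excitement (arousal 6),
# but only apply 70% of the shift.
src, tgt = normalize_arousal(2.0), normalize_arousal(6.0)
conditioning_value = blend_arousal(src, tgt, strength=0.7)
print(f"conditioning arousal = {conditioning_value:.2f}")
```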
Training the System
The training process uses MSP-Podcast, a large dataset of spoken podcasts where emotions are labeled. We specifically concentrate on how aroused or calm the speech sounds rather than on emotion categories. This focus on a continuous scale allows us to handle emotion intensity more effectively.
To train our system, we begin with the audio of spoken words, which we break down into its components. We use different types of encoders for this process.
- Lexical Encoder: This part handles the words being spoken. It takes the raw audio and processes it to extract the phonetic details.
- Speaker Encoder: This part identifies who is speaking by analyzing the qualities of their voice.
- Emotion Encoder: Instead of relying on categorical labels, this part works with a continuous score that indicates how aroused the speech sounds.
After processing with these encoders, we have distinct representations for the words, the speaker's voice, and the emotional tone.
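To give a feel for what these three representations look like in practice, the toy PyTorch sketch below defines stand-in encoders with the right interfaces. These tiny modules are assumptions for illustration only; the actual system relies on pretrained self-supervised networks rather than layers this small.

```python
import torch
import torch.nn as nn

class LexicalEncoder(nn.Module):
    """Stand-in for a content encoder: waveform -> frame-level features."""
    def __init__(self, dim: int = 128):
        super().__init__()
        # Strided 1-D convolutions roughly mimic the downsampling of SSL front-ends.
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
    def forward(self, wav):                      # wav: (batch, samples)
        return self.net(wav.unsqueeze(1))        # (batch, dim, frames)

class SpeakerEncoder(nn.Module):
    """Stand-in speaker encoder: waveform -> one utterance-level embedding."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=160)
    def forward(self, wav):
        h = self.conv(wav.unsqueeze(1))          # (batch, dim, frames)
        return h.mean(dim=-1)                    # (batch, dim), time-averaged

class EmotionEncoder(nn.Module):
    """Maps a continuous arousal score in [0, 1] to an embedding vector."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(1, dim)
    def forward(self, arousal):                  # arousal: (batch, 1)
        return self.proj(arousal)                # (batch, dim)

wav = torch.randn(2, 16000)                      # two 1-second utterances at 16 kHz
arousal = torch.tensor([[0.3], [0.8]])
content = LexicalEncoder()(wav)
speaker = SpeakerEncoder()(wav)
emotion = EmotionEncoder()(arousal)
print(content.shape, speaker.shape, emotion.shape)
```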
Next, we feed these representations into the HiFiGAN vocoder, a neural network that generates high-quality speech. It uses the separate components to create a new audio output that reflects the desired emotional tone while retaining the original words and the speaker's voice.
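The sketch below continues the toy example above and shows one common conditioning pattern, assumed here for illustration: the utterance-level speaker and emotion embeddings are repeated over time, stacked with the frame-level lexical features, and fed to a small upsampling generator standing in for HiFiGAN. The real HiFiGAN is far larger and trained adversarially; only the wiring is meant to be instructive.

```python
import torch
import torch.nn as nn

class TinyVocoder(nn.Module):
    """HiFiGAN-style stand-in: frame-level conditioning -> waveform via upsampling."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(in_dim, hidden, kernel_size=8, stride=4), nn.ReLU(),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=7, padding=3), nn.Tanh(),
        )
    def forward(self, features):                 # (batch, in_dim, frames)
        return self.net(features).squeeze(1)     # (batch, samples)

def condition(content, speaker, emotion):
    """Broadcast utterance-level embeddings over time and stack with content frames."""
    frames = content.size(-1)
    spk = speaker.unsqueeze(-1).expand(-1, -1, frames)
    emo = emotion.unsqueeze(-1).expand(-1, -1, frames)
    return torch.cat([content, spk, emo], dim=1)

# Dummy features matching the shapes produced by the encoder sketch above.
content = torch.randn(2, 128, 798)
speaker = torch.randn(2, 64)
emotion = torch.randn(2, 16)

features = condition(content, speaker, emotion)  # (2, 128 + 64 + 16, 798)
wav_out = TinyVocoder(in_dim=128 + 64 + 16)(features)
print(wav_out.shape)                             # roughly back to waveform length
```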
Challenges with Data
Most datasets used for training speech emotion conversion systems are created in controlled environments where actors read lines with specific emotions. These can be very different from spontaneous speech, which is messier and more complex.
In real-world situations, it is not always feasible to collect parallel datasets where each spoken line has a matching emotional counterpart. This is why we focus on non-parallel data. Models that can work with such data are more flexible since they don’t rely on exact emotion pairs.
However, non-parallel data also presents challenges. We need to ensure that the system can still separate and reassemble the emotional, lexical, and speaker components without having a direct example to work from.
Self-Supervised Learning
To tackle the challenges of working with non-parallel data, we use a method called self-supervised learning (SSL). This technique utilizes large amounts of unlabeled data to improve the training process. By doing this, we can better understand the various speech elements involved in emotion expression and improve the quality of the generated speech.
SSL has proven to be effective in related tasks, such as recognizing emotions in speech and converting voices from one speaker to another. By applying this method, we hope to gain better insight into how to separate and reconstruct the elements of speech.
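As one concrete, assumed example of this idea, the sketch below pulls frame-level features from a pretrained HuBERT model shipped with torchaudio. The specific checkpoint is our choice for illustration; the summary above does not commit to it.

```python
import torch
import torchaudio

# Load a pretrained self-supervised model (HuBERT base is an assumption here;
# the summary does not tie the method to this particular checkpoint).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

# A dummy 1-second utterance at the model's expected sample rate.
wav = torch.randn(1, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer,
    # each with shape (batch, frames, feature_dim).
    features, _ = model.extract_features(wav)

print(len(features), features[-1].shape)  # e.g. 12 layers of (1, ~49, 768)
```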
Resynthesis Process
Once we have our separate components, the next step is to recombine them into a natural-sounding speech output. This is where the HiFiGAN plays a crucial role. It takes the separate parts and generates high-quality audio that reflects the intended emotion.
The HiFiGAN uses a combination of techniques to ensure that the final output sounds realistic. This includes adjusting the pitch and other vocal qualities to match the emotional tone we are aiming for.
In our training, we also assess how well the system generates emotionally expressive speech. We look at how closely the generated speech matches the intended emotional content and how natural the voice sounds.
Testing and Validation
To measure the effectiveness of our approach, we conduct tests on the MSP-Podcast dataset, which contains podcast audio labeled for arousal, valence, and dominance. For our study, however, we focus primarily on arousal.
During testing, we assess both how well the emotional content is converted and how natural the output sounds. We compare different versions of our model to see which combination of components produces the best results.
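A simple way to score the emotional side of the conversion, sketched below, is to compare the arousal predicted from the converted audio with the requested target value using mean-squared error. The `predict_arousal` function here is a hypothetical placeholder; in practice it would be a pretrained speech emotion recognition model.

```python
import torch

def arousal_mse(predicted: torch.Tensor, target: torch.Tensor) -> float:
    """Mean-squared error between predicted and requested arousal (both in [0, 1])."""
    return torch.mean((predicted - target) ** 2).item()

# Hypothetical placeholder: in practice this would be a pretrained speech emotion
# recognition model applied to the converted audio.
def predict_arousal(wav_batch: torch.Tensor) -> torch.Tensor:
    return torch.full((wav_batch.size(0),), 0.55)   # dummy constant prediction

converted = torch.randn(4, 16000)                   # four converted utterances
target = torch.tensor([0.2, 0.5, 0.7, 0.9])         # requested arousal values
print(f"arousal MSE: {arousal_mse(predict_arousal(converted), target):.3f}")
```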
We find that using a combination of all representations (words, speaker identity, and emotion) leads to the most natural-sounding speech. This suggests that conditioning the HiFiGAN on all three aspects improves both the emotional expression and the realism of the output.
Results and Observations
When we analyze the results of our experiments, we see several trends. First, we observe that longer speech segments tend to result in better emotion conversion, likely because they provide more context for the model to work with.
Moreover, we discover that our method performs better for mid-scale arousal levels (roughly 2 to 6 on a 1-to-7 scale) than for the extremes (1 and 7). This means that while the system can convert emotions effectively, it is more successful when the target emotion is not at the ends of the scale.
In addition to quantitative assessments such as mean-squared error and naturalness scores, we also conduct qualitative analyses. By listening to audio samples and inspecting spectrograms, we can see how well the emotional tones are represented.
For example, when synthesizing speech with high arousal, we notice that the pitch tends to be higher and more variable than in lower-arousal speech. This aligns with existing findings that people tend to speak with a higher, more variable pitch when excited or emotionally aroused.
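A rough version of this check can be done with an off-the-shelf pitch tracker. The sketch below uses torchaudio's autocorrelation-based pitch detector to compare the mean and variability of F0 between two recordings; it is an illustration, not the authors' analysis pipeline, and the waveforms here are dummy placeholders.

```python
import torch
import torchaudio.functional as F

def pitch_stats(wav: torch.Tensor, sample_rate: int = 16000):
    """Return mean and standard deviation of the estimated F0 contour (in Hz)."""
    f0 = F.detect_pitch_frequency(wav, sample_rate)   # (..., frames)
    return f0.mean().item(), f0.std().item()

# Dummy stand-ins for a low-arousal and a high-arousal synthesis of the same text;
# in practice these would be the converted waveforms.
low_arousal_wav = torch.randn(1, 16000)
high_arousal_wav = torch.randn(1, 16000)

for name, wav in [("low arousal", low_arousal_wav), ("high arousal", high_arousal_wav)]:
    mean_f0, std_f0 = pitch_stats(wav)
    print(f"{name}: mean F0 = {mean_f0:.1f} Hz, F0 std = {std_f0:.1f} Hz")
```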
Conclusion
In summary, our work highlights the potential of speech emotion conversion within real-world contexts. By focusing on separating the emotional, lexical, and speaker components of speech, we can generate more dynamic and realistic emotional expressions through synthesized speech.
The results indicate that our methodology improves the naturalness and emotional accuracy of the output. Achieving this without parallel data, especially on in-the-wild recordings, is a significant step forward.
As technology continues to advance, the applications of speech emotion conversion will likely expand, paving the way for more emotionally aware human-machine interactions. The findings from this research can serve as a foundation for future studies aiming to refine and enhance the emotional expressiveness of synthesized speech.
Title: In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis
Abstract: Speech emotion conversion aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical information and the speaker's identity. In this work, we specifically focus on in-the-wild emotion conversion where parallel data does not exist, and the problem of disentangling lexical, speaker, and emotion information arises. In this paper, we introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance, and subsequently uses a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion. For better representation and to achieve emotion intensity control, we specifically focus on the arousal dimension of continuous representations, as opposed to performing emotion conversion on categorical representations. We test our methodology on the large in-the-wild MSP-Podcast dataset. Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion. Results further reveal that the methodology better synthesises speech for mid-scale arousal (2 to 6) than for extreme arousal (1 and 7).
Authors: Navin Raj Prabhu, Nale Lehmann-Willenbrock, Timo Gerkmann
Last Update: 2023-06-02
Language: English
Source URL: https://arxiv.org/abs/2306.01916
Source PDF: https://arxiv.org/pdf/2306.01916
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.