Advancements in Real-Time Speech Translation Systems
A new system for accurate and fast speech translation across multiple languages.
This article discusses a new system for translating spoken language in real time. It covers speech-to-text translation from English into German, Japanese, and Chinese, and speech-to-speech translation from English into Japanese. The system pairs a multilingual end-to-end speech-to-text model with an incremental text-to-speech module, aiming to improve translation quality while keeping delays low, which is crucial for real-time communication.
Simultaneous Translation
Simultaneous translation means that the translation is produced while the speaker is still talking. The system has to start translating before it has heard the whole sentence, so it must handle partial speech accurately and still produce translations that sound natural. Traditional cascaded methods ran speech recognition and text translation as separate steps, which added delay and let recognition errors carry over into the translation.
More recent end-to-end systems map speech to translated text with a single model rather than multiple steps. Removing the intermediate transcription step lets them process speech more efficiently and respond with less delay.
Model Development
For this project, we built a multilingual end-to-end translation model from two pre-trained models: HuBERT, which encodes spoken language, and mBART, a multilingual text model that generates the translated text. By merging these two models, we aim to create a more efficient translation system.
We trained our model using two approaches to decoding. The first is called Local Agreement (LA), which focuses on finding stable translation outputs. The second is AlignAtt, which uses attention mechanisms to align spoken words with their translations.
Decoding Policies
Local Agreement (LA)
The LA method processes the input speech in chunks and produces a full translation hypothesis after each one. It then takes the longest common prefix of the hypotheses from consecutive chunks. Only that agreed prefix is output, because a translation that stays the same as more speech arrives is deemed more reliable. The rest of the hypothesis is held back until later chunks confirm it.
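The prefix check can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, tokens are plain strings, and it assumes each hypothesis was decoded with the already committed prefix forced, so a new hypothesis always extends what has been emitted.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest common prefix of two token lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def local_agreement(hypotheses: List[List[str]]) -> List[List[str]]:
    """For each chunk step, return the tokens newly committed at that step.

    hypotheses[i] is the full translation hypothesis after speech chunk i.
    Assumption: every hypothesis starts with the tokens committed so far.
    """
    committed: List[str] = []
    emitted_per_step: List[List[str]] = []
    previous = None
    for hyp in hypotheses:
        new_tokens: List[str] = []
        if previous is not None:
            stable = common_prefix(previous, hyp)
            if len(stable) > len(committed):
                # Only the part beyond what was already output is new.
                new_tokens = stable[len(committed):]
                committed = committed + new_tokens
        emitted_per_step.append(new_tokens)
        previous = hyp
    return emitted_per_step


steps = local_agreement([
    ["Ich"],
    ["Ich", "bin"],
    ["Ich", "bin", "hier"],
    ["Ich", "bin", "heute", "hier"],
])
print(steps)
```

In this toy run nothing is emitted at the first step because there is no earlier hypothesis to agree with, and nothing at the last step because the two most recent hypotheses diverge after "bin".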
AlignAtt
AlignAtt uses the model's attention weights to find which part of the spoken input each translated word depends on. If a word in the translation aligns with speech that has already been received in full, the system outputs it. If the word instead aligns with the most recent part of the input, where more context may still arrive, the system waits for more speech before producing it. This method can help reduce latency, which is the delay between the spoken input and the produced translation.
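The decision rule can be sketched as follows. This assumes the common formulation of AlignAtt, in which a token is held back when its attention peaks within the last few received audio frames. The function name, the `num_recent_frames` parameter, and the toy attention values are illustrative assumptions, not taken from the system described here.

```python
from typing import List


def alignatt_emit_count(attention: List[List[float]], num_recent_frames: int) -> int:
    """Return how many leading hypothesis tokens may be emitted.

    attention[i][j] is the cross-attention weight of target token i on
    audio frame j. Emission stops at the first token whose most attended
    frame lies within the last `num_recent_frames` frames.
    """
    if not attention:
        return 0
    num_frames = len(attention[0])
    first_recent_frame = num_frames - num_recent_frames
    emitted = 0
    for weights in attention:
        most_attended = max(range(num_frames), key=lambda j: weights[j])
        if most_attended >= first_recent_frame:
            break  # token depends on audio that may still change; wait
        emitted += 1
    return emitted


# Toy attention matrix: 3 target tokens over 6 audio frames.
toy_attention = [
    [0.70, 0.20, 0.05, 0.05, 0.00, 0.00],  # peaks at frame 0
    [0.10, 0.10, 0.60, 0.10, 0.10, 0.00],  # peaks at frame 2
    [0.00, 0.00, 0.10, 0.10, 0.30, 0.50],  # peaks at frame 5
]
print(alignatt_emit_count(toy_attention, num_recent_frames=2))
print(alignatt_emit_count(toy_attention, num_recent_frames=4))
```

A larger `num_recent_frames` makes the policy more cautious: fewer tokens are emitted per step, which raises latency but leaves less room for tokens based on incomplete audio.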
Speech-to-text Translation
Our speech-to-text system works by translating spoken language into written text. We used a combination of pre-trained models that have been developed through previous research. These models require a lot of training data, which we sourced from existing bilingual speech translation datasets.
The models are designed to handle multiple languages, making the system versatile. We also implemented a method called Inter-connection, which links the pre-trained speech part and the pre-trained text part of the model so that information passes between them effectively.
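One way to picture such a connection is a learned weighted sum over the outputs of every speech-encoder layer, with the mixed result passed to the text decoder. The sketch below assumes that reading. The softmax normalization, the function names, and the toy values are assumptions made for illustration and are not confirmed by this article.

```python
import math
from typing import List

Matrix = List[List[float]]  # frames x feature dimensions


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs: List[Matrix], layer_logits: List[float]) -> Matrix:
    """Mix per-layer encoder outputs with normalized layer weights.

    layer_outputs[k] is the output of encoder layer k (same shape for all k).
    layer_logits[k] is a (hypothetically learned) scalar weight for layer k.
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for weight, output in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += weight * output[t][d]
    return mixed


# Two toy layers, two frames, two feature dimensions.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
mixed = aggregate_layers(layers, layer_logits=[0.0, 0.0])
print(mixed)
```

With equal logits the result is a plain average of the layers. During training the logits would shift so that the decoder receives more of whichever layers prove most useful.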
Speech-to-speech Translation
The translation from speech to speech is done as a cascade of two main steps: first, the speech-to-text model described above translates the spoken English into Japanese text, and then an incremental text-to-speech (TTS) system turns that text into spoken output.
The TTS system is made up of three modules: a phoneme estimation model, a parallel acoustic model, and a Parallel WaveGAN vocoder. First, the estimation model predicts the sounds of words (phonemes) and symbols that indicate speech features such as pitch and rhythm. Then, the acoustic model and vocoder generate the speech audio based on these predictions.
Improvements in TTS
In our previous work, the TTS output sounded less natural, both because of quality issues in the synthesized speech and because of mistakes made during the speech recognition phase. We have upgraded our TTS system with a new architecture that improves how phonemes and speech features are predicted.
The updated estimation model is built on the Transformer architecture and uses the AlignAtt policy, a combination that produced more natural-sounding speech.
Experimental Setup
Data Sources
We trained our translation models using several existing datasets. For speech-to-text, the data consisted of spoken English paired with translations into German, Japanese, and Chinese. Training on all three language pairs lets a single model learn each target language and its nuances.
For the TTS system, we used a specific Japanese speech dataset that provides enough material for the model to learn the sounds and rhythms specific to the Japanese language.
Training Process
The training process involves providing the model with a lot of examples so it can learn how to respond appropriately. We adopted various strategies to ensure that the model could handle different scenarios effectively.
During training, we made adjustments to the model settings to find the best balance between quality (how good the translations are) and latency (how fast the translations occur).
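Such a search for a balance can be framed as picking the highest-quality setting whose latency stays within a budget. The sketch below is only an illustration of that selection step: the setting names, scores, and budget are placeholder values, not results or settings from this work.

```python
from typing import List, NamedTuple, Optional


class Setting(NamedTuple):
    name: str
    quality: float  # higher is better (for example, a translation quality score)
    latency: float  # lower is better (for example, delay in seconds)


def best_under_budget(settings: List[Setting], latency_budget: float) -> Optional[Setting]:
    """Return the highest-quality setting whose latency fits the budget."""
    eligible = [s for s in settings if s.latency <= latency_budget]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s.quality)


# Placeholder values for illustration only.
candidates = [
    Setting("small_chunk", quality=1.0, latency=1.0),
    Setting("medium_chunk", quality=2.0, latency=2.0),
    Setting("large_chunk", quality=3.0, latency=3.0),
]
chosen = best_under_budget(candidates, latency_budget=2.5)
print(chosen)
```

Here the largest chunk scores best on quality but exceeds the budget, so the medium setting is selected.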
Results
After training, we evaluated the translation systems to see how well they performed. We looked at several metrics, including translation accuracy and the time it took to produce translations.
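The article does not name the latency metric used. One widely used choice for simultaneous translation is Average Lagging, which measures how far the output trails an ideal translator that keeps pace with the speaker. The sketch below assumes that metric, with a hypothetical function name and toy input.

```python
from typing import List


def average_lagging(delays: List[float], source_length: float) -> float:
    """Compute Average Lagging for one sentence.

    delays[t] is the amount of source (words, or seconds of audio) that had
    been read when target token t was emitted. source_length is the total
    source amount in the same unit.
    """
    if not delays or source_length <= 0:
        raise ValueError("need at least one target token and a positive source length")
    gamma = len(delays) / source_length  # target tokens per unit of source
    total = 0.0
    tau = 0
    for t, delay in enumerate(delays, start=1):
        total += delay - (t - 1) / gamma
        tau = t
        if delay >= source_length:
            break  # the full source has been read; later tokens are not counted
    return total / tau


# A policy that waits for 3 source words, then emits one token per new word.
wait_3_delays = [3, 4, 5, 6, 6, 6]
print(average_lagging(wait_3_delays, source_length=6))
```

For this toy policy the lag is a constant three words behind the speaker, which is what the metric reports.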
Speech-to-Text Performance
In our testing, models using the LA approach generally produced better translation quality compared to those using AlignAtt. However, the AlignAtt model showed better results in situations where low latency was crucial.
Speech-to-Speech Translation Performance
For speech-to-speech translation, our updates led to improvements in how the synthesized speech sounded. The new TTS system produced more natural results, contributing positively to overall translation quality.
Quality vs. Latency
A significant consideration in simultaneous translation is the trade-off between quality and latency. Higher quality translations often require more processing time, which can lead to delays.
In our findings, we noted that the LA policy, while more accurate, could cause longer wait times under certain conditions. In contrast, AlignAtt could reduce delays but sometimes produced less reliable translations.
Our results highlighted the need for continuous improvement in both quality and speed across different translation modes.
Future Work
Moving forward, we plan to explore additional methods and enhancements to further improve our translation systems. This will include refining our approach to producing more stable prefixes for TTS and testing different model architectures.
We also aim to expand the system’s capabilities to include more languages and dialects to reach a broader audience.
Conclusion
In summary, this article presents an overview of a new system designed for real-time speech translation. On the speech-to-text side, the LA policy gave better translation quality than AlignAtt and was used in the submitted models. On the speech-to-speech side, the upgraded TTS module improved how natural the output sounded and contributed to overall system performance. The findings suggest that by balancing quality and latency, we can create more efficient systems that cater to the needs of users in real-time scenarios.
As we continue to refine our technology, there is promise for even greater improvements in the future, enhancing how people communicate across language barriers.
Title: NAIST Simultaneous Speech Translation System for IWSLT 2024
Abstract: This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
Authors: Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura
Last Update: 2024-06-30 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2407.00826
Source PDF: https://arxiv.org/pdf/2407.00826
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.