Simple Science

Cutting edge science explained simply

Improving Chinese Speech Recognition Through Pinyin Regularization

This study presents a dataset and method to enhance Chinese ASR accuracy using Pinyin.

― 7 min read


Figure: Pinyin for better Chinese ASR. Using Pinyin boosts error correction in Chinese speech recognition systems.

Automatic speech recognition (ASR) systems are widely used in applications such as voice search, voice commands, and transcription services. However, their performance can degrade under background noise, varied speaker accents, and poor audio quality. When ASR outputs are incorrect, especially in these difficult conditions, the applications that depend on them suffer, which makes effective error correction important.

Recently, large language models (LLMs) have shown promise for error correction in speech recognition. Much of the research in this area has focused on English, so this paper shifts attention to Chinese. A new benchmark dataset has been created specifically for error correction in Chinese ASR: the Chinese Hypotheses Paradise dataset (ChineseHP), which contains 724,000 hypothesis-transcription pairs. It covers a wide range of scenarios and presents a significant challenge for error correction methods.

The dataset is built from the ASR outputs of a Chinese-adapted, distilled version of Whisper, a well-known model in this field. ChineseHP includes different types of spoken content, such as reading speech, internet audio, broadcast news, and meetings, along with various accents and dialects, so that the dataset is representative of real-world situations.

A key challenge in recognizing Chinese speech is that Chinese is written with a logographic script: a character's pronunciation cannot be deduced from its written form. Pinyin is a system that uses Roman letters to represent the sounds of Chinese characters. It is widely used in China for teaching the language and is a common input method for typing Chinese characters on devices. Pinyin is useful to LLMs because it makes Chinese pronunciations explicit.

The Chinese language contains many homophones: different characters that share the same pronunciation. For example, the characters 桌 ("desk") and 捉 ("catch") are both pronounced zhuō. This can confuse ASR systems and lead to errors. However, a Pinyin transcription derived from the text hypothesis often has a lower error rate than the text itself, which makes it useful for correcting mistakes.
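
As a quick illustration, the homophone can be checked with the `pypinyin` library; this is a minimal sketch for illustration, not part of the paper's tooling:

```python
from pypinyin import lazy_pinyin, Style

# 桌 (desk/table) and 捉 (to catch) are homophones: both read "zhuo1",
# i.e. zhuo with the first tone, although the characters are unrelated.
for char in ["桌", "捉"]:
    print(char, lazy_pinyin(char, style=Style.TONE3))
# 桌 ['zhuo1']
# 捉 ['zhuo1']
```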

To take advantage of this, a method called Pinyin regularization is proposed: Pinyin transcriptions derived directly from the textual hypotheses are included both in the prompts given to LLMs and in the data used for fine-tuning. Experiments show that Pinyin regularization consistently improves the ability of LLMs to correct errors in Chinese speech.
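
A minimal sketch of the idea, assuming a Pinyin converter such as `pypinyin` (the prompt layout here is an illustrative assumption, not the paper's template):

```python
from pypinyin import lazy_pinyin, Style

def pinyin_regularize(hypothesis: str) -> str:
    """Attach a Pinyin transcription derived from the text hypothesis."""
    py = " ".join(lazy_pinyin(hypothesis, style=Style.TONE3))
    return f"Hypothesis: {hypothesis}\nPinyin: {py}"

print(pinyin_regularize("今天天气很好"))  # "The weather is nice today"
# Hypothesis: 今天天气很好
# Pinyin: jin1 tian1 tian1 qi4 hen3 hao3
```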

The article is divided into several sections. The first section introduces the Chinese Hypotheses Paradise dataset. The next section explains the Pinyin regularization method. The following part outlines the experimental setup and findings, leading to a conclusion at the end.

Chinese Hypotheses Paradise Dataset

The ChineseHP dataset pairs a large number of recognized utterances with their reference transcriptions. It was created using the outputs of a Chinese-focused distilled version of Whisper called Belle-distilwhisper-large-v2-zh. Several corpora were used to compile the dataset, including Aishell-1, Wenetspeech, Aishell-4, and Kespeech, so that it represents a range of speaking situations.

Aishell-1 consists of standard reading speech, while Wenetspeech brings in content from different areas of the internet, including test sections for broadcast news and meetings. Aishell-4 focuses on meeting speech recorded in conference rooms, and Kespeech emphasizes Mandarin subdialects and accents. Since Wenetspeech and Kespeech contain far more data than Aishell-1 and Aishell-4, 200,000 utterances were sampled from each of them to keep the dataset balanced.

Hypotheses were generated with ASR beam search decoding, which produced the top 10 hypotheses for each audio sample; these were then paired with the reference transcriptions. The statistics of the dataset reflect its diversity, covering reading speech, broadcast news, meetings, and varied accents.
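
With the Hugging Face `transformers` library, this kind of N-best decoding can be sketched as follows; the Hub identifier and audio handling are assumptions rather than the paper's exact pipeline:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Hypothetical Hub ID; the article names the model "Belle-distilwhisper-large-v2-zh".
MODEL_ID = "BELLE-2/Belle-distilwhisper-large-v2-zh"
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

def n_best_hypotheses(audio, sampling_rate=16000, n=10):
    """Return the top-n beam search hypotheses for a single utterance."""
    features = processor(audio, sampling_rate=sampling_rate,
                         return_tensors="pt").input_features
    with torch.no_grad():
        ids = model.generate(features, num_beams=n, num_return_sequences=n)
    return processor.batch_decode(ids, skip_special_tokens=True)
```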

Pinyin Regularization

Pinyin, or Hanyu Pinyin, is a popular Romanization system for Mandarin Chinese. It uses 23 initials, 24 finals, and 5 tones, including the neutral tone, to represent speech sounds. Some initials and finals may vary slightly between different systems, but basic rules remain the same. In this study, a specific version of Pinyin is employed that uses "ü" instead of "v" and "en" instead of "n" for some finals, as these forms are more common in China.

The sound of a Chinese character is formed by combining an initial and a final. For instance, the character "你" is pronounced "ni3," with "n" as the initial, "i" as the final, and "3" marking the third tone. There are also homophones, where different characters sound the same, and heteronyms, where the same character has different pronunciations depending on context.
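
This decomposition can be reproduced with `pypinyin`, which exposes initials and finals as separate styles (a sketch for illustration):

```python
from pypinyin import pinyin, Style

char = "你"
print(pinyin(char, style=Style.TONE3))         # [['ni3']]  full syllable with tone
print(pinyin(char, style=Style.INITIALS))      # [['n']]    initial only
print(pinyin(char, style=Style.FINALS_TONE3))  # [['i3']]   final plus tone number
```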

These factors can confuse ASR systems, especially in noisy environments where accents or dialects might alter the expected output. While a character may be misrecognized, the corresponding Pinyin from the text hypothesis is often accurate, leading to lower errors in this representation. This makes Pinyin valuable for error correction.

Pinyin-Regularized Prompts

For the experiments, two types of prompts were developed: one for directly prompting pre-trained LLMs and another for fine-tuning them. The first prompt type includes both the text hypotheses and the corresponding Pinyin. To make the output easier to parse, the model is instructed to respond in JSON format.
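
A hypothetical rendering of such a prompt is sketched below; the wording and JSON schema are assumptions for illustration, not the paper's exact template:

```python
from pypinyin import lazy_pinyin, Style

def build_prompt(hypotheses):
    """Combine N-best hypotheses with their Pinyin and request JSON output."""
    lines = []
    for i, hyp in enumerate(hypotheses, 1):
        py = " ".join(lazy_pinyin(hyp, style=Style.TONE3))
        lines.append(f"{i}. {hyp} ({py})")
    return (
        "The following are N-best hypotheses from a Chinese ASR system, "
        "each followed by its Pinyin transcription:\n"
        + "\n".join(lines)
        + '\nReturn the corrected transcription as JSON: {"correction": "..."}'
    )
```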

The fine-tuning prompts are designed specifically for models like ChatGLM, which is well-suited for the Chinese language. The training data combines pairs of hypotheses and transcriptions from the ChineseHP dataset, allowing for improved performance in error correction tasks.
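
Training pairs of this kind are often serialized as JSON Lines; the record layout below is a plausible sketch, not the paper's published format:

```python
import json
from pypinyin import lazy_pinyin, Style

def to_record(hypotheses, transcription):
    """One supervised example: Pinyin-regularized hypotheses -> reference text."""
    hyp_lines = [f"{h} ({' '.join(lazy_pinyin(h, style=Style.TONE3))})"
                 for h in hypotheses]
    # "纠正下面的识别结果" means "correct the recognition results below".
    return {"prompt": "纠正下面的识别结果：\n" + "\n".join(hyp_lines),
            "response": transcription}

pairs = [(["今天天汽很好", "今天天气很好"], "今天天气很好")]  # toy data, not ChineseHP
with open("chinesehp_sft.jsonl", "w", encoding="utf-8") as f:
    for hyps, ref in pairs:
        f.write(json.dumps(to_record(hyps, ref), ensure_ascii=False) + "\n")
```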

Experimental Framework and Findings

To assess the effectiveness of different prompt styles, experiments were conducted using selected samples from the ChineseHP dataset. The experiments focused on how various prompts affected the performance of ChatGPT in correcting errors.

Different prompts were crafted, and their effectiveness was measured using a metric called character error rate reduction (CERR). The results indicated that including Pinyin in the prompts led to significant performance improvements in correcting errors. The accuracy of the model's responses was directly linked to the precision of the Pinyin provided.
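
CER is the character-level edit distance between hypothesis and reference divided by the reference length, and CERR is usually computed as the relative drop in CER after correction (the exact formula in the paper may differ); a self-contained sketch:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

def cerr(cer_before: float, cer_after: float) -> float:
    """Character error rate reduction: relative improvement over the baseline."""
    return (cer_before - cer_after) / cer_before
```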

In an effort to see whether using the best text hypothesis would yield similar benefits, a comparison was made. However, results showed that relying solely on text without Pinyin produced less effective outcomes, highlighting the advantages of integrating Pinyin for better performance.

Fine-tuning with ChatGLM also showed promising results, particularly when Pinyin was included in the training process. Experiments highlighted the challenges faced with more complex tasks, but there were noticeable enhancements in model performance with the use of Pinyin. The findings suggest that integrating Pinyin not only helps with error correction but also supports better understanding by the LLMs.

Case Analysis

Two cases were examined to analyze how different prompts performed in correcting errors. The first case, using content from standard reading samples, demonstrated effective correction with Pinyin regularization even when relying on the best hypothesis. The second case, which involved more complex speech with various errors, showed that while performance dropped due to the challenges, Pinyin regularization still helped reduce mistakes.

Conclusion

This study introduces a significant new dataset for error correction in Chinese ASR, named the Chinese Hypotheses Paradise dataset (ChineseHP). It highlights the importance of a diverse range of speech scenarios and presents a method for improving the precision of LLMs through Pinyin regularization. Moving forward, the focus will be on developing more advanced fine-tuning methods, creating better prompts, and utilizing additional training resources to further refine the capabilities of LLMs for Chinese ASR error correction.

Original Source

Title: Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models

Abstract: Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, much of the research focuses on the English language. This paper redirects the attention to Chinese. Firstly, we construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypotheses-transcription pairs, named the Chinese Hypotheses Paradise dataset (ChineseHP), which contains a wide range of scenarios and presents significant challenges. Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which involves the transcription of Pinyin directly from text hypotheses. The experimental results reveal that Pinyin regularization consistently enhances the error-correcting ability of LLMs when compared with those without regularization. The dataset is available on the website.

Authors: Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang

Last Update: 2024-07-01

Language: English

Source URL: https://arxiv.org/abs/2407.01909

Source PDF: https://arxiv.org/pdf/2407.01909

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
