HK-LegiCoST: Bridging Cantonese Spoken and Written Language
A new corpus for translating Cantonese audio to English text.
― 5 min read
In recent years, interest has grown in translating spoken language into written text, for applications such as automatic video captioning and language learning. Most research, however, has focused on widely spoken languages; languages that are primarily spoken, or whose spoken form differs substantially from the written form, remain understudied. Cantonese is one such language: its standard written form often resembles Mandarin more than it reflects how people actually speak.
To tackle this issue, we have developed HK-LegiCoST, a new corpus of Cantonese-to-English translations. The corpus includes over 600 hours of Cantonese audio, along with transcripts in standard traditional Chinese and English translations. The audio consists of proceedings of the Hong Kong Legislative Council: speeches, discussions, and debates on government policy.
Challenges in Creating the Corpus
Creating this collection poses notable challenges. Chief among them is aligning the spoken audio with its written transcript at the sentence level. Because spoken Cantonese differs substantially from the standard Chinese in which it is transcribed, the transcripts are not verbatim records of the speech, which complicates alignment.
To create this resource, we first had to collect data from various meetings held by the Hong Kong Legislative Council. The meetings span a range of topics related to governance and policy. The next step involved converting video recordings into audio files, and then extracting text from the corresponding transcripts.
Data Collection and Processing
The raw data was gathered from video recordings of council meetings from 2016 to 2021. The meetings addressed various issues such as education reform, housing, healthcare, and economic policies. The first task was to convert these videos into audio files, followed by a process called segmentation, which breaks down the audio into smaller, manageable parts based on the topics discussed.
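The summary does not name the tools used to convert the videos; a common choice is ffmpeg, and a minimal sketch of the extraction step (assuming ffmpeg is installed, with file names chosen for illustration) might look like this:

```python
import subprocess

def extract_audio_cmd(video_path: str, wav_path: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that extracts mono 16 kHz PCM audio,
    a common input format for ASR toolkits."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to 16 kHz
        "-acodec", "pcm_s16le",   # 16-bit PCM WAV
        wav_path,
    ]

def extract_audio(video_path: str, wav_path: str) -> None:
    """Run the extraction (requires ffmpeg on the PATH)."""
    subprocess.run(extract_audio_cmd(video_path, wav_path), check=True)
```

The resulting WAV files can then be segmented into smaller clips for alignment.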
Next, we had to clean up the transcripts from the recordings. This involved filtering out irrelevant information and dividing the text into smaller segments that correspond to the audio clips. We organized the text based on who was speaking and matched it to the audio for easier alignment.
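The exact transcript format is not given in this summary, but assuming (hypothetically) that each speaker turn begins with a name followed by a colon at the start of a line, the speaker-based splitting could be sketched as:

```python
import re

# Hypothetical transcript format: each speaker turn starts with
# "NAME:" at the beginning of a line; everything up to the next
# such marker belongs to that speaker.
TURN_RE = re.compile(r"^(?P<speaker>[^\s:]+):\s*", re.MULTILINE)

def split_speaker_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a raw transcript into (speaker, utterance) pairs."""
    turns = []
    matches = list(TURN_RE.finditer(transcript))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        text = transcript[m.end():end].strip()
        turns.append((m.group("speaker"), text))
    return turns
```

Each (speaker, utterance) pair can then be matched against the corresponding audio clip.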
Aligning Text and Audio
One crucial step in creating our resource is aligning the written text with the audio, which requires matching sentences in the audio to sentences in the transcripts. To do this, we used sentence embeddings, dense vector representations of sentence meaning. By comparing these embeddings, we can find corresponding sentences in the spoken and written forms.
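As a toy illustration of embedding-based matching (the summary does not name the embedding model, and the two-dimensional vectors below are placeholders), the core operation is a cosine-similarity search with a threshold:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_matches(src_embs, tgt_embs, threshold=0.5):
    """For each source sentence embedding, find the most similar
    target sentence; keep the pair only if it clears the threshold."""
    pairs = []
    for i, s in enumerate(src_embs):
        j, score = max(
            ((j, cosine(s, t)) for j, t in enumerate(tgt_embs)),
            key=lambda x: x[1],
        )
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

In practice the embeddings would come from a multilingual sentence encoder, and the threshold filters out sentences with no plausible counterpart.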
We also trained an Automatic Speech Recognition (ASR) model specifically for Cantonese. This model helps to convert the spoken audio back into written text, making it easier to align with the transcripts. However, since the transcripts are not exact matches for what is spoken, this adds an extra layer of difficulty.
First-Pass and Sentence-Level Alignment
To start the alignment process, we performed an initial, rough alignment that matched audio segments to sections of the text. Using voice activity detection tools, we could isolate the parts of the audio that contained speech. After that, we developed a more precise method to align sentences.
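The specific voice activity detection tool is not named in this summary; the idea can be illustrated with a minimal energy-threshold VAD (real systems use more robust detectors):

```python
def energy_vad(samples, frame_len=160, threshold=0.01):
    """Mark each frame as speech if its mean energy exceeds the
    threshold, then merge consecutive speech frames into segments.
    Returns (start_sample, end_sample) pairs."""
    segments = []
    start = None
    for f in range(0, len(samples), frame_len):
        frame = samples[f:f + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = f  # speech begins
        elif start is not None:
            segments.append((start, f))  # speech ends
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

The resulting speech segments give a coarse first-pass alignment that the sentence-level method then refines.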
For longer audio segments, we found it challenging to decode the audio accurately. To manage this, we created a flexible alignment algorithm that breaks down lengthy segments into smaller parts. This algorithm also filters out any text that does not correspond to speech, enhancing the accuracy of our alignments.
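The authors' flexible alignment is a WFST-based algorithm (implemented with k2); as a much-simplified illustration of the same idea, one can search the reference transcript for the contiguous window that best matches an ASR hypothesis, effectively skipping reference text that was never spoken:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance with a rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (x != y),  # substitution or match
            )
    return dp[-1]

def best_reference_window(hyp, ref, width_slack=2):
    """Find the contiguous span of reference tokens that best matches
    the ASR hypothesis, skipping surrounding text (headers, procedural
    notes) that does not correspond to speech."""
    n = len(hyp)
    best = (float("inf"), (0, 0))
    for start in range(len(ref)):
        for end in range(start + max(1, n - width_slack),
                         min(len(ref), start + n + width_slack) + 1):
            d = edit_distance(hyp, ref[start:end])
            if d < best[0]:
                best = (d, (start, end))
    return best[1]
```

This brute-force sketch is quadratic in the reference length; the real algorithm handles hours-long recordings efficiently and recursively splits segments that remain too long.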
Linguistic Features of the Corpus
In analyzing the data, we identified several interesting features of the Cantonese language as represented in our collection. One notable phenomenon is the reordering of words and phrases when spoken Cantonese is rendered in standard Chinese. For example, in a Cantonese double-object construction such as 畀本書我 ("give [the] book [to] me"), the objects appear in the opposite order from the standard Chinese 給我一本書 ("give me a book").
Another feature we noticed is the presence of long context dependencies, meaning that the meaning of certain words or phrases can depend on the preceding text in a document. This is common in formal settings like council meetings, where earlier discussions can influence later statements.
Baseline Experiments
To test our corpus, we established baseline experiments in automatic speech recognition and machine translation. Using our collection, we trained models to transcribe the speech and translate it into English, achieving competitive results with models trained solely on our data.
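Translation baselines like these are conventionally scored with BLEU (the summary does not report the exact metric values). A simplified sentence-level BLEU, with uniform n-gram weights and a brevity penalty, can be sketched as:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(ref: str, hyp: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions, scaled by a brevity penalty. A simplified version
    of the usual corpus-level metric."""
    r, h = ref.split(), hyp.split()
    if not h:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hc, rc = ngram_counts(h, n), ngram_counts(r, n)
        overlap = sum(min(c, rc[g]) for g, c in hc.items())
        total = max(len(h) - n + 1, 0)
        if overlap == 0 or total == 0:
            return 0.0  # no common n-grams at this order
        log_prec += math.log(overlap / total) / max_n
    bp = 1.0 if len(h) >= len(r) else math.exp(1 - len(r) / len(h))
    return bp * math.exp(log_prec)
```

Production evaluations typically use a standardized implementation (e.g. sacreBLEU) rather than hand-rolled scoring, to keep results comparable across papers.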
We also compared our machine translation efforts with existing systems. Our models performed better in translating named entities, which are often difficult for translation systems to handle correctly.
Conclusion
The HK-LegiCoST corpus is an important resource for studying speech recognition and translation for Cantonese. Its 600+ hours of aligned audio and text capture the linguistic characteristics of Cantonese, along with the unique challenges posed by the differences between its spoken and written forms.
By sharing this resource, we aim to contribute to the understanding of how to better translate and recognize spoken languages, particularly those like Cantonese that have their own complexities. This work is a step towards advancing the field of speech translation and improving the technology available for languages that are often overlooked.
Additionally, we are in the process of making this corpus publicly available, as we want others in the research community to benefit from our findings and contribute to future advancements in this area. We appreciate the support and resources provided by the Legislative Council of the Hong Kong Special Administrative Region, which made this project possible.
Title: HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation
Abstract: We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoken and written forms of the source language. Due to its large size, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results deliver insights into speech recognition and translation research in languages for which non-verbatim or "noisy" transcription is common due to various factors, including vernacular and dialectal speech.
Authors: Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur
Last Update: 2023-06-19 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2306.11252
Source PDF: https://arxiv.org/pdf/2306.11252
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.
Reference Links
- https://www.clsp.jhu.edu/wp-content/uploads/2015/10/WS04-dialectical-finalreport.pdf
- https://github.com/BorrisonXiao/espnet-align/tree/master/egs2/commonvoice
- https://github.com/Language-Tools/pinyin-jyutping
- https://github.com/k2-fsa/k2
- https://github.com/DongjiGao/flexible
- https://github.com/m-wiesner/nachos.git
- https://github.com/k2-fsa/icefall
- https://github.com/BorrisonXiao/icefall/blob/master/egs/hklegco/ASR/train.sh