Bridging Language Gaps: A Focus on Indian Languages
Supporting translation for low-resource languages in India.
Hamees Sayed, Advait Joglekar, Srinivasan Umesh
― 6 min read
Table of Contents
- The Challenge of Translating Low-Resource Languages
- Data Collection: The First Step
- Training the Model: Cooking Time
- The Importance of Each Language
- Assamese: The Friendly Neighbor
- Manipuri: The Fast Talker
- Khasi: The Storyteller
- Mizo: The Historical Hero
- The Data Prep: Getting Everything Ready
- Training Day: Recipe in Action
- Inference: The Taste Test
- Evaluation: How Did We Do?
- Limitations: What Could Be Better?
- Conclusion: The Road Ahead
- Original Source
- Reference Links
In our world, there are many languages spoken, but some of them just don't have enough resources for translation. Think of it like having a favorite dish that nobody knows how to cook. We're focusing on four languages from India: Khasi, Mizo, Manipuri, and Assamese. They need a little love in the translation department, and we're here to help!
The Challenge of Translating Low-Resource Languages
Translating these languages can feel like trying to teach a cat to swim. It's tricky! While we’ve made great progress with languages like English and Spanish, Khasi, Mizo, Manipuri, and Assamese are left scratching their heads. Why? They don’t have enough bilingual resources, like books or websites, for machines to learn from.
Data Collection: The First Step
Our first step was to gather data. We searched high and low, but in a digital way, of course. We used datasets from various sources, aiming to collect as much bilingual material as we could. It’s like gathering ingredients for a fancy recipe – we needed the right mix to get started.
Since there wasn't much parallel data available for Khasi and Mizo, we used a trick called back-translation. Imagine you have a pile of sentences written only in Khasi or Mizo. You run them through an existing translation system to get rough English versions, and suddenly you have English-Khasi and English-Mizo pairs for the model to learn from. It's like playing telephone, but with fewer giggles and more words!
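Here is a rough sketch of what that trick can look like in code, using the Hugging Face transformers library. The checkpoint name, the Mizo example sentence, and the language codes are illustrative choices for the sketch, not a copy of the authors' exact pipeline (Khasi has no NLLB code, so it needs the extra tokens described later).

```python
# A minimal back-translation sketch: turn monolingual Mizo into synthetic
# English-Mizo pairs by translating it into English with an existing model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-3.3B"  # assumed checkpoint id, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="lus_Latn")  # Mizo
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def back_translate(mizo_sentences):
    """Return (synthetic English, original Mizo) pairs for training."""
    inputs = tokenizer(mizo_sentences, return_tensors="pt",
                       padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=256,
    )
    english = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return list(zip(english, mizo_sentences))

pairs = back_translate(["Ka lawm e."])  # toy monolingual example
```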
Training the Model: Cooking Time
Now that we have our ingredients, it's time to cook! We used a super-smart translation model called NLLB 3.3B (short for No Language Left Behind, with 3.3 billion parameters). Think of it as a digital chef with 3.3 billion thoughts running through its head.
We started with something called masked language modeling. Don't worry, no masks were worn during this process! We simply hid some of the words in our monolingual sentences and asked the model to guess them, which helps it get comfortable with each language before translation training, so it wouldn't trip on its own shoelaces later.
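As a rough illustration (not the authors' exact recipe), the masking step can be as simple as hiding a fraction of token ids and keeping the originals around as the training target; the mask id and token ids below are placeholders.

```python
# Toy masking helper: hide ~15% of tokens so the model must reconstruct the
# original sentence, learning the language from monolingual text alone.
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Return (masked_input, labels); labels are the untouched original ids."""
    rng = random.Random(seed)
    masked = [mask_id if rng.random() < mask_prob else tok for tok in token_ids]
    return masked, list(token_ids)

masked_input, labels = mask_tokens([101, 2023, 2003, 1037, 7953, 102], mask_id=103)
```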
Next, we fine-tuned the model to translate from English to our four languages and back. For Khasi, which needed a bit of extra pampering because it wasn't already supported by NLLB, we added special language tokens. It's like giving it a unique spice so that it can handle the local flavors!
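A hedged sketch of that extra pampering: register a new language token for Khasi and grow the embedding table so the model has a slot for it. The tag "kha_Latn" is our own made-up code in the NLLB naming style, not an official one.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-3.3B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# "kha_Latn" is an illustrative, unofficial language tag for Khasi.
tokenizer.add_tokens(["kha_Latn"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # give the new token an embedding
```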
The Importance of Each Language
Let's talk a bit about our stars of the show!
Assamese: The Friendly Neighbor
Assamese is spoken in Assam, the land of tea and elephants! With over 15 million speakers, it’s kind of a big deal. This language has a long history, from being the official language in royal courts to being loved by millions today.
Manipuri: The Fast Talker
Manipuri is the cool kid from Manipur. With about 1.76 million speakers, it’s the most popular Tibeto-Burman language in India. If there's ever a race for growth, Manipuri would be sprinting right behind Hindi and Kashmiri!
Khasi: The Storyteller
Khasi is like the wise elder in Meghalaya. Roughly 1 million people speak it, and it carries rich stories and traditions. It’s often written in the Latin script, which is a bit like giving it a modern twist!
Mizo: The Historical Hero
Mizo is a language from Mizoram, spoken by around 800,000 people. It has a rich oral history and was brought to life in writing in the late 19th century. Imagine Mizo as the storyteller of the family, sharing tales of yore using the Latin script.
The Data Prep: Getting Everything Ready
Before we could put our model to work, everything needed to be prepped and polished. We used a toolkit called Moses (not the guy who split seas, but a handy piece of software!) to smooth out our text data.
We got rid of the pesky non-printable characters – they're the digital equivalent of crumbs that just don't belong on a plate. Then we normalized punctuation and whitespace so the text looked the same across different sources. Consistency is key, just like in a good recipe!
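A small cleaning sketch along those lines, using the sacremoses port of the Moses punctuation normalizer; the exact scripts and settings the authors used may differ.

```python
import re
from sacremoses import MosesPunctNormalizer

normalizer = MosesPunctNormalizer(lang="en")
# Control characters: the "crumbs" that don't belong on the plate.
non_printable = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_line(line: str) -> str:
    line = non_printable.sub("", line)   # drop non-printable characters
    line = normalizer.normalize(line)    # consistent quotes, dashes, spacing
    return " ".join(line.split())        # collapse stray whitespace
```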
Training Day: Recipe in Action
The training process took place on some powerful computers. We used Nvidia A6000 GPUs – think of them as the race cars of computers. They helped us speed up the process while making sure the cooking was just right.
The NLLB model is built on what we call a "Transformer" architecture – a design based on attention, which lets the model weigh which words matter most to each other when translating. That's a fancy way of saying our digital chef has a lot of tools and techniques up its sleeve to make translations better.
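To picture the setup, here is an illustrative fine-tuning configuration with Hugging Face's Seq2SeqTrainingArguments; the batch size, learning rate, and epoch count are guesses for the sketch, not the paper's reported settings.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-indic-finetune",
    per_device_train_batch_size=8,     # illustrative; tune to fit GPU memory
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    num_train_epochs=3,
    fp16=True,                         # mixed precision suits the A6000 GPUs
    save_strategy="epoch",
    logging_steps=100,
)
```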
Inference: The Taste Test
After cooking up our translation model, it was time for the taste test! We used something called beam search to get the best translations possible: instead of committing to its first word choice, the model keeps several promising partial translations going and serves up the best finished one. Imagine sampling a few slices of cake in a bakery before picking the fluffiest, creamiest piece!
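A decoding sketch with beam search; the beam width of 5 and the English-to-Assamese example are illustrative choices, as is the checkpoint name.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-3.3B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The tea gardens of Assam are beautiful.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("asm_Beng"),  # Assamese
    num_beams=5,       # keep the 5 most promising partial translations alive
    max_length=128,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```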
Evaluation: How Did We Do?
We needed to know if our model was worth its weight in flour. We used various scoring methods, including BLEU scores, to measure performance. We found that while Assamese translations did pretty well, Khasi, Mizo, and Manipuri needed a bit more work.
For instance, English to Khasi translations scored low, kind of like a poorly made sandwich. Meanwhile, Manipuri translations faced some challenges, making us realize that our back-translated data didn’t always hit the mark.
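A minimal scoring sketch with the sacrebleu library; the sentences below are toy placeholders, not the shared-task test data.

```python
import sacrebleu

hypotheses = ["the tea gardens of assam are very beautiful"]
references = [["the tea gardens of assam are beautiful"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```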
Limitations: What Could Be Better?
Even our model had its days where it wasn’t quite on point. One issue was our limited dataset size. Think of it as having a tiny kitchen with not enough pots and pans to cook a feast. A bigger dataset could help the model work wonders.
The quality of our back-translated data was another hiccup. Sometimes, the food doesn’t taste as good when it’s reheated. This means we need to sharpen our data generation techniques for the future.
We also noticed a gap between how well the model translated into English compared to how well it translated into the Indic languages. It's like our model could dance the tango perfectly but stumbled trying to do the cha-cha.
Lastly, our data might not truly represent the richness of real-life language use. It’s like training someone to cook using only one recipe instead of a whole cookbook.
Conclusion: The Road Ahead
In the end, our adventure into low-resource language translation opened our eyes to the challenges and opportunities ahead. While we made some progress, there's still room for improvement.
By refining our models and gathering better data, we can hope to serve up translations that are as delightful as a homemade meal. Here’s to a future where Khasi, Mizo, Manipuri, and Assamese flourish in the world of translation, making it a little less lonely for these beautiful languages!
Title: SPRING Lab IITM's submission to Low Resource Indic Language Translation Shared Task
Abstract: We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese. Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation, leveraging data from WMT task datasets, BPCC, PMIndia, and OpenLanguageData. To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi, significantly expanding our training corpus. We fine-tune the pre-trained NLLB 3.3B model for Assamese, Mizo, and Manipuri, achieving improved performance over the baseline. For Khasi, which is not supported by the NLLB model, we introduce special tokens and train the model on our Khasi corpus. Our training involves masked language modelling, followed by fine-tuning for English-to-Indic and Indic-to-English translations.
Authors: Hamees Sayed, Advait Joglekar, Srinivasan Umesh
Last Update: Nov 11, 2024
Language: English
Source URL: https://arxiv.org/abs/2411.00727
Source PDF: https://arxiv.org/pdf/2411.00727
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://www.latex-project.org/help/documentation/encguide.pdf
- https://ai4bharat.iitm.ac.in/bpcc/
- https://github.com/openlanguagedata/seed
- https://censusindia.gov.in/
- https://google.translate.com/
- https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py