Advancing Speech Technology for African Languages
New dataset AfroDigits aims to improve speech recognition in African languages.
The growth of speech technology has been impressive, but African languages remain poorly served. A shortage of audio data in these languages has left them with limited support in speech recognition tools. To address this gap, AfroDigits was created: a dataset of spoken digits covering 38 African languages. It is intended to support the development of speech applications, such as recognizing spoken telephone numbers.
Datasets play a crucial role in improving deep learning models for natural language processing (NLP). A well-known example is ImageNet, which demonstrated how effective deep neural networks can be for image recognition. In general, the more high-quality data available for a task, the better a model can become. In speech processing, end-to-end deep learning models have advanced automatic speech recognition (ASR) and text-to-speech synthesis (TTS). Yet, for lack of data, many existing technologies do not support African languages.
When African languages are not included in speech technologies, it risks overshadowing the identities and cultures of those who speak them. The AfroDigits project aims to fill the gap by creating a spoken digits dataset that caters to all African languages. This effort uses a community-based approach, encouraging local involvement in building the dataset.
This article first covers the motivation behind AfroDigits and related data collection efforts, then describes the project and the dataset itself, and finally presents the experiments conducted with the dataset and discusses the results.
Related Efforts in Speech Corpora
There have been various attempts to create speech datasets for different processing tasks. Some prominent datasets, like LibriSpeech and TIMIT, have made significant contributions, but they do not support African languages. More recently, multilingual datasets such as VoxForge and Mozilla's Common Voice have emerged, yet the number of African languages they represent remains low. In Common Voice, for instance, Kinyarwanda is the only African language with more than 1,000 hours of audio.
While some projects have aimed to fill this gap, most have focused on general text and speech corpora rather than digits. The Free Spoken Digit Dataset (FSDD), the closest in use case to AfroDigits, is primarily English-based. AfroDigits contributes to the community by focusing on recorded digits in African languages.
The AfroDigits Project
AfroDigits is designed as a community-driven tool for collecting audio digit data. The choice of spoken digits was intentional, aiming to create a simple dataset that could be beneficial for speech processing tasks. This dataset can serve educational purposes, such as helping researchers and practitioners learn about speech processing in their native languages.
A major factor in the project's success is ease of participation. The team built an online platform that requires no technical skills: in a simple, engaging recording flow, participants see an image of a number and recite it aloud. After recording all digits from 0 to 9, they receive a congratulatory message encouraging them to keep recording.
To promote participation, an initiative called the African Digits Recording Sprint was launched, lasting for one month. Through advertisements and engagement with communities, native speakers were encouraged to join in. To gather additional information, optional fields were provided for participants to share their age, gender, accent, and country of residence while ensuring that no personal information, like names or addresses, was collected.
The Dataset
Currently, AfroDigits includes 2,185 audio samples across 38 African languages. The dataset is available for download, but requires users to provide some details before accessing it. The dataset is organized into directories, each containing audio files along with metadata that includes audio IDs, language names, and participant information.
In terms of participation, Oshiwambo received by far the most recordings, totaling 1,721. The dataset is structured so that researchers can load it directly into their training pipelines, making it easy to use in a range of applications.
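Given that each clip ships with metadata such as an audio ID and a language name, coverage per language can be tallied with a few lines of plain Python. This is an illustrative sketch: the field names ("audio_id", "language", "digit") and the records are hypothetical, not the dataset's actual schema.

```python
from collections import Counter

# Hypothetical per-clip metadata records, mimicking the described layout.
records = [
    {"audio_id": "0001", "language": "ibo", "digit": 3},
    {"audio_id": "0002", "language": "kua", "digit": 7},
    {"audio_id": "0003", "language": "kua", "digit": 0},
]

# Count recordings per language to see coverage at a glance.
per_language = Counter(r["language"] for r in records)
print(per_language)  # Counter({'kua': 2, 'ibo': 1})
```

The same pattern extends to grouping by digit or by any optional participant field (age, gender, accent) when auditing the dataset's balance.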
Experimental Setup
To showcase the usability of AfroDigits, experiments were conducted with pretrained speech models. The focus was on six African languages: Igbo, Yoruba, Rundi, Oshiwambo, Shona, and Oromo. Each model used in the experiments had different pretraining backgrounds.
Pretrained speech models are neural network models trained on extensive audio datasets. They learn distinct features from sound, which can later be applied to various tasks. In this research, two powerful models were used: Wav2Vec2.0-Large and XLS-R.
The Wav2Vec2.0-Large model was pretrained on audio from an English-only dataset. In contrast, XLS-R was pretrained on audio spanning 128 languages, including several African languages. This broader coverage motivated the expectation that XLS-R would perform better at recognizing spoken digits in African languages.
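To make the classification setup concrete, here is a toy, pure-Python sketch of the general pattern: frame-level embeddings from a pretrained encoder are pooled into one utterance vector, then a small linear head scores the ten digit classes. The numbers and weights below are invented for illustration; this is not the actual Wav2Vec2.0/XLS-R pipeline, which operates on learned high-dimensional features.

```python
def mean_pool(frames):
    """Average frame-level embeddings into a single utterance vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def classify(utterance_vec, weights, bias):
    """Linear head over the pooled embedding; returns the argmax digit 0-9."""
    scores = [
        sum(w * x for w, x in zip(weights[d], utterance_vec)) + bias[d]
        for d in range(10)
    ]
    return max(range(10), key=lambda d: scores[d])

# Two toy frame embeddings of dimension 2, pooled to [2.0, 0.0].
frames = [[1.0, 0.0], [3.0, 0.0]]
pooled = mean_pool(frames)

# Hand-built head weights so that digit "2" wins for this input.
weights = [[0.0, 0.0]] * 10
weights[2] = [1.0, 0.0]
bias = [0.0] * 10
print(classify(pooled, weights, bias))  # 2
```

In the real experiments the encoder and head weights are learned during fine-tuning rather than hand-set; the sketch only shows how a pooled representation feeds a 10-way digit classifier.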
To tackle the challenge of class imbalance, a weighted sampling technique was employed. This ensured that languages with fewer samples were still adequately represented during training, preventing the model from favoring languages with more data.
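The idea behind inverse-frequency weighted sampling can be sketched in plain Python. The labels and counts below are illustrative: each sample's draw weight is set inversely proportional to its class size, so every language contributes equal probability mass in expectation.

```python
from collections import Counter

# Hypothetical per-sample language labels with a heavy imbalance.
labels = ["kua"] * 8 + ["ibo"] * 2

counts = Counter(labels)
# Weight each sample by 1 / (size of its class).
weights = [1.0 / counts[lang] for lang in labels]

# Total draw mass per language is now equal: 8 * (1/8) == 2 * (1/2) == 1.0
mass = {lang: sum(w for l, w in zip(labels, weights) if l == lang)
        for lang in counts}
print(mass)  # {'kua': 1.0, 'ibo': 1.0}
```

In a PyTorch training loop, such per-sample weights would typically be handed to `torch.utils.data.WeightedRandomSampler` so that minibatches are drawn with the rebalanced probabilities.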
Results and Discussion
Following the experiments, model performance was analyzed per language. The XLS-R model generally performed better overall. Moreover, mixing training data from multiple languages improved results, especially for the languages with the weakest baseline recognition.
However, despite these advances, certain languages still showed low performance, reinforcing the need for more data to improve overall recognition. On the positive side, the results highlight how a multilingual approach during training improves outcomes for low-resource languages.
Limitations of AfroDigits
While AfroDigits offers a significant contribution to the available datasets for African languages, the initial dataset size is a concern. Some languages have very few samples, which limits their effectiveness in training models. The project is ongoing, with plans to expand the dataset as more recordings are collected.
AfroDigits stands as a pioneering effort in creating a minimalist, community-driven dataset of spoken digits in African languages. It aims to bridge the gap in existing speech datasets, allowing for broader and more inclusive applications in speech technology. The hope is that as more people engage with the platform, the dataset will continue to grow, offering even more resources for research, education, and practical applications in African languages.
Title: AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages
Abstract: The advancement of speech technologies has been remarkable, yet its integration with African languages remains limited due to the scarcity of African speech corpora. To address this issue, we present AfroDigits, a minimalist, community-driven dataset of spoken digits for African languages, currently covering 38 African languages. As a demonstration of the practical applications of AfroDigits, we conduct audio digit classification experiments on six African languages [Igbo (ibo), Yoruba (yor), Rundi (run), Oshiwambo (kua), Shona (sna), and Oromo (gax)] using the Wav2Vec2.0-Large and XLS-R models. Our experiments reveal a useful insight on the effect of mixing African speech corpora during finetuning. AfroDigits is the first published audio digit dataset for African languages and we believe it will, among other things, pave the way for Afro-centric speech applications such as the recognition of telephone numbers, and street numbers. We release the dataset and platform publicly at https://huggingface.co/datasets/chrisjay/crowd-speech-africa and https://huggingface.co/spaces/chrisjay/afro-speech respectively.
Authors: Chris Chinenye Emezue, Sanchit Gandhi, Lewis Tunstall, Abubakar Abid, Josh Meyer, Quentin Lhoest, Pete Allen, Patrick Von Platen, Douwe Kiela, Yacine Jernite, Julien Chaumond, Merve Noyan, Omar Sanseviero
Last Update: 2023-04-03 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2303.12582
Source PDF: https://arxiv.org/pdf/2303.12582
Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.