

Advancing Video-Text Tasks in Indonesian Language

New dataset enhances video-text tasks for Indonesian speakers.



Boosting Indonesian video-text models: a new dataset fuels Indonesian video-text model advancements.

Multimodal learning is important in fields where video and text data are used together. This study focuses on making video and text work together in tasks like finding videos based on text, describing videos in text, and retrieving text from videos. While many models handle these tasks, most have been built for English. Development for other languages, including Indonesian, lags behind, even though Indonesian is spoken by many people. This is likely because no public dataset for these tasks has been available in Indonesian.

To change this, we have created the first public dataset for Indonesian video and text. We translated English sentences from a well-known English video-text dataset called MSVD into Indonesian. The new MSVD-Indonesian dataset consists of 1970 videos and around 80,000 sentences. With this dataset, we tested various models that were created for the English dataset on three main tasks: finding videos from text, finding text from videos, and creating captions for videos.
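The paper does not name the translation tool it used. As a rough sketch of the translation step, here is how it could look in Python, with the open Helsinki-NLP opus-mt-en-id model standing in for the actual service (the model choice and the example captions are assumptions for illustration):

```python
# Sketch: translating MSVD English captions to Indonesian.
# The authors used a generic translation tool; the Helsinki-NLP
# model below is a stand-in, not their actual pipeline.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-id")

english_captions = [
    "a man is playing a guitar",
    "a dog is running in the park",
]

# Each result is a dict with a "translation_text" field.
indonesian_captions = [
    out["translation_text"] for out in translator(english_captions)
]
print(indonesian_captions)
```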

Recent models rely on feature extractors pretrained on English datasets, and there are few comparable resources for training with Indonesian. This raises the question of how effective these models can be on our dataset. To tackle this, we used a technique called cross-lingual transfer learning: we took models pretrained on English data and fine-tuned them on our Indonesian dataset. The results of our tests show that this approach leads to better results across all tasks.

In conclusion, we believe our dataset, along with the results, will help researchers in the field. It opens new opportunities to advance the study of video and text tasks in Indonesian. The dataset can be found on GitHub.

Overview of Multimodal Machine Learning

Multimodal machine learning combines different types of data, like text, audio, and video, to create more comprehensive models. This growing area is especially important for tasks that link video and text, including retrieving videos based on a text query or generating a text description for a given video.

In text-to-video retrieval, users provide a text prompt, and the system retrieves relevant videos. Video-to-text retrieval works the other way around: users provide a video, and the system finds related text. The goal of video captioning is to produce a descriptive sentence for a given video. For all these tasks, a suitable dataset with pairs of videos and text is essential for training effective models.
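Both retrieval directions boil down to ranking candidates by the similarity between video and text embeddings. A minimal sketch of text-to-video retrieval, using random placeholder embeddings instead of a real encoder:

```python
# Sketch: text-to-video retrieval as cosine-similarity ranking.
# `video_embs` and `text_emb` are assumed to come from some
# video/text encoder; here they are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
video_embs = rng.normal(size=(1970, 512))   # one embedding per video
text_emb = rng.normal(size=(512,))          # embedding of the query sentence

# Normalize so the dot product equals cosine similarity.
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb)

scores = video_embs @ text_emb
top5 = np.argsort(-scores)[:5]   # indices of the 5 best-matching videos
print(top5)
```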

Most video-text datasets available today are in English. Only a few datasets in other languages, like Chinese or Turkish, exist. Since Indonesian is spoken by many individuals worldwide, the absence of a public dataset in this language limits progress in the research of video-text tasks. Therefore, we set out to create and provide the first public Indonesian video-text dataset by translating the MSVD dataset into Indonesian.

Creation of MSVD-Indonesian Dataset

The original MSVD dataset includes 2089 videos. Some videos were removed from YouTube, so our work only includes 1970 of these videos. We collected 80,827 sentences that accompany these videos from the English version of the dataset and translated them into Indonesian using a translation tool. Each video in the MSVD-Indonesian dataset has the same number of sentences as in the MSVD dataset, allowing for a one-to-one comparison.

Using a translation service may lead to errors. Our translation process resulted in some sentences having incorrect grammar or content. However, many sentences were well-translated, preserving the overall meaning. In cases where the translation was incorrect, we kept the sentences as they were, treating these inaccuracies as noise in our dataset.

Analysis of the Dataset

We compared the MSVD dataset and the MSVD-Indonesian dataset to see how they differ. The most frequently used words show similar patterns in both datasets; for instance, articles common in the English captions have counterparts in the Indonesian captions, but at different frequencies because the two languages are structured differently.

Furthermore, the MSVD dataset has more unique vocabulary words than the MSVD-Indonesian dataset, and its average sentence length is longer. These differences suggest that a model that excels on the MSVD dataset may not perform as well on the MSVD-Indonesian dataset.
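Statistics like vocabulary size and average sentence length are easy to reproduce once the caption lists are loaded. A sketch with naive whitespace tokenization and two toy captions per language (the tokenization scheme is an assumption; the paper does not specify one):

```python
# Sketch: vocabulary size and average sentence length for a caption list.
# In practice `captions` would hold the ~80k sentences of one dataset.
def corpus_stats(captions):
    vocab = set()
    total_tokens = 0
    for sentence in captions:
        tokens = sentence.lower().split()   # naive whitespace tokenization
        vocab.update(tokens)
        total_tokens += len(tokens)
    return len(vocab), total_tokens / len(captions)

english = ["a man is playing a guitar", "a dog runs in the park"]
indonesian = ["seorang pria sedang bermain gitar", "seekor anjing berlari di taman"]

print("EN vocab size, avg length:", corpus_stats(english))
print("ID vocab size, avg length:", corpus_stats(indonesian))
```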

Video-Text Retrieval Tasks

For video-text retrieval, we focused on two primary tasks: text-to-video retrieval and video-to-text retrieval. In both cases, the model ranks candidate videos or texts by how well they match the provided input. We employed X-CLIP, a model that has proven effective in these tasks.

X-CLIP uses a pretrained CLIP model, which has been trained on a large-scale dataset involving images and text. We fine-tuned the X-CLIP model on our Indonesian video-text dataset to determine how well it could perform for both retrieval tasks.

We also analyzed how using a pretrained visual encoder from the English dataset impacts the X-CLIP model’s performance. Results showed that employing the pretrained features significantly boosted performance, even though the text encoder was not specifically tailored for Indonesian.
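Conceptually, this fine-tuning uses a symmetric contrastive objective over matched video-sentence pairs, as in CLIP-style training. A minimal PyTorch sketch of one training step, with simple linear layers standing in for X-CLIP's actual pretrained encoders:

```python
# Sketch: symmetric contrastive (InfoNCE-style) fine-tuning step.
# The linear `video_encoder` and `text_encoder` are placeholders for
# X-CLIP's pretrained components; the real model is far more involved.
import torch
import torch.nn.functional as F

video_encoder = torch.nn.Linear(2048, 512)   # placeholder encoder
text_encoder = torch.nn.Linear(768, 512)     # placeholder encoder
params = list(video_encoder.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)

video_feats = torch.randn(8, 2048)   # a batch of video features
text_feats = torch.randn(8, 768)     # the matching Indonesian sentences

v = F.normalize(video_encoder(video_feats), dim=-1)
t = F.normalize(text_encoder(text_feats), dim=-1)
logits = v @ t.T / 0.07              # pairwise similarities, temperature 0.07

targets = torch.arange(8)            # the i-th video matches the i-th sentence
loss = (F.cross_entropy(logits, targets)
        + F.cross_entropy(logits.T, targets)) / 2
loss.backward()
optimizer.step()
```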

Video Captioning Task

We also addressed the video captioning task, where the goal is to generate a descriptive sentence for a given video. For this, we applied a model called VNS-GRU, which uses semantic features extracted by a pretrained SCD model; that SCD model was trained on the English version of the MSVD dataset.

Our experiments indicated that using the SCD model helped enhance the generated captions in terms of detail and relevance. Even without direct training on Indonesian data, the model managed to provide relevant and coherent sentences for the videos.
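VNS-GRU itself combines a GRU decoder with semantic features and other refinements; as a heavily simplified illustration of the general captioning setup (not the authors' architecture), a bare-bones GRU decoder might look like this:

```python
# Sketch: a bare-bones GRU caption decoder. VNS-GRU layers semantic
# features and further refinements on top of this basic pattern.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, video_dim=2048, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(video_dim, hidden)  # video feature -> initial state
        self.embed = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feat, tokens):
        h0 = torch.tanh(self.init_h(video_feat)).unsqueeze(0)
        x = self.embed(tokens)
        y, _ = self.gru(x, h0)
        return self.out(y)                          # next-token logits

decoder = CaptionDecoder()
video_feat = torch.randn(4, 2048)                   # batch of 4 video features
tokens = torch.randint(0, 10000, (4, 12))           # previous caption tokens
logits = decoder(video_feat, tokens)                # shape (4, 12, vocab_size)
```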

Experimental Results

We evaluated the performance of our models using various metrics to measure their effectiveness in retrieval and captioning tasks. In retrieval tasks, we looked at metrics like recall, which tracks how many relevant items were found in the top search results. For captioning tasks, we assessed how well the generated sentences matched the expected outputs using several standard metrics.
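Recall@K, the main retrieval metric, can be computed directly from a ranked similarity matrix. A sketch assuming query i's correct video sits at index i (the usual convention for paired test sets):

```python
# Sketch: Recall@K for text-to-video retrieval, assuming query i's
# ground-truth video has index i in the similarity matrix.
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j] = similarity between query i and video j
    ranks = np.argsort(-sim, axis=1)          # best-scoring videos first
    correct = np.arange(sim.shape[0])[:, None]
    hits = (ranks[:, :k] == correct).any(axis=1)
    return hits.mean()

sim = np.random.default_rng(0).normal(size=(100, 100))
print("R@1:", recall_at_k(sim, 1), "R@5:", recall_at_k(sim, 5))
```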

In our study, we found that the pretrained models helped improve the results across all tasks. However, certain configurations or settings were more successful than others. For example, using an optimal number of sample annotations during the training phase yielded better results than using a fixed number.

Future Directions

Our work leaves room for further exploration. There are several avenues researchers can take to enhance the current models and the dataset itself:

  1. Pretraining on Indonesian Data: Future research could focus on creating a large-scale Indonesian vision-language dataset for pretraining models, improving their performance further.

  2. Multilingual Capabilities: Developing models that can produce outputs in multiple languages for each video would be an exciting area to explore, especially since the current dataset has pairs of sentences in English and Indonesian.

  3. Addressing Noise: Investigating the effects of noise within our dataset and developing robust algorithms could lead to better performance and more reliable outputs.

Conclusion

The MSVD-Indonesian dataset represents a significant step forward in multimodal machine learning for the Indonesian language. By creating this dataset, we provide researchers with a valuable resource to develop and test new models for video-text tasks. Our results indicate that existing English-based models can also work effectively on our Indonesian dataset with some adjustments.

We hope that this work will inspire further research and innovation in the field of multimodal learning, leading to a better understanding of video and text relationships in languages beyond English.

Original Source

Title: MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian

Abstract: Multimodal learning on video and text data has been receiving growing attention from many researchers in various research tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Although many algorithms have been proposed for those challenging tasks, most of them are developed on English language datasets. Despite Indonesian being one of the most spoken languages in the world, the research progress on the multimodal video-text with Indonesian sentences is still under-explored, likely due to the absence of the public benchmark dataset. To address this issue, we construct the first public Indonesian video-text dataset by translating English sentences from the MSVD dataset to Indonesian sentences. Using our dataset, we then train neural network models which were developed for the English video-text dataset on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning. The recent neural network-based approaches to video-text tasks often utilized a feature extractor that is primarily pretrained on an English vision-language dataset. Since the availability of the pretraining resources with Indonesian sentences is relatively limited, the applicability of those approaches to our dataset is still questionable. To overcome the lack of pretraining resources, we apply cross-lingual transfer learning by utilizing the feature extractors pretrained on the English dataset, and we then fine-tune the models on our Indonesian dataset. Our experimental results show that this approach can help to improve the performance for the three tasks on all metrics. Finally, we discuss potential future works using our dataset, inspiring further research in the Indonesian multimodal video-text tasks. We believe that our dataset and our experimental results could provide valuable contributions to the community. Our dataset is available on GitHub.

Authors: Willy Fitra Hendria

Last Update: 2023-06-20

Language: English

Source URL: https://arxiv.org/abs/2306.11341

Source PDF: https://arxiv.org/pdf/2306.11341

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
