

Advancing Video-Text Tasks in Indonesian Language

New dataset enhances video-text tasks for Indonesian speakers.



Boosting Indonesian video-text models: a new dataset fuels Indonesian video-text model advancements.

Multimodal learning is important in fields where video and text data are used together. This study focuses on making video and text work together in tasks like finding videos based on text, describing videos in text, and retrieving text from videos. While many models handle these tasks, most have been built for English. Development for other languages, including Indonesian, lags behind, even though Indonesian is spoken by many people. This is likely because no public dataset for these tasks has been available in Indonesian.

To change this, we have created the first public dataset for Indonesian video and text. We translated English sentences from a well-known English video-text dataset called MSVD into Indonesian. The new MSVD-Indonesian dataset consists of 1970 videos and around 80,000 sentences. With this dataset, we tested various models that were created for the English dataset on three main tasks: finding videos from text, finding text from videos, and creating captions for videos.
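The paper does not name the translation tool it used. As a rough sketch of the translation step, here is how it could look in Python, with the open Helsinki-NLP opus-mt-en-id model standing in for the actual service (the model choice and the example captions are assumptions for illustration):

```python
# Sketch: translating MSVD English captions to Indonesian.
# The authors used a generic translation tool; the Helsinki-NLP
# model below is a stand-in, not their actual pipeline.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-id")

english_captions = [
    "a man is playing a guitar",
    "a dog is running in the park",
]

# Each result is a dict with a "translation_text" field.
indonesian_captions = [
    out["translation_text"] for out in translator(english_captions)
]
print(indonesian_captions)
```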

Recent models rely on feature extractors pretrained on English datasets, and there are few comparable resources for training with Indonesian. This raises the question of how effective these models can be on our dataset. To tackle this, we used a technique called cross-lingual transfer learning: we took models pretrained on English data and fine-tuned them on our Indonesian dataset. The results of our tests show that this approach leads to better results across all tasks.

In conclusion, we believe our dataset, along with the results, will help researchers in the field. It opens new opportunities to advance the study of video and text tasks in Indonesian. The dataset can be found on GitHub.

Overview of Multimodal Machine Learning

Multimodal machine learning combines different types of data, like text, audio, and video, to create more comprehensive models. This growing area is especially important for tasks that link video and text, including retrieving videos based on a text query or generating a text description for a given video.

In text-to-video retrieval, users provide a text prompt, and the system retrieves relevant videos. Video-to-text retrieval works the other way around: users provide a video, and the system finds related text. The goal of video captioning is to produce a descriptive sentence for a given video. For all these tasks, a suitable dataset with pairs of videos and text is essential for training effective models.
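Both retrieval directions boil down to ranking candidates by the similarity between video and text embeddings. A minimal sketch of text-to-video retrieval, using random placeholder embeddings instead of a real encoder:

```python
# Sketch: text-to-video retrieval as cosine-similarity ranking.
# `video_embs` and `text_emb` are assumed to come from some
# video/text encoder; here they are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
video_embs = rng.normal(size=(1970, 512))   # one embedding per video
text_emb = rng.normal(size=(512,))          # embedding of the query sentence

# Normalize so the dot product equals cosine similarity.
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb)

scores = video_embs @ text_emb
top5 = np.argsort(-scores)[:5]   # indices of the 5 best-matching videos
print(top5)
```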

Most video-text datasets available today are in English. Only a few datasets in other languages, like Chinese or Turkish, exist. Since Indonesian is spoken by many individuals worldwide, the absence of a public dataset in this language limits progress in the research of video-text tasks. Therefore, we set out to create and provide the first public Indonesian video-text dataset by translating the MSVD dataset into Indonesian.

Creation of MSVD-Indonesian Dataset

The original MSVD dataset includes 2089 videos. Some videos were removed from YouTube, so our work only includes 1970 of these videos. We collected 80,827 sentences that accompany these videos from the English version of the dataset and translated them into Indonesian using a translation tool. Each video in the MSVD-Indonesian dataset has the same number of sentences as in the MSVD dataset, allowing for a one-to-one comparison.

Using a translation service may lead to errors. Our translation process resulted in some sentences having incorrect grammar or content. However, many sentences were well-translated, preserving the overall meaning. In cases where the translation was incorrect, we kept the sentences as they were, treating these inaccuracies as noise in our dataset.

Analysis of the Dataset

We compared the MSVD dataset and the MSVD-Indonesian dataset to see how they differ. The most frequently used words show similar patterns in both datasets; for instance, articles common in the English captions have counterparts in the Indonesian captions, but at different frequencies because the two languages are structured differently.

Furthermore, the MSVD dataset has more unique vocabulary words than the MSVD-Indonesian dataset, and its average sentence length is longer. These differences suggest that a model that excels on the MSVD dataset may not perform as well on the MSVD-Indonesian dataset.
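Statistics like vocabulary size and average sentence length are easy to reproduce once the caption lists are loaded. A sketch with naive whitespace tokenization and two toy captions per language (the tokenization scheme is an assumption; the paper does not specify one):

```python
# Sketch: vocabulary size and average sentence length for a caption list.
# In practice `captions` would hold the ~80k sentences of one dataset.
def corpus_stats(captions):
    vocab = set()
    total_tokens = 0
    for sentence in captions:
        tokens = sentence.lower().split()   # naive whitespace tokenization
        vocab.update(tokens)
        total_tokens += len(tokens)
    return len(vocab), total_tokens / len(captions)

english = ["a man is playing a guitar", "a dog runs in the park"]
indonesian = ["seorang pria sedang bermain gitar", "seekor anjing berlari di taman"]

print("EN vocab size, avg length:", corpus_stats(english))
print("ID vocab size, avg length:", corpus_stats(indonesian))
```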

Video-Text Retrieval Tasks

For video-text retrieval, we focused on two primary tasks: text-to-video retrieval and video-to-text retrieval. In both cases, the model ranks candidate videos or texts by how well they match the provided input. We employed X-CLIP, a model that has proven effective in these tasks.

X-CLIP uses a pretrained CLIP model, which has been trained on a large-scale dataset involving images and text. We fine-tuned the X-CLIP model on our Indonesian video-text dataset to determine how well it could perform for both retrieval tasks.

We also analyzed how using a pretrained visual encoder from the English dataset impacts the X-CLIP model’s performance. Results showed that employing the pretrained features significantly boosted performance, even though the text encoder was not specifically tailored for Indonesian.
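Conceptually, this fine-tuning uses a symmetric contrastive objective over matched video-sentence pairs, as in CLIP-style training. A minimal PyTorch sketch of one training step, with simple linear layers standing in for X-CLIP's actual pretrained encoders:

```python
# Sketch: symmetric contrastive (InfoNCE-style) fine-tuning step.
# The linear `video_encoder` and `text_encoder` are placeholders for
# X-CLIP's pretrained components; the real model is far more involved.
import torch
import torch.nn.functional as F

video_encoder = torch.nn.Linear(2048, 512)   # placeholder encoder
text_encoder = torch.nn.Linear(768, 512)     # placeholder encoder
params = list(video_encoder.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)

video_feats = torch.randn(8, 2048)   # a batch of video features
text_feats = torch.randn(8, 768)     # the matching Indonesian sentences

v = F.normalize(video_encoder(video_feats), dim=-1)
t = F.normalize(text_encoder(text_feats), dim=-1)
logits = v @ t.T / 0.07              # pairwise similarities, temperature 0.07

targets = torch.arange(8)            # the i-th video matches the i-th sentence
loss = (F.cross_entropy(logits, targets)
        + F.cross_entropy(logits.T, targets)) / 2
loss.backward()
optimizer.step()
```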

Video Captioning Task

We also addressed the video captioning task, where the goal is to generate a descriptive sentence for a given video. For this, we applied a model called VNS-GRU, which uses semantic features extracted by a pretrained SCD model; that SCD model was trained on the English version of the MSVD dataset.

Our experiments indicated that using the SCD model helped enhance the generated captions in terms of detail and relevance. Even without direct training on Indonesian data, the model managed to provide relevant and coherent sentences for the videos.
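VNS-GRU itself combines a GRU decoder with semantic features and other refinements; as a heavily simplified illustration of the general captioning setup (not the authors' architecture), a bare-bones GRU decoder might look like this:

```python
# Sketch: a bare-bones GRU caption decoder. VNS-GRU layers semantic
# features and further refinements on top of this basic pattern.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, video_dim=2048, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(video_dim, hidden)  # video feature -> initial state
        self.embed = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feat, tokens):
        h0 = torch.tanh(self.init_h(video_feat)).unsqueeze(0)
        x = self.embed(tokens)
        y, _ = self.gru(x, h0)
        return self.out(y)                          # next-token logits

decoder = CaptionDecoder()
video_feat = torch.randn(4, 2048)                   # batch of 4 video features
tokens = torch.randint(0, 10000, (4, 12))           # previous caption tokens
logits = decoder(video_feat, tokens)                # shape (4, 12, vocab_size)
```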

Experimental Results

We evaluated the performance of our models using various metrics to measure their effectiveness in retrieval and captioning tasks. In retrieval tasks, we looked at metrics like recall, which tracks how many relevant items were found in the top search results. For captioning tasks, we assessed how well the generated sentences matched the expected outputs using several standard metrics.
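Recall@K, the main retrieval metric, can be computed directly from a ranked similarity matrix. A sketch assuming query i's correct video sits at index i (the usual convention for paired test sets):

```python
# Sketch: Recall@K for text-to-video retrieval, assuming query i's
# ground-truth video has index i in the similarity matrix.
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j] = similarity between query i and video j
    ranks = np.argsort(-sim, axis=1)          # best-scoring videos first
    correct = np.arange(sim.shape[0])[:, None]
    hits = (ranks[:, :k] == correct).any(axis=1)
    return hits.mean()

sim = np.random.default_rng(0).normal(size=(100, 100))
print("R@1:", recall_at_k(sim, 1), "R@5:", recall_at_k(sim, 5))
```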

In our study, we found that the pretrained models helped improve the results across all tasks. However, certain configurations or settings were more successful than others. For example, using an optimal number of sample annotations during the training phase yielded better results than using a fixed number.

Future Directions

Our work leaves room for further exploration. There are several avenues researchers can take to enhance the current models and the dataset itself:

  1. Pretraining on Indonesian Data: Future research could focus on creating a large-scale Indonesian vision-language dataset for pretraining models, improving their performance further.

  2. Multilingual Capabilities: Developing models that can produce outputs in multiple languages for each video would be an exciting area to explore, especially since the current dataset has pairs of sentences in English and Indonesian.

  3. Addressing Noise: Investigating the effects of noise within our dataset and developing robust algorithms could lead to better performance and more reliable outputs.

Conclusion

The MSVD-Indonesian dataset represents a significant step forward in multimodal machine learning for the Indonesian language. By creating this dataset, we provide researchers with a valuable resource to develop and test new models for video-text tasks. Our results indicate that existing English-based models can also work effectively on our Indonesian dataset with some adjustments.

We hope that this work will inspire further research and innovation in the field of multimodal learning, leading to a better understanding of video and text relationships in languages beyond English.

Original Source

Title: MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian

Abstract: Multimodal learning on video and text data has been receiving growing attention from many researchers in various research tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Although many algorithms have been proposed for those challenging tasks, most of them are developed on English language datasets. Despite Indonesian being one of the most spoken languages in the world, the research progress on the multimodal video-text with Indonesian sentences is still under-explored, likely due to the absence of the public benchmark dataset. To address this issue, we construct the first public Indonesian video-text dataset by translating English sentences from the MSVD dataset to Indonesian sentences. Using our dataset, we then train neural network models which were developed for the English video-text dataset on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning. The recent neural network-based approaches to video-text tasks often utilized a feature extractor that is primarily pretrained on an English vision-language dataset. Since the availability of the pretraining resources with Indonesian sentences is relatively limited, the applicability of those approaches to our dataset is still questionable. To overcome the lack of pretraining resources, we apply cross-lingual transfer learning by utilizing the feature extractors pretrained on the English dataset, and we then fine-tune the models on our Indonesian dataset. Our experimental results show that this approach can help to improve the performance for the three tasks on all metrics. Finally, we discuss potential future works using our dataset, inspiring further research in the Indonesian multimodal video-text tasks. We believe that our dataset and our experimental results could provide valuable contributions to the community. Our dataset is available on GitHub.

Authors: Willy Fitra Hendria

Last Update: 2023-06-20

Language: English

Source URL: https://arxiv.org/abs/2306.11341

Source PDF: https://arxiv.org/pdf/2306.11341

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
