Dynamic Multi-Agent System for Video Question Answering
A new approach improves accuracy in answering video-based questions.
― 6 min read
Table of Contents
- The EgoSchema Challenge
- Our Contribution
- How Our Approach Works
- Stage 1: Dynamic Agent Creation
- Stage 2: Question Answering with Multiple Agents
- Results
- Accuracy of Our Approach
- Ablation Study
- Experiment 1: Multi-Agent vs. Single-Agent
- Experiment 2: Domain Experts vs. AI Assistants
- Experiment 3: Frame Number Variation
- Conclusion
- Original Source
Video Question Answering (VQA) is a task that involves answering questions based on video clips. The EgoSchema Challenge 2024 focuses on this by providing a dataset with over 5,000 questions related to various video clips. Each question has five answer choices, and the challenge is to find the best response.
We propose a new approach called Video Question Answering with Dynamically Generated Multi-Agents (VDMA). This approach uses a system of multiple agents that are created on the fly, with each one having specific expertise to answer questions accurately. This method works alongside existing systems and aims to improve the quality of responses.
The EgoSchema Challenge
EgoSchema is a dataset designed for long-form Video Question Answering tasks. It includes questions that cover various aspects such as the purpose of actions in the video, how tools are used, and identifying key actions. With this dataset, the challenge is to provide accurate and context-sensitive responses to the questions.
In recent years, different methods have been suggested to tackle these challenges. Some methods use image descriptions to generate answers, while others rely on systems that utilize agents to gather relevant information. Recent work with Large Language Models (LLMs) has also tried using debates among agents to improve answer quality. Our strategy builds on these previous studies by introducing a framework that consists of multiple expert agents tailored for the VQA task.
Our Contribution
Our contribution consists of two main parts:
- Multi-agent Framework: We propose a system that consists of two stages: dynamic agent creation and question answering with multiple expert agents.
- Performance Results: We tested our method and achieved an accuracy of 70.7% on the EgoSchema dataset. Our results show that using multiple agents is more effective than relying on a single agent.
How Our Approach Works
The VDMA system is built on two main stages.
Stage 1: Dynamic Agent Creation
In the first stage, we analyze the video content and the text of the question to identify the right experts who can provide insights. We generate prompts that outline what each expert agent should know to answer the question. This method allows for a tailored response based on the specific context of the video and question, improving accuracy.
Stage 2: Question Answering with Multiple Agents
In the second stage, we employ the agents we created in the first stage to answer the questions. Each expert agent uses the specific knowledge related to the question and video to form a response. There’s also an organizer agent that combines the input from all experts and decides on the final answer.
Each agent has access to two tools for analyzing the video and the question: one retrieves information from image captions, and the other performs deeper video analysis. Each agent selects the tool best suited to the question, interprets the video, proposes its best answer, and explains its reasoning.
The organizer then reviews the answers from all agents and consolidates them into one final answer.
Results
We evaluated our method using the EgoSchema dataset, which involves answering questions about a three-minute video clip. Each question has five possible answers to choose from. Our model picks the answer that best matches the question.
To improve accuracy further, we used an ensemble of five different models, including our main approach. The ensemble gathers a vote from each model and takes the majority as the final answer. Although this voting scheme is simple, it improved accuracy noticeably.
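The ensemble step reduces to a majority vote over the five models' predictions. The model names below are placeholders, not the actual ensemble members:

```python
from collections import Counter

def ensemble_vote(predictions: dict[str, int]) -> int:
    """Majority vote over model predictions; keys are model names,
    values are predicted choice indices (0-4)."""
    counts = Counter(predictions.values())
    # most_common breaks ties by first occurrence, so earlier models win ties.
    return counts.most_common(1)[0][0]

preds = {"vdma": 1, "model_a": 1, "model_b": 3, "model_c": 1, "model_d": 0}
print(ensemble_vote(preds))  # → 1
```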
Accuracy of Our Approach
When we compared our multi-agent system against other methods, it achieved higher accuracy: configurations that used multiple expert agents consistently outperformed single-agent baselines.
In our tests, we found that having three expert agents provided better results than just two. However, when we instructed the organizer to provide shorter responses when uncertain, we saw a slight decrease in accuracy.
After applying the ensemble method, we achieved an overall accuracy rate of 70.7%, which was higher than any individual model.
Ablation Study
To further assess our method's effectiveness, we conducted an ablation study. This meant we tested different parts of our approach to see how they impacted overall performance. We focused on three aspects:
- Comparing the performance of our multi-agent system with a single-agent method.
- Evaluating the role of dynamically created domain experts in our process.
- Examining the effect of using different numbers of video frames during analysis.
Experiment 1: Multi-Agent vs. Single-Agent
We compared the multi-agent system against a single-agent approach. The results showed that our multi-agent method performed slightly better, with an accuracy of 73.2% compared to 72.8% for the single agent.
The advantage of having multiple agents is that it brings different viewpoints and specialist knowledge, which can help clarify tough questions and improve final answers.
Experiment 2: Domain Experts vs. AI Assistants
Next, we looked at how well our dynamically generated experts performed compared to using general AI assistants for all agents. Our findings indicated that using specialized experts achieved better accuracy (73.2%) than the uniform AI assistants (72.6%).
Having experts who could focus on specific questions allowed for more accurate and relevant responses, showing the benefit of expert knowledge.
Experiment 3: Frame Number Variation
In the last study, we tested how changing the number of video frames used for analysis affected performance. We compared using 18 frames with 90 frames. Generally, using more frames improved performance, especially in analyzing action sequences.
However, analyzing more frames also made character interactions harder to assess since they make up a smaller part of the video. This indicates the need for careful selection of frames to focus on the most relevant segments, which could lead to better analysis outcomes.
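The 18- versus 90-frame comparison presumably rests on sampling frames evenly across the clip. The sampling scheme and the 30 fps assumption below are ours, not stated in the paper:

```python
def sample_frame_indices(total_frames: int, n: int) -> list[int]:
    """Return n evenly spaced frame indices covering the whole clip."""
    if n >= total_frames:
        return list(range(total_frames))
    step = total_frames / n
    # Center each sample within its window so coverage is uniform.
    return [int(i * step + step / 2) for i in range(n)]

clip_frames = 3 * 60 * 30  # a three-minute clip at an assumed 30 fps = 5400 frames
print(len(sample_frame_indices(clip_frames, 18)))  # → 18
```

A smarter selector would concentrate frames on question-relevant segments, which is exactly the direction the frame-count tradeoff above suggests.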
Conclusion
In this article, we introduced VDMA for long-form Video Question Answering. Our method achieved 70.7% accuracy on the EgoSchema dataset, showing that a system of dynamically generated expert agents answers questions more effectively than a single agent by drawing on multiple areas of expertise.
Our approach relies on multiple stages and agents, which does increase the computational cost compared to single-agent systems. However, the increase in accuracy is a substantial advantage. Recent developments in LLMs have also made concerns about the computational performance of such systems less pressing.
In future work, it might be beneficial to allow agents to debate until a consensus is reached, which could further enhance the accuracy of responses. The choice of tools used by agents plays a critical role in performance, and ongoing improvements in these tools could lead to even better results.
Title: VDMA: Video Question Answering with Dynamically Generated Multi-Agents
Abstract: This technical report provides a detailed description of our approach to the EgoSchema Challenge 2024. The EgoSchema Challenge aims to identify the most appropriate responses to questions regarding a given video clip. In this paper, we propose Video Question Answering with Dynamically Generated Multi-Agents (VDMA). This method is a complementary approach to existing response generation systems by employing a multi-agent system with dynamically generated expert agents. This method aims to provide the most accurate and contextually appropriate responses. This report details the stages of our approach, the tools employed, and the results of our experiments.
Authors: Noriyuki Kugo, Tatsuya Ishibashi, Kosuke Ono, Yuji Sato
Last Update: 2024-07-03 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2407.03610
Source PDF: https://arxiv.org/pdf/2407.03610
Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.