Improving Action Recognition in Drone Videos
This article presents a method to enhance action recognition in UAV videos.
Action recognition in videos captured by drones (UAVs) poses unique challenges: the camera sits at a high altitude and its viewpoint changes constantly, making it difficult to capture the full details of human actions. In this article, we discuss a new method developed to improve action recognition in drone videos, focusing on how to identify and align the important action features over time.
The Challenge with UAV Videos
Drones capture videos from a high viewpoint, which makes people appear very small relative to the background. As a result, only a tiny portion of the video frame, often less than 10%, shows the actual action being performed. The angle and height of the camera can also change drastically from frame to frame, so action recognition models may end up paying more attention to background changes than to the actual movements of the people.
Additionally, collecting and labeling videos from drones is more difficult than doing so for ground cameras. There are fewer datasets available for training models, which makes it harder to improve their performance. The varying camera angles and altitudes also create a diverse range of video conditions that complicate the recognition process.
Our Approach to Action Recognition
The method we propose focuses on two main ideas: aligning important features of a person's actions over time and selecting the most informative frames from the video. By using these techniques together, our model can better learn from the movements of individuals and provide more accurate action recognition.
Feature Alignment
To tackle the challenge of recognizing actions in UAV videos, the first step is to locate the person in the video and crop the relevant region. We then align the features that correspond to the person's motion across frames, which ensures that the recognition model learns from the movement of the person rather than from background distractions.
The alignment process accounts for changes in the person's position between frames: by maximizing the mutual information shared between the corresponding regions of consecutive frames, we improve the chances of accurately identifying the action being performed.
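To make this concrete, here is a minimal sketch, not the paper's actual implementation, of aligning a cropped person region between consecutive frames by searching over small spatial shifts and keeping the crop that maximizes a histogram-based mutual information estimate. The helper names (`mutual_information`, `align_crop`), the bin count, and the search window are illustrative assumptions; frames are treated as 2D grayscale arrays.

```python
import numpy as np

def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 32) -> float:
    """Histogram-based MI estimate (in nats) between two equally sized grayscale patches."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def align_crop(ref_crop: np.ndarray, next_frame: np.ndarray,
               top: int, left: int, search: int = 8) -> tuple[int, int]:
    """Return the (row, col) position in next_frame whose crop maximizes MI with ref_crop."""
    h, w = ref_crop.shape
    best, best_pos = -np.inf, (top, left)
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            r, c = top + dr, left + dc
            if r < 0 or c < 0 or r + h > next_frame.shape[0] or c + w > next_frame.shape[1]:
                continue                  # skip candidates that fall outside the frame
            mi = mutual_information(ref_crop, next_frame[r:r + h, c:c + w])
            if mi > best:
                best, best_pos = mi, (r, c)
    return best_pos
```

The exhaustive shift search is only meant to illustrate the objective; the choice of window size and binning would need tuning in practice.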
Frame Sampling
In addition to aligning the features, we also developed a new way to select frames for training. Not all frames in a video are equally useful: some contain redundant information, while others do not show the person clearly. Our method uses joint mutual information to select the frames that show the most significant and distinct changes in the person's actions.
By focusing on these more informative frames, we reduce noise and improve the performance of the recognition model, enhancing its ability to learn from the details of the actions being performed.
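The sketch below shows one simple greedy way such a selection could work, assuming a pairwise mutual information estimate like the one above. The paper uses joint mutual information over the sampled sequence; here, as an approximation, a frame counts as informative when it is least redundant with the frames already selected. The function `sample_frames` and its parameters are hypothetical.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based MI estimate between two equally sized grayscale frames."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def sample_frames(frames: list[np.ndarray], k: int) -> list[int]:
    """Greedily pick up to k frame indices that add the least redundant content."""
    if not frames:
        return []
    k = min(k, len(frames))
    selected = [0]                        # always keep the first frame as an anchor
    while len(selected) < k:
        candidates = [i for i in range(len(frames)) if i not in selected]
        # Redundancy of a candidate = its highest MI with any frame already chosen.
        redundancy = {i: max(mutual_information(frames[i], frames[j]) for j in selected)
                      for i in candidates}
        selected.append(min(redundancy, key=redundancy.get))   # least redundant wins
    return sorted(selected)
```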
Results of Our Method
After integrating our approach with the X3D recognition backbone, we tested it on three publicly available UAV video datasets. The results showed consistent improvements in accuracy, with our technique outperforming state-of-the-art methods on each benchmark.
Performance on Datasets
UAV-Human Dataset: This dataset is known for its challenging scenarios, with various actions performed in different environments. Our method improved Top-1 accuracy by 18.9% over previous state-of-the-art techniques.
Drone-Action Dataset: On this dataset, our approach outperformed previous methods, improving accuracy by 7.3% in recognizing different human actions.
NEC Drones Dataset: Our method also excelled on this dataset, which contains videos captured indoors, surpassing current leading methods by 7.16% in accuracy.
Understanding the Components
Temporal Feature Alignment
The goal of temporal feature alignment is to match the key features of the person's action across the frames of a video. This lets the model recognize the action being performed despite distractions from the changing background. Rather than relying on complex skeletal data, our approach works directly with pixel-level features, which provide a clearer representation of the action.
Mutual Information
Mutual information is a central concept in our approach. It measures how much knowing one variable reduces uncertainty about another. In the context of our method, mutual information quantifies how strongly two image regions are related. We use it both to align the person's region across frames and to decide which frames carry the most relevant information.
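As a toy illustration of this intuition, the snippet below, reusing the histogram-based estimate sketched earlier, compares a structured patch against an identical copy, an intensity-shifted copy, and unrelated noise; the exact numbers depend on binning and are purely illustrative.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based MI estimate between two equally sized patches."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

x = np.linspace(0, 4 * np.pi, 64)
patch = np.sin(x)[None, :] * np.cos(x)[:, None]       # a structured "image"
brightened = 0.5 * patch + 0.2                        # same structure, different intensities
noise = np.random.default_rng(0).random((64, 64))     # unrelated content

print(mutual_information(patch, patch))       # highest: identical patches
print(mutual_information(patch, brightened))  # still high: structure is shared
print(mutual_information(patch, noise))       # close to zero: nothing shared
```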
Conclusion
In conclusion, our proposed approach for action recognition in UAV videos successfully addresses several challenges presented by high-altitude and dynamic filming conditions. By aligning features across frames and carefully selecting informative frames, we have improved the accuracy of action recognition.
Our work opens up new possibilities for enhanced understanding of human actions in aerial videos and paves the way for future improvements, including the potential to handle more complex scenarios involving multiple people or actions. By continuing to refine these methods, we can further advance the field of action recognition in drone-captured videos.
Title: MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition
Abstract: We present a novel approach for action recognition in UAV videos. Our formulation is designed to handle occlusion and viewpoint changes caused by the movement of a UAV. We use the concept of mutual information to compute and align the regions corresponding to human action or motion in the temporal domain. This enables our recognition model to learn from the key features associated with the motion. We also propose a novel frame sampling method that uses joint mutual information to acquire the most informative frame sequence in UAV videos. We have integrated our approach with X3D and evaluated the performance on multiple datasets. In practice, we achieve 18.9% improvement in Top-1 accuracy over current state-of-the-art methods on UAV-Human (Li et al., 2021), 7.3% improvement on Drone-Action (Perera et al., 2019), and 7.16% improvement on NEC Drones (Choi et al., 2020).
Authors: Ruiqi Xian, Xijun Wang, Dinesh Manocha
Last Update: 2023-11-15
Language: English
Source URL: https://arxiv.org/abs/2303.02575
Source PDF: https://arxiv.org/pdf/2303.02575
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.