Detecting Human Interactions in Video
A new method for analyzing interactions between people in various settings.
Detecting interactions between people in videos matters for public security and for understanding social behavior, especially in busy places like parks, schools, and public squares. Traditional methods usually rely on staged videos with rehearsed actions, which transfer poorly to real-life scenes where several groups of people interact at the same time.
To address this issue, we introduce a new task called Human-to-Human Interaction Detection (HID). It requires detecting people, identifying what each person is doing, and grouping people according to how they interact with one another, all within a single model.
The AVA-Interaction Dataset
To support this task, we built a new dataset called AVA-Interaction (AVA-I). It builds on the existing AVA dataset, a large video collection created for action detection. We extended AVA with frame-by-frame annotations of how people interact with each other, giving a total of 85,254 frames and 86,338 interaction groups.
The interactions in this dataset include both normal actions like handshakes and hugs, as well as abnormal actions like fighting and chasing. Each frame can show up to four groups of people interacting at the same time. This level of detail makes AVA-I a strong resource for studying how people interact in various situations.
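As a rough illustration of what a per-frame annotation of this kind has to carry, the sketch below models one frame as a list of people (box plus action labels) and a list of interaction groups. The class names, field names, box format, and the example labels are all assumptions made for illustration; they are not the actual AVA-I file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Person:
    person_id: int
    # Hypothetical box format: (x1, y1, x2, y2) in pixels.
    box: Tuple[float, float, float, float]
    actions: List[str] = field(default_factory=list)


@dataclass
class FrameAnnotation:
    frame_id: str
    people: List[Person]
    # Each group is a list of person_ids that interact with each other.
    groups: List[List[int]]

    def validate(self, max_groups: int = 4) -> None:
        """Check the group count and that groups only reference known people."""
        if len(self.groups) > max_groups:
            raise ValueError(f"{len(self.groups)} groups exceeds {max_groups}")
        known = {p.person_id for p in self.people}
        for group in self.groups:
            unknown = set(group) - known
            if unknown:
                raise ValueError(f"group references unknown ids: {sorted(unknown)}")

    def ungrouped_ids(self) -> List[int]:
        """Return ids of people who belong to no interaction group."""
        grouped = {pid for group in self.groups for pid in group}
        return sorted(p.person_id for p in self.people if p.person_id not in grouped)


# Example frame: two people shaking hands, one bystander.
frame = FrameAnnotation(
    frame_id="example_0001",  # hypothetical identifier
    people=[
        Person(0, (10, 20, 110, 300), ["handshake"]),
        Person(1, (120, 25, 220, 310), ["handshake"]),
        Person(2, (400, 30, 480, 290), ["stand"]),
    ],
    groups=[[0, 1]],
)
frame.validate()
print(frame.ungrouped_ids())
```

The default limit of four groups mirrors the per-frame maximum reported for the dataset; everything else about the layout is a design choice of this sketch.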
Why HID is Important
Understanding human interactions from video feeds is critical for several reasons. One key reason is for security purposes. Systems need to identify if something suspicious is happening, like a fight or theft, quickly and accurately.
Current methods often simplify the task by classifying whole images or videos, without considering that multiple interactions can happen at once. They fail to answer important questions about who is involved in each interaction, what actions they are performing, and how they relate to each other.
Some newer methods try to solve this issue by first detecting people and then analyzing their interactions, but this two-step process often leads to less accurate results, especially in crowded situations. Additionally, existing datasets used for training these methods are typically limited and focus on simple, staged interactions.
Given these challenges, we propose HID as a new task, along with AVA-I, to provide a more realistic benchmark for future research in this area.
The SaMFormer Approach
To address HID, we developed a baseline model called SaMFormer. It combines a visual feature extractor with a Transformer-based decoder to detect people, recognize their actions, and determine how they group together during interactions.
SaMFormer consists of three main parts: a feature extractor, a split stage, and a merging stage.
Feature Extractor: The feature extractor processes the video frames to create a detailed representation of the motion and interactions taking place. This provides the necessary context for the model to analyze what is happening in each frame.
Split Stage: In the split stage, we use two separate sets of queries to predict people and interaction groups. This allows us to capture individual actions while also recognizing how different people are grouped based on their interactions.
Merging Stage: Finally, the merging stage combines the information from the previous stages to clarify how individuals relate to each other within groups. This helps to better differentiate between different types of interactions.
All SaMFormer components are trained jointly, end to end, on AVA-I, so a single model can detect and analyze interactions among multiple people in various situations.
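The idea behind the merging stage, linking each predicted person to one of the predicted groups, can be sketched as a simple similarity lookup. This is only a toy stand-in under stated assumptions: the real model learns this association, whereas the sketch below uses plain dot products between hypothetical person and group embeddings, plus a hypothetical `no_group_score` threshold for people who belong to no group.

```python
from typing import Dict, List, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def merge(
    person_embs: List[Sequence[float]],
    group_embs: List[Sequence[float]],
    no_group_score: float = 0.5,  # hypothetical score for "belongs to no group"
) -> List[Optional[int]]:
    """Assign each person to the most similar group, or None if no group wins."""
    assignments: List[Optional[int]] = []
    for person in person_embs:
        best_group: Optional[int] = None
        best_score = no_group_score
        for g, group in enumerate(group_embs):
            score = dot(person, group)
            if score > best_score:
                best_group, best_score = g, score
        assignments.append(best_group)
    return assignments


def members_by_group(assignments: List[Optional[int]]) -> Dict[int, List[int]]:
    """Invert person -> group assignments into group -> member list."""
    groups: Dict[int, List[int]] = {}
    for person_idx, g in enumerate(assignments):
        if g is not None:
            groups.setdefault(g, []).append(person_idx)
    return groups


# Toy embeddings: persons 0 and 1 resemble group 0, person 2 resembles group 1,
# person 3 resembles neither.
person_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.05, 0.05]]
group_embs = [[2.0, 0.0], [0.0, 2.0]]

assignments = merge(person_embs, group_embs)
print(assignments)
print(members_by_group(assignments))
```

Keeping separate person and group representations and joining them only at the end mirrors the split-then-merge structure described above, but the scoring rule here is an assumption of the sketch.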
Evaluation Metrics
To measure the effectiveness of our model, we use several metrics. For assessing individual actions and detecting people, we apply mean average precision. For measuring how well we group people together based on their interactions, we use a new metric called group average precision.
These metrics help us understand how well our model performs in real-world situations where multiple interactions occur simultaneously.
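To make the two metrics more concrete, here is a minimal sketch of average precision computed over a ranked list of predictions, together with one plausible way to decide whether a predicted group counts as correct. The matching rule (intersection-over-union of member sets with a 0.5 threshold, greedy by score) and the non-interpolated form of average precision are assumptions for illustration; the paper's exact protocol may differ.

```python
from typing import List, Set, Tuple


def average_precision(scored_hits: List[Tuple[float, bool]], num_gt: int) -> float:
    """Non-interpolated AP: mean of the precision at each true-positive rank."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_gt


def group_iou(pred: Set[int], gt: Set[int]) -> float:
    """Intersection-over-union of two sets of person ids."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


def match_groups(
    preds: List[Tuple[float, Set[int]]],
    gts: List[Set[int]],
    iou_threshold: float = 0.5,  # assumed threshold
) -> List[Tuple[float, bool]]:
    """Greedily match predictions (highest score first) to unused ground-truth groups."""
    used: Set[int] = set()
    hits: List[Tuple[float, bool]] = []
    for score, members in sorted(preds, key=lambda item: item[0], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in used:
                continue
            iou = group_iou(members, gt)
            if iou > best_iou:
                best_j, best_iou = j, iou
        is_hit = best_j >= 0 and best_iou >= iou_threshold
        if is_hit:
            used.add(best_j)
        hits.append((score, is_hit))
    return hits


# Two ground-truth groups; three predictions, one of them spurious.
gt_groups = [{0, 1}, {2, 3, 4}]
pred_groups = [(0.9, {0, 1}), (0.8, {5, 6}), (0.6, {2, 3})]

hits = match_groups(pred_groups, gt_groups)
group_ap = average_precision(hits, num_gt=len(gt_groups))
print(hits, group_ap)
```

The same `average_precision` function applies to per-person action detection once each detection has been marked as a hit or a miss; averaging the result over action classes gives a mean average precision.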
Results and Findings
Our experiments show that SaMFormer significantly outperforms existing methods for detecting human interactions in videos. By using the new AVA-I dataset, SaMFormer consistently demonstrates better accuracy in identifying individual actions and group interactions compared to previous approaches.
We also found that combining spatial and semantic information is crucial for accurately predicting interactions. This means that not only the position of individuals but also the context of their actions plays a significant role in understanding how they relate to each other.
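One simple way to picture "spatial plus semantic" evidence is a pairwise score that mixes how close two people are with how similar their feature vectors are. The sketch below is purely illustrative and is not how SaMFormer combines the two cues: the exponential distance falloff, the `scale` and `alpha` parameters, and the feature vectors are all assumptions.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def spatial_affinity(a: Box, b: Box, scale: float) -> float:
    """Map center distance to (0, 1]: 1 for identical centers, decaying with distance."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.exp(-math.hypot(ax - bx, ay - by) / scale)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def pair_affinity(
    box_a: Box,
    feat_a: Sequence[float],
    box_b: Box,
    feat_b: Sequence[float],
    scale: float = 100.0,  # hypothetical distance scale in pixels
    alpha: float = 0.5,    # hypothetical weight on the spatial term
) -> float:
    """Weighted mix of spatial proximity and feature similarity."""
    spatial = spatial_affinity(box_a, box_b, scale)
    semantic = cosine(feat_a, feat_b)
    return alpha * spatial + (1.0 - alpha) * semantic


# Nearby people with similar features vs. distant people with unrelated features.
near = pair_affinity((0, 0, 10, 10), [1.0, 0.0], (10, 0, 20, 10), [1.0, 0.1])
far = pair_affinity((0, 0, 10, 10), [1.0, 0.0], (500, 0, 510, 10), [0.0, 1.0])
print(near, far)
```

A score like this shows why position alone is not enough: two people standing side by side but doing unrelated things would score high on the spatial term and low on the semantic one.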
Through extensive testing, we found that SaMFormer is particularly effective in crowded environments where people might be interacting in complex ways. However, there were still cases where the model struggled, especially in situations with heavy occlusions or unclear interactions.
Related Work
To provide context for our work, it's important to mention closely related tasks in the field. Action detection, for instance, aims to locate human actions in videos, but it often ignores the interactive relationships between individuals.
Human interaction understanding focuses on identifying actions and pairs of interactions but typically requires that bounding boxes for people be detected beforehand. In contrast, HID considers both individual actions and how people work together within groups.
Social relation recognition deals with identifying the social dynamics present in images, but again, it doesn't offer the detailed understanding of interactions that HID aims to achieve.
The Need for New Datasets
One of the major challenges in developing HID techniques has been the availability of suitable datasets. Existing datasets are often small and focus on simple, choreographed interactions. They lack the complexity and realism found in everyday life, which makes training effective models difficult.
By creating AVA-I, we hope to provide a comprehensive resource that includes a wide variety of complex interactions in real-world settings. This will be essential for training and evaluating future models aimed at detecting and understanding human interactions in videos.
Training and Implementation
For our training process, we followed best practices in the field. We used popular optimization techniques and carefully selected training sets to ensure our model learns efficiently. Throughout the training, we monitored performance and made adjustments to maximize accuracy.
SaMFormer was designed to be as efficient as possible while still delivering high-quality results. This involved balancing different components of the model to achieve the best combination of speed and accuracy.
Qualitative Analysis
To illustrate the effectiveness of our approach, we conducted a qualitative analysis comparing SaMFormer to other models. In various scenarios, SaMFormer demonstrated a superior ability to accurately recognize interactions among multiple individuals, while other models often failed to do so.
In cases where occlusions occurred or interactions were particularly complex, we performed detailed examinations to understand how well each model handled these challenges. While SaMFormer performed well in many situations, there were instances where it misidentified groupings due to overlapping individuals or unclear cues.
Future Directions
Looking ahead, the introduction of HID as a new task offers many exciting possibilities for future research. We believe that using AVA-I as a benchmark will encourage further advancements in understanding human interactions in various contexts.
Future work could focus on refining models like SaMFormer, improving their ability to handle occluded interactions, and examining how these techniques can be applied in real-world security and behavioral analysis scenarios.
Moreover, expanding the AVA-I dataset to include even more diverse interactions and complex situations will be crucial for ongoing development in this field. As more researchers explore HID, we anticipate great strides in how we understand and analyze human behavior in video content.
Conclusion
Human-to-Human Interaction Detection is an essential task with valuable applications in security and social analysis. By developing the AVA-I dataset and the SaMFormer model, we have taken significant steps toward enhancing how we detect and interpret interactions in videos.
Our findings show the importance of both spatial and contextual information in accurately predicting interactions, and we look forward to seeing how this work influences future research and applications in the field. Continued refinement of methods and datasets should further improve how human interactions in video are detected and interpreted.
Title: Human-to-Human Interaction Detection
Abstract: A comprehensive understanding of interested human-to-human interactions in video streams, such as queuing, handshaking, fighting and chasing, is of immense importance to the surveillance of public security in regions like campuses, squares and parks. Different from conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human-to-human interaction detection (HID). HID devotes to detecting subjects, recognizing person-wise actions, and grouping people according to their interactive relations, in one model. First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I), by adding annotations on interactive relations in a frame-by-frame manner. AVA-I consists of 85,254 frames and 86,338 interactive groups, and each image includes up to 4 concurrent interactive groups. Second, we present a novel baseline approach SaMFormer for HID, containing a visual feature extractor, a split stage which leverages a Transformer-based model to decode action instances and interactive groups, and a merging stage which reconstructs the relationship between instances and groups. All SaMFormer components are jointly trained in an end-to-end manner. Extensive experiments on AVA-I validate the superiority of SaMFormer over representative methods. The dataset and code will be made public to encourage more follow-up studies.
Authors: Zhenhua Wang, Kaining Ying, Jiajun Meng, Jifeng Ning
Last Update: 2023-08-11 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2307.00464
Source PDF: https://arxiv.org/pdf/2307.00464
Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.