Improving Video Analysis for Social Interactions
A new method enhances the analysis of social interactions in egocentric videos.
Table of Contents
- The Challenge
- Two Models Approach
- Filtering Input Data
- Initial Model
- Lack of Bounding Box Labels
- Separate Models for Audio and Video
- Enhanced Audio Processing
- Focus on Visual Quality
- Quality-Aware Fusion
- Experimental Setup
- Results
- Comparison with Previous Models
- Data Augmentation Techniques
- Importance of Quality Scores
- Moving Average Post-Processing
- Performance Gaps
- Conclusion
- Original Source
In recent years, understanding social interactions in videos has become increasingly important, especially for virtual assistants and robots. This article discusses a new approach to analyzing videos in which people are talking, focusing on how to combine audio and visual information effectively.
The Challenge
The task is to identify social interactions in videos recorded from a person’s own perspective, known as egocentric videos. Given a video clip, the goal is to determine whether someone in the video is speaking to the person wearing the camera. The data comes from the Ego4D Talking to Me (TTM) challenge, which provides video clips together with their audio. The difficulty lies in processing this information accurately even when some labels are missing.
Two Models Approach
To tackle this task, we decided to use two separate models: one for processing the video frames and another for handling the audio. This way, we can make the most of the available training data, even the parts that do not have specific labels for visual elements. By analyzing video and audio separately, we can avoid potential problems that arise when combining them too early in the process.
Filtering Input Data
A crucial element of our approach is the quality of the input data. To filter out low-quality visual inputs, we use a score derived from a model that predicts facial landmarks. This score helps us assess how clear and usable the video frames are for training. By focusing on higher-quality images, we can improve the overall performance of our model.
Initial Model
Our first attempt was an approach called AV-joint, in which we combined the audio and video features right after extracting them. This model used powerful networks to analyze both types of data. However, it did not outperform the baseline model. This prompted a deeper investigation into why combining the data so early was causing problems.
Lack of Bounding Box Labels
We discovered that a significant amount of training data lacked bounding box labels, which are necessary for identifying where people are located in the frame. This absence complicated our initial method, as it relied on having complete information. While we attempted to address this by filling in the gaps with zeros, this approach did not yield the best results.
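The zero-filling workaround described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `face_or_zeros`, the feature length, and the returned `has_face` flag are hypothetical and not taken from the original report.

```python
from typing import List, Optional, Tuple

FEATURE_SIZE = 4  # hypothetical length of a face feature vector


def face_or_zeros(
    face: Optional[List[float]], size: int = FEATURE_SIZE
) -> Tuple[List[float], bool]:
    """Return the face features, or an all-zero vector when no box was labeled.

    The boolean tells downstream code whether the features are real, which a
    plain zero vector cannot express on its own.
    """
    if face is None:
        return [0.0] * size, False
    return list(face), True
```

A zero vector carries no signal about the speaker, which is one plausible reason this workaround did not give the best results.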
Separate Models for Audio and Video
As we continued to experiment, we found that an audio-only model gave better results than our combined model. This realization led us to process audio and visual information separately. By treating the audio data independently, we could train on every labeled sample, including those without bounding boxes, and our performance improved.
Enhanced Audio Processing
For the audio model, we used a strong speech recognition system, which lets the model draw on spoken-language information. The audio model first transforms each clip into a Mel spectrogram, a time-frequency representation of sound that can be viewed as an image. This enables the model to capture the essential features of the sound for analysis.
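To give a feel for the Mel scale behind a Mel spectrogram, here is a minimal sketch using the widely used conversion formula mel = 2595 * log10(1 + f / 700). The helper `mel_band_centers` and its parameters are illustrative assumptions; the original report does not describe its spectrogram settings here.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to the Mel scale (common 2595/700 formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_bands bands spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]
```

Bands that are evenly spaced in Mel become wider in Hz at higher frequencies, which roughly mirrors how human hearing resolves pitch.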
Focus on Visual Quality
On the visual side, the quality of the video frames is essential. The facial landmark model produces a score reflecting how likely it is that a face is visible in a given frame. We average these scores across multiple frames to decide whether a sample is suitable for training. If the average quality score falls below a threshold, we discard the sample to keep the training set clean.
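The filtering step described above can be sketched as follows. The threshold value, the function names, and the choice to treat a clip with no scores as quality 0 are assumptions for illustration, not values from the original report.

```python
from typing import Dict, List


def mean_quality(frame_scores: List[float]) -> float:
    """Average per-frame face quality; a clip with no scores counts as 0.0."""
    if not frame_scores:
        return 0.0
    return sum(frame_scores) / len(frame_scores)


def filter_clips(clips: Dict[str, List[float]], threshold: float = 0.5) -> List[str]:
    """Keep the ids of clips whose average quality reaches the threshold."""
    return [
        clip_id
        for clip_id, scores in clips.items()
        if mean_quality(scores) >= threshold
    ]
```

For example, a clip with frame scores `[0.9, 0.8]` would be kept at a threshold of 0.5, while one with `[0.1, 0.3]` would be dropped.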
Quality-Aware Fusion
To combine the results from the audio and video models effectively, we introduced a fusion module. This part of the model considers the quality of the visual data when merging predictions from both branches. By applying a weighted system based on the quality scores, we can make more informed decisions with our final predictions.
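One simple way to picture such a weighting is a convex combination in which the visual prediction counts more when the face quality is high. This is a hedged sketch only: the text above says the fusion is weighted by quality scores but does not give the formula, so the rule below and the name `quality_aware_fuse` are assumptions.

```python
def quality_aware_fuse(audio_prob: float, visual_prob: float, quality: float) -> float:
    """Blend two branch predictions, trusting the visual one in proportion to quality.

    Assumed rule (not from the original report):
        fused = quality * visual_prob + (1 - quality) * audio_prob
    """
    q = min(max(quality, 0.0), 1.0)  # clamp quality into [0, 1]
    return q * visual_prob + (1.0 - q) * audio_prob
```

With this rule, a frame with no usable face (quality 0) falls back entirely on the audio branch, which matches the intuition that poor visual input should not be allowed to hurt the final prediction.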
Experimental Setup
We tested our different model configurations on validation and test data to determine which settings yielded the best performance. The results highlighted the benefits of separating audio and visual processing and using quality filtering effectively.
Results
Our final model, QuAVF, performed strongly on both the validation and test sets, reaching 67.4% mean average precision (mAP) on the test set and ranking first on the challenge leaderboard. Separating the audio and visual branches proved beneficial, as it allowed each model to specialize in its area without negatively affecting the other. The quality-aware fusion provided a significant boost to the final results.
Comparison with Previous Models
Compared with previous approaches, our QuAVF model outperformed the baseline method by a large margin in mAP. This improvement indicates that quality filtering and independent processing of the two modalities are effective strategies for this task.
Data Augmentation Techniques
For the audio branch, we experimented with various techniques to improve data diversity. One of these methods involved adding noise to the audio, but it did not significantly improve performance. Instead, we found that randomly cropping the audio clips consistently enhanced results across different settings.
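Random cropping of an audio clip can be sketched as below. The crop length, the function name `random_crop`, and the choice to return short clips unchanged are illustrative assumptions rather than details from the original report.

```python
import random
from typing import List, Optional, Sequence


def random_crop(
    samples: Sequence[float], crop_len: int, rng: Optional[random.Random] = None
) -> List[float]:
    """Return a contiguous window of crop_len samples at a random offset.

    Clips that are already no longer than crop_len are returned unchanged.
    """
    if rng is None:
        rng = random.Random()
    if crop_len >= len(samples):
        return list(samples)
    start = rng.randrange(len(samples) - crop_len + 1)
    return list(samples[start:start + crop_len])
```

Because a different window is drawn each time a clip is loaded, the model sees slightly different versions of the same utterance across epochs.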
Importance of Quality Scores
The facial quality scores were particularly valuable in filtering the visual data. By quantizing these scores and incorporating them as features in our model, we saw significant gains in performance. This shows how crucial good quality data is for training effective models.
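Quantizing a quality score into a discrete feature might look like the sketch below. The number of bins and the one-hot encoding are assumptions chosen for illustration; the text above does not state how the scores were quantized.

```python
from typing import List


def quantize(score: float, n_bins: int = 10) -> int:
    """Map a score in [0, 1] to a bin index in 0..n_bins-1 (assumed bin count)."""
    score = min(max(score, 0.0), 1.0)
    return min(int(score * n_bins), n_bins - 1)


def one_hot(index: int, n_bins: int = 10) -> List[float]:
    """Encode a bin index as a one-hot feature vector."""
    return [1.0 if i == index else 0.0 for i in range(n_bins)]
```

A model can then take `one_hot(quantize(score))` as an extra input alongside its other features.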
Moving Average Post-Processing
In our experiments, we also used a technique called moving average post-processing. This method helps to smooth out predictions by averaging several results over a set window size. This step provided a consistent improvement to our results.
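A minimal sketch of moving average smoothing over a sequence of per-frame predictions is shown below. The centered window, the default window size, and the shrinking of the window at the sequence edges are assumptions; the text above only says that results are averaged over a set window size.

```python
from typing import List


def moving_average(scores: List[float], window: int = 5) -> List[float]:
    """Smooth scores with a centered window, shrinking it at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed
```

Smoothing of this kind suppresses isolated spikes, which fits the task because a person rarely starts and stops talking to the camera wearer within a single frame.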
Performance Gaps
Despite achieving high performance on the validation data, we noticed a gap when the model was evaluated on unseen test data. This gap suggests that while our model works well on known data, it may not generalize perfectly to different contexts. Future work will be needed to identify and address the causes.
Conclusion
Our approach to identifying social interactions in videos uses separate models for audio and video data and pays close attention to the quality of the visual input. This method has produced effective results on egocentric videos, showing promise for applications in virtual assistants and social robots. The techniques we developed, particularly quality-aware fusion, hold potential for further improvements in this area of research. As technology continues to evolve, refining these methods will be crucial for advancing the way we understand and analyze social interactions through video.
Original Source
Title: QuAVF: Quality-aware Audio-Visual Fusion for Ego4D Talking to Me Challenge
Abstract: This technical report describes our QuAVF@NTU-NVIDIA submission to the Ego4D Talking to Me (TTM) Challenge 2023. Based on the observation from the TTM task and the provided dataset, we propose to use two separate models to process the input videos and audio. By doing so, we can utilize all the labeled training data, including those without bounding box labels. Furthermore, we leverage the face quality score from a facial landmark prediction model for filtering noisy face input data. The face quality score is also employed in our proposed quality-aware fusion for integrating the results from two branches. With the simple architecture design, our model achieves 67.4% mean average precision (mAP) on the test set, which ranks first on the leaderboard and outperforms the baseline method by a large margin. Code is available at: https://github.com/hsi-che-lin/Ego4D-QuAVF-TTM-CVPR23
Authors: Hsi-Che Lin, Chien-Yi Wang, Min-Hung Chen, Szu-Wei Fu, Yu-Chiang Frank Wang
Last Update: 2023-06-30 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2306.17404
Source PDF: https://arxiv.org/pdf/2306.17404
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.