Simple Science

Cutting edge science explained simply


Advancements in Human Action Recognition Using IMUs

A method that combines visual and IMU data for better action recognition.

― 6 min read


Image: IMUs and action recognition breakthrough. Innovative methods for sensor data integration in AI.

In our world, we gather information through many different senses. Most AI systems, however, rely mainly on visual and text data to understand human actions. A promising way to improve this understanding is to use devices called Inertial Measurement Units (IMUs). These devices can track movement, but they are often tough to work with because the data they collect is hard to interpret and sometimes scarce.

Combining Visual and Motion Data

We focus on a method that merges knowledge from visual data and data from IMUs. The core idea is to build a shared representation space that supports recognizing human actions even when one type of data has no labels. We call this method Fusion and Cross-modal Transfer (FACT). With it, we aim to train a model that learns from labeled visual data and then applies that learning to interpret IMU data without ever seeing labeled IMU examples during training.

The Challenge with Current Systems

While humans can learn new movements just by watching someone else, teaching machine learning models to do the same across different types of sensors is not straightforward. Most deep learning systems work with visual and text data simply because those data are plentiful. Yet keeping cameras running to collect visual data, or relying on text to describe what people are doing, is not always practical, which makes these systems less effective in real-world applications.

Advantages of IMUs

IMUs record signals such as acceleration and rotation from physical devices like smartwatches and smartphones, offering a less intrusive way to monitor human activity. Many wearable devices already have IMUs built in. Yet their potential is often underused in machine learning because of challenges like limited labeled data and the difficulty of interpreting the raw signals.
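To make this concrete, raw IMU streams are usually sliced into fixed-length windows of a few channels (for example, three accelerometer plus three gyroscope axes) before being fed to a model. The sketch below is a minimal illustration of that preprocessing; the sampling rate, window length, and channel count are assumptions for the example, not values from the paper.

```python
# Illustrative windowing of a raw IMU stream (shapes and rates are assumptions).
import numpy as np

def window_imu(stream, window_s=2.0, hop_s=1.0, rate_hz=50):
    """Slice a (time, channels) IMU stream into overlapping fixed-length windows."""
    win = int(window_s * rate_hz)
    hop = int(hop_s * rate_hz)
    windows = [stream[i:i + win] for i in range(0, len(stream) - win + 1, hop)]
    return np.stack(windows)          # shape: (num_windows, win, channels)

# Example: 60 seconds of 6-axis data (accel x/y/z + gyro x/y/z) at 50 Hz.
stream = np.random.randn(60 * 50, 6)
print(window_imu(stream).shape)       # (59, 100, 6)
```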

The Need for Integration

As different types of sensors become more popular, a pressing question arises: how can we use new sensors alongside established ones when no labeled data is available for them? One solution is to use the well-labeled data from one sensor to build up knowledge about the new sensor, a process known as cross-modal transfer. However, existing techniques mostly assume at least some labeled data for each sensor during training, which is rarely the case.

Our Approach

Our hypothesis is that there exists a hidden structure or space that links various sensor types, allowing for better Human Action Recognition. We explore different ways to create this structure and see if it can help transfer learning from one sensor to another, even without any labels for the second sensor.

We test our method, FACT, using data from both RGB (color) videos and IMU sensors across four different datasets. During training, we use labeled data from RGB videos and unlabeled data from IMUs. The goal is to see whether the model can learn to recognize actions from IMU data alone when tested later.

Results and Findings

Our experiments show that FACT performs significantly better than existing methods at recognizing actions from IMU data for which it never saw labels. The tests also show that the model can recognize actions from IMU data alone, demonstrating genuine cross-modal transfer.

Understanding the Model Architecture

The structure of FACT is designed to allow different components to work together during training. This flexibility means that we can easily adapt it for different types of sensors and tasks. The model comprises three main parts:

  1. Video Feature Encoder: This processes video frames using a standard network, extracting key features.
  2. IMU Feature Encoder: This uses a one-dimensional convolutional network to analyze IMU data.
  3. HAR Task Decoder: This module takes the extracted features and predicts the action being performed.

We also developed a time-aware version of FACT called T-FACT, which takes time into account when aligning and combining data from different sensors.
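As a rough illustration of how these three parts might fit together, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the layer sizes, network depths, and class count are placeholder assumptions chosen only to make the structure concrete.

```python
# Hypothetical sketch of the FACT architecture (placeholder sizes, not the authors' code).
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Maps a clip of video frames to a fixed-size feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Stand-in for a standard image backbone applied per frame.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, frames):                   # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (batch*time, feat_dim)
        return feats.view(b, t, -1).mean(dim=1)       # average over time

class IMUEncoder(nn.Module):
    """1-D convolutional encoder for raw IMU windows (accel + gyro channels)."""
    def __init__(self, channels=6, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, imu):                      # imu: (batch, channels, time)
        return self.net(imu)

class HARDecoder(nn.Module):
    """Shared head that predicts the action class from either modality's features."""
    def __init__(self, feat_dim=256, num_classes=27):   # class count is illustrative
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        return self.classifier(feats)
```

The key design point is that the HAR decoder is shared: once features from either encoder live in the same space, the same classifier can serve both modalities.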

Training and Testing Process

The training of the model consists of two steps:

  1. Learning from labeled RGB data to establish a human action recognition (HAR) model.
  2. Aligning the representations from RGB and IMU data to improve cross-modal transfer.

At test time, the model must predict actions from IMU data alone, even though it never saw labeled IMU examples during training.
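One plausible way these two stages could look in code is sketched below, reusing the hypothetical encoders and decoder from the earlier sketch. The optimizer setup, the contrastive-style alignment loss, and the temperature are our own illustrative choices, not the paper's exact recipe.

```python
# Illustrative two-stage training loop (hypothetical; reuses the encoder/decoder sketch above).
import torch
import torch.nn.functional as F

def stage1_supervised_rgb(video_enc, har_decoder, rgb_loader, optimizer):
    """Stage 1: learn human action recognition from labeled RGB clips only."""
    for frames, labels in rgb_loader:
        logits = har_decoder(video_enc(frames))
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def stage2_align(video_enc, imu_enc, paired_loader, optimizer, temperature=0.07):
    """Stage 2: pull IMU features toward the RGB features of time-synchronized,
    unlabeled recordings so the shared decoder also works on IMU input."""
    for frames, imu in paired_loader:             # no action labels are used here
        v = F.normalize(video_enc(frames), dim=-1)
        m = F.normalize(imu_enc(imu), dim=-1)
        sim = v @ m.t() / temperature             # pairwise similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        loss = F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def predict_from_imu(imu_enc, har_decoder, imu_window):
    """Test time: predict the action from IMU data alone."""
    return har_decoder(imu_enc(imu_window)).argmax(dim=-1)
```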

Experiments on Different Datasets

We conduct tests using several datasets, including UTD-MHAD, CZU-MHAD, MMACT, and MMEA-CL. Each of these datasets provides unique challenges and helps us gauge the effectiveness of the FACT method across diverse scenarios.

  1. UTD-MHAD: This dataset has multiple types of data, such as RGB, skeletal, depth, and IMU. It helps validate how well FACT can work with real-world data.
  2. CZU-MHAD: This dataset is more controlled and allows for better measurement of the model’s performance due to the consistent environment.
  3. MMACT: A larger dataset that includes various scenes where actions occur, making predictions trickier.
  4. MMEA-CL: Focused on everyday actions, this dataset tests the model's adaptability to different activities.

Overcoming Limitations

Although many studies focus on dealing with missing data during training or testing, few address the situation where no labeled data is available from one type of sensor. This gap makes it complex to establish baseline methods.

We developed baseline methods, such as student-teacher models, which usually require labeled data from both sensors. Our approach differs: FACT can operate without any labels for one sensor, using unlabeled paired data to find relationships between the modalities.
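For context, a student-teacher baseline typically distills the RGB model's soft predictions into an IMU model on synchronized recordings. The sketch below is a simplified illustration under our own assumptions (the function names, temperature, and distillation loss are ours, and it reuses the hypothetical encoders from the earlier sketches); it is not the paper's exact setup.

```python
# Minimal student-teacher distillation sketch (simplified baseline, not the paper's exact method).
import torch
import torch.nn.functional as F

def distill_step(video_enc, har_decoder, imu_enc, imu_head, frames, imu, optimizer, T=2.0):
    """One step: the frozen RGB model (teacher) supervises the IMU model (student)."""
    with torch.no_grad():
        teacher_logits = har_decoder(video_enc(frames))   # teacher predictions on the video
    student_logits = imu_head(imu_enc(imu))               # student predictions on synced IMU data
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```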

Performance Comparison with Other Models

Existing sensor fusion models are good at handling incomplete data, but they do not cope well with having zero labeled data for one modality during training. In our comparisons these models struggle relative to FACT, which can carry knowledge learned from the labeled sensor over to the unlabeled one.

We also looked at contrastive learning methods, specifically how well these could perform on our data. Some models, like ImageBind, did not work effectively with IMU data, particularly since this approach was designed for different tasks.

Additional Experiments

To probe the effectiveness of FACT, we conducted further experiments to understand its behavior, analyzing its robustness and adaptability under a variety of conditions and settings.

We also performed ablation studies to determine which training and alignment strategies produce the best results.

Conclusion

Through our research, we've discovered a promising method for transferring knowledge between different sensor types, particularly from visual data to IMUs. Our approach, FACT, demonstrates significant capabilities, even in zero-labeled training scenarios, and shows promise for practical applications in everyday technology, such as wearables and smart devices.

By creating a way to efficiently integrate various sensing modalities, FACT aims to enhance how AI understands human actions in real-world settings. In doing so, we lay the groundwork for future work in this area, opening doors to new advancements in machine learning and its applications.

Original Source

Title: C3T: Cross-modal Transfer Through Time for Human Action Recognition

Abstract: In order to unlock the potential of diverse sensors, we investigate a method to transfer knowledge between modalities using the structure of a unified multimodal representation space for Human Action Recognition (HAR). We formalize and explore an understudied cross-modal transfer setting we term Unsupervised Modality Adaptation (UMA), where the modality used in testing is not used in supervised training, i.e. zero labeled instances of the test modality are available during training. We develop three methods to perform UMA: Student-Teacher (ST), Contrastive Alignment (CA), and Cross-modal Transfer Through Time (C3T). Our extensive experiments on various camera+IMU datasets compare these methods to each other in the UMA setting, and to their empirical upper bound in the supervised setting. The results indicate C3T is the most robust and highest performing by at least a margin of 8%, and nears the supervised setting performance even in the presence of temporal noise. This method introduces a novel mechanism for aligning signals across time-varying latent vectors, extracted from the receptive field of temporal convolutions. Our findings suggest that C3T has significant potential for developing generalizable models for time-series sensor data, opening new avenues for multi-modal learning in various applications.

Authors: Abhi Kamboj, Anh Duy Nguyen, Minh Do

Last Update: 2024-11-07

Language: English

Source URL: https://arxiv.org/abs/2407.16803

Source PDF: https://arxiv.org/pdf/2407.16803

Licence: https://creativecommons.org/licenses/by-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arXiv for use of its open access interoperability.
