Using Video Analysis for Home Exercise Evaluation
Leveraging online fitness videos to assess exercise performance.
― 5 min read
As technology advances, using video to assess how people exercise at home is gaining attention. However, one major challenge is the scarcity of labeled exercise videos for training systems that evaluate performance. To tackle this, we can use the many fitness videos available online, especially on platforms like YouTube, which not only demonstrate exercises but also include spoken instructions and tips.
Importance of Exercise Evaluation
Physical therapy is important for treating injuries and health conditions. It helps people recover, avoid disabilities, and maintain independence. While having a professional guide exercises is ideal, many individuals cannot access regular training sessions. Home workouts can still be effective without expert supervision. However, if someone performs exercises incorrectly, it may lead to injury or hinder recovery. An automated system that evaluates exercise performance can help reduce these risks by providing real-time feedback.
Utilizing Video Data
With advances in computer vision and motion tracking, many researchers focus on understanding human movement through videos. A common approach uses tools like MediaPipe Pose, which extracts body-position data from video frames. Some systems provide feedback on how well a person performs an exercise based on this pose data, but they rely on existing labeled datasets, which are limited.
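As an illustration of what such pose data enables, the sketch below computes a joint angle from three 2D landmark coordinates. The coordinates and the shoulder/elbow/wrist choice are hypothetical stand-ins for values a tool like MediaPipe Pose would report.

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (in degrees) between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))

# Hypothetical normalized (x, y) coordinates for left shoulder, elbow, wrist.
shoulder, elbow, wrist = (0.40, 0.30), (0.50, 0.50), (0.50, 0.80)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

Tracking how such an angle changes across frames is one simple way a feedback system can characterize push-up depth.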
To overcome the challenge of lacking labeled data, we suggest using the vast number of fitness videos on social media sites. These videos, created by various individuals including trainers and enthusiasts, often feature both correct and incorrect exercise demonstrations along with descriptions and tips through subtitles.
Methodology
For our study, we chose push-ups as a simple yet effective exercise example. Our approach involves analyzing subtitles from YouTube videos using natural language processing (NLP) to categorize parts of the videos into relevant and irrelevant sections. We aim to label the relevant clips as correct or incorrect and create a sizable dataset for training machine learning systems.
Video Search Criteria
We searched YouTube using specific phrases related to push-ups. This included terms like "learn push up," "push up mistakes," and "correct push up." After filtering through the results to ensure they included English subtitles and removing duplicates, we ended up with 67 relevant videos.
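The filtering step can be sketched as a single pass over search results. The dictionary fields (`id`, `subtitle_langs`) are hypothetical illustrations, not an actual YouTube API schema.

```python
def filter_videos(results):
    """Keep only videos with English subtitles, dropping duplicate IDs."""
    seen, kept = set(), []
    for video in results:
        if video["id"] in seen:
            continue  # duplicate result from another search phrase
        if "en" not in video.get("subtitle_langs", []):
            continue  # no English subtitles available
        seen.add(video["id"])
        kept.append(video)
    return kept

# Toy example: one duplicate and one video without English subtitles.
results = [
    {"id": "a", "subtitle_langs": ["en"]},
    {"id": "a", "subtitle_langs": ["en"]},
    {"id": "b", "subtitle_langs": ["de"]},
    {"id": "c", "subtitle_langs": ["en", "de"]},
]
kept = filter_videos(results)
```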
Classification of Subtitles
Initially, we scanned the subtitles to find broad irrelevant sections using keywords and anti-keywords related to push-ups. If we found terms indicating unrelated exercises, we marked those sections as irrelevant. After that, we broke the remaining text into sentences for a deeper analysis.
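A minimal version of this coarse first pass might flag timed subtitle cues that mention unrelated exercises. The anti-keyword list and the cue format `(start, end, text)` are illustrative assumptions, not the study's exact rules.

```python
ANTI_KEYWORDS = ("squat", "lunge", "burpee", "subscribe")  # hypothetical list

def mark_irrelevant_sections(cues):
    """cues: list of (start_s, end_s, text) subtitle entries.
    Returns (start_s, end_s, is_irrelevant) flags for each cue."""
    flagged = []
    for start, end, text in cues:
        lower = text.lower()
        flagged.append((start, end, any(w in lower for w in ANTI_KEYWORDS)))
    return flagged

cues = [
    (0.0, 4.0, "Today we cover the perfect push up"),
    (4.0, 9.0, "But first, a quick squat warm-up"),
]
flags = mark_irrelevant_sections(cues)
```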
Then, we utilized another set of keywords and anti-keywords to classify sentences based on whether they focused on push-ups or other topics. We assumed that relevant dialogue would coincide with the visual of someone performing the push-up. To further improve accuracy, we checked for full-body visibility in each video using MediaPipe Pose.
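The visibility check can be sketched as a per-clip ratio. The per-frame lists below stand in for the per-landmark visibility scores a tool like MediaPipe Pose reports, and the 0.5 threshold is an assumption.

```python
def full_body_ratio(frames, threshold=0.5):
    """frames: one list of landmark visibility scores (0..1) per frame.
    Fraction of frames in which every landmark clears the threshold."""
    visible = sum(all(v > threshold for v in frame) for frame in frames)
    return visible / len(frames)

# Three toy frames with two landmarks each; the middle frame is occluded.
frames = [[0.9, 0.8], [0.9, 0.2], [0.7, 0.6]]
ratio = full_body_ratio(frames)
```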
Assessing Exercise Quality
After identifying relevant sections, we classified whether the push-ups were performed correctly or incorrectly. To do this, we trained a model on a set of manually labeled sentences, allowing it to learn the differences between correct and incorrect form.
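As a sketch of such a sentence-level classifier, the toy Naive Bayes below learns from a handful of hand-labeled sentences. The training sentences are invented examples for illustration, not data from the study.

```python
from collections import Counter
import math

def train(labeled):
    """labeled: list of (sentence, label) pairs, manually annotated.
    Returns per-label word counts and label frequencies."""
    counts, priors = {}, Counter()
    for text, label in labeled:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts, priors

def predict(counts, priors, sentence):
    """Pick the label with the highest smoothed log-probability."""
    vocab = {w for c in counts.values() for w in c}
    total = sum(priors.values())
    scores = {}
    for label, c in counts.items():
        n = sum(c.values())
        lp = math.log(priors[label] / total)
        for w in sentence.lower().split():
            lp += math.log((c[w] + 1) / (n + len(vocab)))  # add-one smoothing
        scores[label] = lp
    return max(scores, key=scores.get)

labeled = [
    ("keep your back straight", "correct"),
    ("great straight body line", "correct"),
    ("hips sagging is a mistake", "incorrect"),
    ("don't flare your elbows", "incorrect"),
]
counts, priors = train(labeled)
```

A bag-of-words model like this only captures wording, which is one reason the text signal is paired with pose-based checks.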
For the incorrectly executed push-ups, we generated descriptive labels that detail common mistakes. Instead of summarizing multiple sentences, we extracted key phrases directly from individual sentences to highlight errors accurately.
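One simple way to pull such phrases out is to keep the words that follow a negative cue. The cue list here is a hypothetical heuristic, not the study's exact rule.

```python
import re

CUE = re.compile(r"(?:don't|do not|avoid|stop|never)\s+(.+)", re.IGNORECASE)

def error_phrase(sentence):
    """Return the phrase following a negative cue word, if any."""
    match = CUE.search(sentence)
    return match.group(1).rstrip(".!?") if match else None

label = error_phrase("Avoid sagging your hips.")
```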
Segmenting Videos
Using the methods we devised, we segmented all 67 videos into three categories: irrelevant clips, correct clips, and incorrect clips. The goal was to use this organized data for future training of a classification system based on computer vision techniques.
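Once each subtitle cue carries a label, consecutive cues with the same label can be merged into clips. The `(start, end, label)` tuples below are an assumed intermediate format, not the study's actual data structure.

```python
def merge_segments(cues):
    """cues: time-ordered (start_s, end_s, label) tuples.
    Merge touching cues that share a label into single clips."""
    clips = []
    for start, end, label in cues:
        if clips and clips[-1][2] == label and start <= clips[-1][1]:
            clips[-1] = (clips[-1][0], max(clips[-1][1], end), label)
        else:
            clips.append((start, end, label))
    return clips

cues = [(0, 2, "irrelevant"), (2, 5, "irrelevant"), (5, 8, "correct")]
clips = merge_segments(cues)
```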
Results
Visibility Observations
We began our evaluation by analyzing the visibility of body landmarks in different video clips. We noticed that clips marked as irrelevant generally had lower visibility for crucial body parts compared to relevant clips. This suggests that poor visibility is associated with non-relevant content.
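A minimal way to express this observation is a gap between group means. The per-clip average visibility values below are invented for illustration, not the study's measurements.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical per-clip average landmark visibility scores (0..1).
relevant_vis = [0.91, 0.88, 0.93, 0.90]
irrelevant_vis = [0.55, 0.61, 0.48, 0.57]
visibility_gap = mean(relevant_vis) - mean(irrelevant_vis)
```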
Clustering Analysis
Next, we looked at clustering frames from all clips to identify patterns. By analyzing these clusters, we found that the frames marked as relevant, whether correct or incorrect, mostly showed individuals in proper push-up positions. In contrast, irrelevant clips showed individuals facing the camera with improper positioning.
When we examined the clusters separately for each category, we observed further distinctions. Clips deemed relevant and correct generally showed standard push-up positions, while the incorrect clips displayed various forms of mistakes. This clustering provided a clearer picture of how the data could differentiate between correct and incorrect exercises.
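The clustering itself can be sketched with a minimal k-means over pose feature vectors. Real inputs would be landmark coordinates per frame; the 2-D points here are toy data chosen to form two obvious groups.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on feature vectors (lists of floats)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its group emptied.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Two well-separated toy clusters standing in for pose feature vectors.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centroids = sorted(kmeans(points, 2), key=lambda c: c[0])
```

Inspecting the resulting centroids, as the study does, gives a representative pose per cluster that can be checked against the correct/incorrect labels.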
Conclusion
Our study highlights the potential of using video analysis combined with natural language processing to generate labeled data for understanding human movements in exercise videos. While there are challenges, such as inaccuracies in subtitle synchronization and the complexity of assessing correctness, our work sets a solid foundation for future exploration.
Moving forward, we plan to refine our clustering techniques and incorporate more advanced motion analysis to enhance the quality of our classifications. We also aim to explore more complex language models that can improve understanding in different contexts. By developing biomechanical models, we hope to detect incorrect movements more effectively and contribute to better training and rehabilitation practices.
This approach to segmenting and categorizing online exercise videos based solely on subtitle analysis demonstrates a promising pathway for future research in the field of exercise science and technology.
Title: Automatic Generation of Labeled Data for Video-Based Human Pose Analysis via NLP applied to YouTube Subtitles
Abstract: With recent advancements in computer vision as well as machine learning (ML), video-based at-home exercise evaluation systems have become a popular topic of current research. However, performance depends heavily on the amount of available training data. Since labeled datasets specific to exercising are rare, we propose a method that makes use of the abundance of fitness videos available online. Specifically, we utilize the advantage that videos often not only show the exercises, but also provide language as an additional source of information. With push-ups as an example, we show that through the analysis of subtitle data using natural language processing (NLP), it is possible to create a labeled (irrelevant, relevant correct, relevant incorrect) dataset containing relevant information for pose analysis. In particular, we show that irrelevant clips ($n=332$) have significantly different joint visibility values compared to relevant clips ($n=298$). Inspecting cluster centroids also shows different poses for the different classes.
Authors: Sebastian Dill, Susi Zhihan, Maurice Rohr, Maziar Sharbafi, Christoph Hoog Antink
Last Update: 2023-05-02
Language: English
Source URL: https://arxiv.org/abs/2304.14489
Source PDF: https://arxiv.org/pdf/2304.14489
Licence: https://creativecommons.org/publicdomain/zero/1.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.