Advancing Human Motion Synthesis for Object Interaction
A new method improves the synthesis of full-body human movements for interacting with objects.
― 5 min read
Creating realistic human movements that involve interacting with objects is important for areas like video games, virtual reality, and robotics. In real life, people use their whole bodies to handle various objects while doing everyday tasks. This work focuses on generating full-body human motion for interactions with large objects.
The Problem
People manipulate all kinds of objects in everyday life. For instance, they might pull a mop, move a lamp, or place items on a desk. Accurately simulating these actions on a computer is a challenge. While there has been progress in animating characters that respond to stationary objects, far less work addresses objects that are themselves being moved, and most existing datasets capture people navigating around fixed items.
The Approach
To tackle this issue, we propose a new method called Object MOtion guided human MOtion synthesis (OMOMO). This system uses a conditional diffusion model to create full-body movements based only on how an object is moving. Generating the whole body in a single pass makes it hard to keep the hands precisely on the object, so OMOMO works in two steps. The first step predicts where the person's hands should be, based on the object's motion. The second step uses these hand positions to generate the full body's movements. Because the hand positions serve as an explicit intermediate representation, contact between the hands and the object can be enforced directly, leading to more realistic actions.
Furthermore, using the learned model, we designed a system that captures full-body human manipulation motion with just a smartphone attached to the object. Because the model needs only the object's motion as input, tracking the phone's movement is enough to reconstruct how a person moved while handling the object, with no sensors attached to the person.
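To make the idea concrete, here is a minimal sketch of how such a capture system could turn the phone's tracked pose stream into an object trajectory for the model. The file format, column order, and phone-to-object offset below are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: converting a logged smartphone pose stream into an object
# trajectory. Assumes the phone is rigidly attached to the object and logs
# timestamped 6-DoF poses to a CSV with columns: t, tx, ty, tz, qx, qy, qz, qw.
import numpy as np
from scipy.spatial.transform import Rotation as R

def load_phone_poses(csv_path):
    """Return (timestamps, stacked 4x4 world-from-phone transforms)."""
    data = np.loadtxt(csv_path, delimiter=",", skiprows=1)
    t = data[:, 0]
    T = np.tile(np.eye(4), (len(data), 1, 1))
    T[:, :3, 3] = data[:, 1:4]                              # translations
    T[:, :3, :3] = R.from_quat(data[:, 4:8]).as_matrix()    # rotations (x, y, z, w)
    return t, T

def phone_to_object_trajectory(T_world_phone, T_phone_object):
    """Compose a fixed phone-to-object offset onto every phone pose."""
    return T_world_phone @ T_phone_object    # (N, 4, 4) object poses in the world frame

# Purely illustrative rigid offset between the phone frame and the object frame.
T_phone_object = np.eye(4)
T_phone_object[:3, 3] = [0.0, -0.05, 0.12]

timestamps, T_world_phone = load_phone_poses("object_phone_log.csv")
object_trajectory = phone_to_object_trajectory(T_world_phone, T_phone_object)
# `object_trajectory` would then serve as the conditioning input to the trained model.
```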
Data Collection
One of the biggest challenges is the lack of high-quality datasets that show how humans move while interacting with objects. To fill this gap, we collected a large dataset featuring 3D models of 15 common objects and approximately 10 hours of corresponding human motion.
The objects we focused on include everyday items such as a vacuum cleaner, a mop, and a chair. To gather the 3D models, we filmed videos of each object and used software to create the 3D shapes from these videos. We also captured human movement data using motion capture technology, which helps record how people move in real time.
Methodology
Object Motion Capture
We carefully selected the objects for our study. Each object was filmed from different angles to create a detailed 3D model. Using software, we then removed noise and artifacts from the reconstructed meshes and refined them so they were accurate enough for our experiments.
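As an illustration of this kind of clean-up, the sketch below uses Open3D (our choice here, not necessarily the software used in the paper) to remove common reconstruction artifacts from a scanned mesh; the file names and decimation target are placeholders.

```python
# Illustrative mesh clean-up pass for a reconstructed object scan.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("vacuum_cleaner_raw.obj")

# Remove common reconstruction artifacts.
mesh.remove_duplicated_vertices()
mesh.remove_duplicated_triangles()
mesh.remove_degenerate_triangles()
mesh.remove_non_manifold_edges()
mesh.remove_unreferenced_vertices()

# Decimate to a manageable triangle count and recompute normals.
mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=20000)
mesh.compute_vertex_normals()

o3d.io.write_triangle_mesh("vacuum_cleaner_clean.obj", mesh)
```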
Human Motion Capture
To capture how humans interact with these objects, we invited volunteers to perform various tasks while wearing motion sensors. Each session lasted around 1.5 to 2 hours, during which we recorded their interactions with the objects.
Data Processing
After collecting the data, we processed it to pair each object's geometry and motion with the corresponding human motion, cleaning and aligning the recordings so they were suitable for training our models.
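One typical step in this kind of processing (an assumption on our part, not a detail given in the summary) is resampling the object and human motion streams, which may be recorded at different rates, onto a shared timeline. The file names and frame rate below are illustrative.

```python
# Resample object and human motion onto a common 30 fps grid (illustrative sketch).
import numpy as np

def resample(target_t, timestamps, values):
    """Linearly interpolate an (N, D) signal onto the target timestamps."""
    cols = [np.interp(target_t, timestamps, values[:, d]) for d in range(values.shape[1])]
    return np.stack(cols, axis=1)

# Illustrative inputs: object translations and mocap joints with their own timestamps.
t_obj, object_pos = np.load("object_t.npy"), np.load("object_pos.npy")     # (N,), (N, 3)
t_mocap, joints = np.load("mocap_t.npy"), np.load("mocap_joints.npy")      # (M,), (M, J*3)

# Shared 30 fps grid over the interval where both recordings overlap.
fps = 30.0
t0, t1 = max(t_obj[0], t_mocap[0]), min(t_obj[-1], t_mocap[-1])
target_t = np.arange(t0, t1, 1.0 / fps)

object_pos_30 = resample(target_t, t_obj, object_pos)
joints_30 = resample(target_t, t_mocap, joints)
```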
System Design
Two-Stage Synthesis
Our OMOMO system features a two-stage process. In the first stage, we focus on predicting hand positions based on how objects are moving. This step ensures that the hands are accurately placed on the objects during interaction.
In the second stage, we take the predicted hand positions and generate full-body movements. This approach allows us to maintain realistic contact with the object, making the actions look more believable.
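The sketch below shows what this two-stage inference could look like in code. The model interfaces (`hand_model.sample`, `body_model.sample`) and the nearest-vertex projection are hypothetical placeholders standing in for the trained denoisers and the paper's actual contact enforcement.

```python
# Conceptual sketch of the two-stage pipeline: object motion -> hands -> full body.
import torch

def project_to_nearest_vertex(points, vertices):
    """Snap each 3-D point to the closest object vertex (a crude stand-in for a
    proper surface projection)."""
    flat = points.reshape(-1, 3)
    d = torch.cdist(flat, vertices)            # (P, V) pairwise distances
    return vertices[d.argmin(dim=1)].reshape(points.shape)

@torch.no_grad()
def synthesize_full_body(object_motion, hand_model, body_model, object_vertices=None):
    """object_motion: (T, D_obj) object poses; returns (T, D_body) full-body poses."""
    # Stage 1: sample hand positions conditioned on the object motion.
    hand_positions = hand_model.sample(cond=object_motion)        # (T, 2, 3)

    # Because the hands are an explicit intermediate representation, a contact
    # constraint can be imposed on them directly before stage 2.
    if object_vertices is not None:
        hand_positions = project_to_nearest_vertex(hand_positions, object_vertices)

    # Stage 2: sample full-body poses conditioned on the (refined) hand positions.
    return body_model.sample(cond=hand_positions)                 # (T, D_body)
```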
Diffusion Model
The core of OMOMO is a diffusion model. During training, such a model learns to reverse a process that gradually corrupts motion data with noise; at generation time, it starts from random noise and removes it step by step, guided here by the object's motion, until a clean motion sequence emerges. The result is refined full-body movement that matches the desired object interaction.
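For readers unfamiliar with diffusion models, the following is a generic conditional DDPM-style sampling loop that illustrates the "start from noise and denoise step by step" idea; the noise schedule and the denoiser interface are textbook placeholders rather than the paper's exact formulation.

```python
# Generic conditional DDPM sampling loop (illustrative, not the paper's model).
import torch

def ddpm_sample(denoiser, cond, shape, n_steps=1000, device="cpu"):
    """Draw one sample of the given shape, conditioned on `cond` (e.g. object motion)."""
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)                  # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoiser(x, torch.tensor([t], device=device), cond)   # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])                 # posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)       # re-inject noise
    return x
```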
Results
We conducted numerous tests to see how well our system performed. We compared OMOMO with existing methods, including simple versions of our system and other established techniques. The results showed that our two-stage model produced more realistic interactions and maintained better contact between the hands and the objects.
Evaluation Metrics
To assess our system, we looked at several factors:
- Movement Accuracy: We measured how closely the generated movements matched the actual recorded movements.
- Physical Plausibility: This involved checking if the hands were in contact with the objects as they should be and determining if any parts of the body were passing through the objects.
Overall, our method outperformed the alternatives based on these criteria.
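As a rough illustration of these two kinds of metrics, the sketch below computes a mean per-joint position error and a simple hand-object contact ratio; the paper's exact metric definitions may differ.

```python
# Simplified stand-ins for the metrics listed above (illustrative only).
import numpy as np

def mean_joint_error(pred_joints, gt_joints):
    """Mean Euclidean distance between predicted and ground-truth joints.
    Both arrays have shape (T, J, 3); the result is in the same units (e.g. metres)."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def contact_ratio(hand_positions, object_vertices, threshold=0.05):
    """Fraction of frames in which at least one hand is within `threshold` of the object.
    hand_positions: (T, 2, 3); object_vertices: (T, V, 3), both in world coordinates."""
    # Distance from each hand to the closest object vertex, per frame.
    d = np.linalg.norm(hand_positions[:, :, None, :] - object_vertices[:, None, :, :], axis=-1)
    closest = d.min(axis=-1)                       # (T, 2)
    return (closest.min(axis=-1) < threshold).mean()
```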
User Studies
To further validate our results, we conducted user studies. Participants were shown pairs of motion sequences (one from our method and one from a baseline) and asked which appeared more natural. The feedback indicated that our system's outputs were preferred for their realism.
Limitations
Despite the successes, there are limitations to our approach. For one, the current datasets do not adequately reflect how fingers and hands perform detailed tasks like gripping or fine manipulation. As a result, some generated movements may appear unrealistic.
Another limitation is the handling of intermittent contact with objects. Our system currently ensures hands remain in contact, which means it struggles with scenarios where the hands might briefly lift away from an object.
Future Work
To improve our method, future research could introduce better models that account for more complex movements of the hands. Adding physics-based simulations could also help reduce anomalies in motion and improve realism.
We also plan to expand our dataset to include more diverse and intricate interactions. This would provide a better foundation for training models that can handle a wider variety of tasks and objects.
Conclusion
In summary, we have introduced a new framework for synthesizing full-body human motion for object manipulation. By conditioning on how objects move, our method generates full-body human movements that are more realistic than previous approaches. Our system's ability to capture these interactions using only a smartphone attached to the object opens up new opportunities for practical applications in animation and robotics.
As we continue to develop this technology, we hope to create even more lifelike simulations of human behavior, making virtual environments feel richer and more engaging.
Title: Object Motion Guided Human Motion Synthesis
Abstract: Modeling human behaviors in contextual environments has a wide range of applications in character animation, embodied AI, VR/AR, and robotics. In real-world scenarios, humans frequently interact with the environment and manipulate various objects to complete daily tasks. In this work, we study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. Since naively applying diffusion models fails to precisely enforce contact constraints between the hands and the object, OMOMO learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions. By employing the hand positions as an intermediate representation between the two denoising processes, we can explicitly enforce contact constraints, resulting in more physically plausible manipulation motions. With the learned model, we develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated. Through extensive experiments, we demonstrate the effectiveness of our proposed pipeline and its ability to generalize to unseen objects. Additionally, as high-quality human-object interaction datasets are scarce, we collect a large-scale dataset consisting of 3D object geometry, object motion, and human motion. Our dataset contains human-object interaction motion for 15 objects, with a total duration of approximately 10 hours.
Authors: Jiaman Li, Jiajun Wu, C. Karen Liu
Last Update: 2023-09-28 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2309.16237
Source PDF: https://arxiv.org/pdf/2309.16237
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.