July 02, 2025

VLA models for reproducing motion capture trajectories

For decades, an important but unsolved problem has persisted in robotics: how to reproduce a motion capture demonstration? The typical situation was that a teleoperated robot was able to pick and place objects in a kitchen, and all the data were recorded with a computer, but replaying these data didn't work. The reason is that if the same motor movements are sent to the robot during the replay step, they result in chaotic behavior, because the objects are in different positions and new obstacles may be present that were not there during the demonstration.

The inability to replay recorded movement prevented the development of more advanced robots and was a major criticism of motion capture and teleoperation in general. Some techniques, such as kinesthetic teaching and preprogramming of keyframes, were used in robotics to overcome the bottleneck, but these minor improvements didn't solve the underlying problem.

A possible answer to the replay problem in motion capture are vision language action models, which should be explained briefly. The idea is to create an additional layer which is formulated in natural language. A neural network converts the mocap recording into natural language, and then actions are generated from the perceived symbols. The natural language layer increases the robustness and allows fixing possible errors in the motion planner. The AI engineer can see in the textual logfile why the robot has failed at a certain task: for example, a certain object was labeled wrong, or the motion planner has generated a noisy trajectory. These detail problems can be fixed within the existing pipeline.
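To make this concrete, here is a minimal Python sketch of such a pipeline. The class names, the scene description and the log format are placeholders invented for illustration; they are not the API of an existing VLA framework.

    import logging

    logging.basicConfig(filename="robot_run.log", level=logging.INFO)
    log = logging.getLogger("vla_pipeline")

    class SceneDescriber:
        """Stand-in for a neural network that turns perception into text."""
        def describe(self, frame):
            # A real model would label objects and poses from camera/mocap data.
            return "cup at {}, table is cluttered".format(frame["cup"])

    class MotionPlanner:
        """Stand-in for a planner that turns a textual goal into actions."""
        def plan(self, scene_text, goal_text):
            log.info("scene: %s | goal: %s", scene_text, goal_text)
            # A real planner would return joint-space waypoints here.
            return ["move_arm_above_cup", "close_gripper", "lift"]

    def replay_demonstration(frames, goal_text):
        describer, planner = SceneDescriber(), MotionPlanner()
        actions = []
        for frame in frames:
            # Actions are re-generated from the current scene instead of
            # blindly replaying the recorded motor commands.
            scene_text = describer.describe(frame)
            actions = planner.plan(scene_text, goal_text)
            log.info("generated actions: %s", actions)
        return actions

    if __name__ == "__main__":
        demo = [{"cup": (0.42, 0.10, 0.85)}]
        replay_demonstration(demo, "place the cup into the drawer")

The important point is the logfile: every decision of the pipeline is written down in plain text, so a failed run can be traced back to a wrong label or a bad trajectory.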

Vision language action models (VLA models for short) solve the symbol grounding problem. They translate low-level sensory perception into high-level natural language. The resulting symbolic state space has the same syntax as a text adventure and can be solved with existing PDDL-like planners. Let me give an example for a longer planning horizon.

Suppose a robot should clean up a kitchen. At first, the needed steps are generated on a high-level layer, e.g. removing the objects from the table, transporting the objects into the drawer, cleaning the table, and cleaning the floor. These abstract steps are formulated in words, similar to a plan in a text adventure. In the next step the high-level actions are translated into low-level servo commands. The servo commands are sent to the robot, which cleans up the kitchen.
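The same kitchen plan can be written down as a short Python sketch. The skill library, joint names and angles below are made-up values that only illustrate how abstract, text-adventure-like steps expand into servo commands; a real system would query a motion planner instead of a fixed lookup table.

    HIGH_LEVEL_PLAN = [
        "remove objects from the table",
        "transport objects into the drawer",
        "clean the table",
        "clean the floor",
    ]

    # Hypothetical skill library: abstract action -> servo commands
    # (joint name, target angle in degrees).
    SKILL_LIBRARY = {
        "remove objects from the table": [("shoulder", 30), ("gripper", 0)],
        "transport objects into the drawer": [("base", 90), ("gripper", 100)],
        "clean the table": [("wrist", 45), ("shoulder", 10)],
        "clean the floor": [("base", 180), ("shoulder", -20)],
    }

    def translate(plan):
        """Expand each high-level step into low-level servo commands."""
        for step in plan:
            for joint, angle in SKILL_LIBRARY[step]:
                yield step, joint, angle

    if __name__ == "__main__":
        for step, joint, angle in translate(HIGH_LEVEL_PLAN):
            print("{:38s} -> set {} to {} deg".format(step, joint, angle))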

The single point of failure is the translation between the high-level and the low-level layer. The robot needs to convert sensory perception into language, and language into motor actions. A VLA model implements such a translation.
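In code, this translation boils down to two functions. The following interface is a hypothetical sketch with canned return values, not the signature of a published VLA model; a real implementation would run learned vision and policy networks inside both methods.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Observation:
        image: bytes                 # camera frame
        joint_angles: List[float]    # current robot state

    class VLAModel:
        def perception_to_text(self, obs: Observation) -> str:
            """Low-level sensors -> high-level language (symbol grounding)."""
            return "a mug is standing on the left edge of the table"

        def text_to_motor(self, instruction: str,
                          obs: Observation) -> List[Tuple[str, float]]:
            """Language -> motor actions, conditioned on the current scene."""
            return [("shoulder", 0.3), ("elbow", -0.1), ("gripper", 1.0)]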
