March 06, 2026

Teleoperation with natural language

A good starting point for programming a robot is a teleoperated simulation. One possible implementation is a Python video game in which a human controls a robot gripper with the mouse. Such a system simulates a real-world scenario in which a human likewise controls a robot arm and grasps objects with a joystick.
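The mouse-driven gripper can be sketched in a few lines. The following is a minimal illustration, not a full game: the class name, fields and the smoothing factor are assumptions, and a real implementation would embed this in a game loop (e.g. pygame) together with a physics engine.

```python
from dataclasses import dataclass

@dataclass
class Gripper:
    """Illustrative teleoperated gripper that tracks the mouse cursor."""
    x: float = 0.0
    y: float = 0.0

    def follow(self, mouse_x: float, mouse_y: float, alpha: float = 0.2) -> None:
        # Move a fraction alpha of the remaining distance each frame,
        # which smooths out jittery mouse input.
        self.x += alpha * (mouse_x - self.x)
        self.y += alpha * (mouse_y - self.y)

g = Gripper()
g.follow(10.0, 0.0, alpha=0.5)  # gripper moves halfway toward the cursor
```

Calling `follow` once per frame yields a simple low-pass filter on the operator's hand motion.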

The main disadvantage of teleoperation, both in reality and in simulation, is that the human operator is needed all the time. Even though it is technically easy to implement, the inability to run the system autonomously is a major problem. So the question is how to increase the autonomy of the robot slightly without using very advanced AI techniques such as vision language action (VLA) models.

The idea is to introduce two constraints: first, only the communication from the robot to the human is improved, not the other way around; second, the robot doesn't need to verbalize the scene in an elaborate style, it is enough if the robot annotates the scene with [tags] like [gripper_open], [collision_gripper_box] and [box_isfalling]. Each tag is a boolean value, and the entire tag space is stored in a binary feature vector.

The task for the programmer is to convert the existing numerical information from the physics engine, such as the position and rotation of the Box2D objects, into a semantic tag space consisting of three or more different tags. In other words, the translation process amounts to climbing upwards in the DIKW pyramid.
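This numeric-to-semantic translation can be sketched as a set of threshold rules. The state fields and threshold values below are illustrative assumptions; a real system would read jaw gap, distances and velocities from the Box2D bodies.

```python
# Hypothetical thresholds (not from any concrete implementation).
GRIPPER_OPEN_GAP = 0.05   # jaw separation above which the gripper counts as open
CONTACT_DIST = 0.02       # gripper-to-box distance treated as a collision
FALL_SPEED = 0.5          # downward speed above which the box counts as falling

def annotate(state: dict) -> dict:
    """Map raw physics values (distances, velocities) to boolean tags."""
    return {
        "gripper_open": state["jaw_gap"] > GRIPPER_OPEN_GAP,
        "collision_gripper_box": state["gripper_box_dist"] < CONTACT_DIST,
        "box_isfalling": state["box_vy"] < -FALL_SPEED,
    }

tags = annotate({"jaw_gap": 0.1, "gripper_box_dist": 0.01, "box_vy": -1.0})
# tags == {"gripper_open": True, "collision_gripper_box": True, "box_isfalling": True}
```

Each rule maps low-level data to a higher-level symbol, which is exactly the step up the DIKW pyramid the text describes.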

The resulting system remains a teleoperated robot, but the improved software gives textual feedback to the human operator. The human operator performs a task, e.g. stacking two boxes on top of each other, and the robot annotates the activities with a tagging mechanism.
