March 03, 2026

The slow transition from teleoperation towards grounded language

For decades, teleoperation was imagined as joystick-based control. The human operator moves the joystick forward, and the RC car moves forward as well. Such a system has no built-in artificial intelligence and can be described entirely in mechanical and electrical terms. The only technical requirement is that the control signal from the remote device reaches the RC car, which allows the human to control the machine.

Implementing artificial intelligence doesn't mean deciding on a different control system; artificial intelligence is only a small improvement over existing numerical teleoperation. What is called AI is technically voice-based teleoperation. Instead of submitting a numerical signal to the RC car, a sentence is submitted, such as "move 30 cm ahead and then stop". Decoding such a signal is more demanding than building a classical RC car, but it remains an engineering problem. It is possible to imagine a text-to-servo-control parser realized in software.
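Such a text-to-servo-control parser can be sketched in a few lines. The following is a toy illustration only: the grammar, the centimeter unit, and the tuple-based output format are assumptions made for this example, not a real robot API.

```python
import re

def parse_command(sentence):
    """Translate a simple English movement command into low-level
    instructions. Toy sketch: grammar and output format are invented."""
    commands = []
    # Match phrases like "move 30 cm ahead" or "move 10 cm back".
    m = re.search(r"move (\d+) cm (ahead|back)", sentence)
    if m:
        distance = int(m.group(1))
        direction = 1 if m.group(2) == "ahead" else -1
        commands.append(("drive", direction * distance))
    if "stop" in sentence:
        commands.append(("stop", 0))
    return commands

print(parse_command("move 30 cm ahead and then stop"))
# → [('drive', 30), ('stop', 0)]
```

A real parser would need a much richer grammar, but the principle is the same: map a sentence onto a sequence of numeric control signals.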

Even voice-based teleoperation remains an example of teleoperation. The RC car doesn't act autonomously; it reacts to the input of the human operator. The difference is that the human input is given at a higher abstraction level. Instead of pressing a joystick button throughout the task, the human operator formulates the task only once and the robot then executes it.

This kind of interaction can only be realized with natural language. Natural language acts as an abstraction mechanism which replaces low-level servo control. An abstract command first needs to be translated into low-level signals: the command "move until waypoint D and rotate left" can't be parsed directly by the RC car's electronics but must be translated first. This translation takes place within the DIKW pyramid from top to bottom, and it is called symbol grounding.
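The grounding step for the waypoint command above can be sketched as a lookup from symbols to coordinates. The waypoint table, the primitive motion names, and the 90-degree rotation convention are all hypothetical choices for this illustration.

```python
# Hypothetical map: waypoint symbols grounded to coordinates in meters.
WAYPOINTS = {"D": (2.0, 1.5)}

def ground(command, position=(0.0, 0.0)):
    """Translate an abstract command into primitive motions (toy sketch)."""
    primitives = []
    for token in command.split(" and "):
        token = token.strip()
        if token.startswith("move until waypoint"):
            target = WAYPOINTS[token.split()[-1]]
            # Ground the symbolic waypoint into a relative displacement.
            dx = target[0] - position[0]
            dy = target[1] - position[1]
            primitives.append(("drive_to", dx, dy))
            position = target
        elif token == "rotate left":
            primitives.append(("rotate", 90))  # degrees, assumed convention
    return primitives

print(ground("move until waypoint D and rotate left"))
# → [('drive_to', 2.0, 1.5), ('rotate', 90)]
```

The point is the direction of the translation: from a symbol ("waypoint D") at the top of the pyramid down to numbers a motor controller can act on.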

It should be mentioned that it is technically a bit tricky to realize such a grounding algorithm in software. The initial situation is that computers only understand numerical information and can't interpret natural language. That is the reason why a programming language is used to instruct a computer to do a task. Converting an English sentence directly into computer instructions is demanding, and it is no surprise that it took decades until computer scientists realized the task.

There are two notable projects with the goal of voice-controlled robots. Both were developed late in the timeline of computing. In 2003, the Ripley robot was developed by Deb Roy. It is a robot arm controlled by natural language that can grasp simple objects on a table. The second project is the MIT forklift from 2010, developed by Stefanie Tellex, which also appeared late in the history of computer science. The forklift understands basic commands like "move the pallet to the truck" and executes the desired trajectory.

In addition, the SHRDLU project from 1968 should be mentioned. In contrast to the MIT robots, SHRDLU was limited to a virtual world: it was a computer program without access to physical sensors and actuators. All the mentioned projects can be called advanced demonstrations, because they were realized at research universities with a large amount of code.

So we can say that it is technically possible to program a voice-controlled robot, but it is a demanding task which requires expert knowledge in computer science. With the advent of deep learning, new ideas were implemented. Instead of programming a parser algorithm, the software is based on a neural network architecture trained on a dataset. This allows the approach to scale to more words and more robotics domains. The goal of a modern vision-language-action model is the same as for the Ripley robot from 2003: to control a machine with natural language.
