March 02, 2026

Teleoperation with joystick and natural language

In the past, teleoperation was realized with a joystick. The human operator navigates a robot by moving the joystick forward and backward. This allows precise movement, and the robot can perform very complex tasks. The same principle is used for construction cranes and for joystick-controlled UAVs.

Even though joystick-based teleoperation works well, there is a bottleneck: a human operator is needed all the time. A single human can control a single robot; controlling two UAVs at the same time with a single operator is difficult or even impossible. From a technical perspective, a drone can receive signals at a higher frequency, but the human operator isn't able to generate the signals fast enough. To address this bottleneck, a different sort of teleoperation is needed, one located at a higher level of abstraction.

A slight improvement over joystick-based teleoperation is waypoint navigation. The human operator selects waypoints on a map, and the robot moves along the resulting trajectory. This reduces the operator's workload: once the robot knows the next waypoint, it can navigate to the target by itself.
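As a minimal sketch of this idea, the waypoint-following loop can be simulated in a few lines of Python. The step size, tolerance, and 2D point model are my own assumptions, not taken from any real navigation stack:

```python
import math

def follow_waypoints(position, waypoints, step=0.5, tolerance=0.1):
    """Move a simulated robot through a list of waypoints.

    Hypothetical sketch: the robot advances at most 'step' meters per
    tick towards the current waypoint until it is within 'tolerance'.
    """
    path = [position]
    x, y = position
    for wx, wy in waypoints:
        while math.dist((x, y), (wx, wy)) > tolerance:
            dx, dy = wx - x, wy - y
            d = math.hypot(dx, dy)
            move = min(step, d)          # don't overshoot the waypoint
            x += move * dx / d
            y += move * dy / d
            path.append((round(x, 2), round(y, 2)))
    return path

route = follow_waypoints((0.0, 0.0), [(1.0, 0.0), (1.0, 1.0)])
print(route[-1])  # (1.0, 1.0)
```

The key point is that the operator only supplies the waypoint list; the inner loop runs autonomously on the robot.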

The next logical step after waypoint navigation is "grounded language control". The human operator communicates with the robot in natural language and gives a command like "move ahead, then rotate left, then move ahead for 10 meters, then stop". This kind of language-based communication reduces the workload for the human operator further. On the other hand, it is a demanding task to implement such an interface in software.
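A first approximation of such an interface is a simple pattern-based parser. The toy grammar and action names below are my assumptions; real grounded language control is far harder than this sketch suggests:

```python
import re

def parse_command(utterance):
    """Split an instruction like 'move ahead, then rotate left, then stop'
    into a list of (action, distance_in_meters) tuples. Toy grammar."""
    actions = []
    for part in re.split(r",\s*then\s*|,\s*", utterance.lower()):
        part = part.strip()
        m = re.match(r"move ahead(?: for (\d+) meters?)?$", part)
        if m:
            actions.append(("move_ahead", int(m.group(1)) if m.group(1) else 1))
        elif part == "rotate left":
            actions.append(("rotate_left", 0))
        elif part == "stop":
            actions.append(("stop", 0))
    return actions

print(parse_command("move ahead, then rotate left, then move ahead for 10 meters, then stop"))
# [('move_ahead', 1), ('rotate_left', 0), ('move_ahead', 10), ('stop', 0)]
```

Such a regex parser only handles the exact phrasings it was written for; the hard part of the problem is the open-ended vocabulary of real speech.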

Language-based communication with robots is the answer to the teleoperation problem. It allows robots to be controlled remotely with a reduced mental workload. Language has a higher abstraction level than joystick control. This higher abstraction must be translated into low-level servo commands for the robot, a task known as "symbol grounding". Let me explain it from a different perspective.
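The translation step itself can be illustrated with a toy grounding table. All command names and velocity values below are hypothetical, not a real robot API:

```python
# Hypothetical grounding table: each language symbol maps to a
# low-level (linear velocity m/s, angular velocity rad/s, duration s) triple.
GROUNDING = {
    "move_ahead":  (0.5, 0.0, 2.0),
    "rotate_left": (0.0, 0.8, 1.0),
    "stop":        (0.0, 0.0, 0.0),
}

def ground(symbol):
    """Translate a symbolic command into servo-level values."""
    if symbol not in GROUNDING:
        raise ValueError(f"ungrounded symbol: {symbol}")
    return GROUNDING[symbol]

plan = [ground(s) for s in ["move_ahead", "rotate_left", "stop"]]
print(plan)
```

A fixed lookup table is of course the crudest possible grounding; the research problem is learning such a mapping for open vocabulary and perception.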

In classical joystick-based teleoperation there is no grounding problem. The robot doesn't know terms like obstacle, shelf, move_ahead, or stop. It understands only voltage signals transmitted from a remote control device. Such a robot can't parse natural language; it is a classical analog receiver. Of course, the human operator knows the words; he is aware that the robot enters a room and moves towards a shelf with a box. But this information is not relevant for the robot. It is enough to move the joystick forward to navigate the warehouse.

In contrast, language-based teleoperation requires that the robot understands natural language. The robot parses natural language commands and also gives feedback in English.

The first electric RC toy cars became available in the 1960s. To build and operate such a car, a certain amount of knowledge in mechanics and electronics is needed. What isn't required is linguistic knowledge, because an RC car is not an English dictionary. It is a technical machine working with a battery and analog circuits. It took many decades until more advanced language-controlled machines became available. One landmark project was the Ripley project at M.I.T. in 2003, followed by the voice-controlled forklift, also at M.I.T., in 2010. Since the advent of vision language models in 2023, humanoid robots can be controlled with natural language.

March 01, 2026

Timeline of the symbol grounding problem

The term "symbol grounding" itself was coined in 1990 by Stevan Harnad, but the subject was researched much earlier. From a very abstract perspective, "symbol grounding" describes the relationship between language and reality, so it basically asks "what is language?".

Before the advent of computers, symbol grounding was treated as a topic of linguistics and philosophy; for example, Aristotle asked, in his correspondence theory of truth, about the mapping from language to reality. Let me give an example: Suppose somebody says "The apple is located on the table". This sentence describes the physical properties of a food item in the kitchen. It communicates an observation to someone else who speaks the same natural language.

With the advent of the microcomputer in the 1980s, the "symbol grounding problem" was researched as part of artificial intelligence. The goal was to use computers to process language. Notable examples are the SHRDLU project (text to action) and the Abigail scene recognition project from 1994 (scene to text). The most advanced example available today is the Wayve Lingo-1 software for controlling a self-driving car. This software was designed as a neural network and can understand English in the context of car driving.

A closer look at the timeline shows that symbol grounding isn't a single theory or a single algorithm; rather, different approaches were initiated in different decades of research. What they share is the objective of understanding language. Language is important for human-to-human communication, but it is also important for human-to-machine communication. It seems that language is the "ghost in the machine" which allows a computer to think and make its own decisions.

The main difference between humans and machines is that machines can process language much faster. In the "Karel the robot" project from 1981, it is possible to submit dozens of commands per second to the parser, which translates the commands into actions in the simulated environment. Such fast processing can only be realized by a computer, not by human individuals. A human might understand and react to a command in the same way, but at a rate of roughly one command per five seconds, and sometimes slower.
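A Karel-style interpreter of this kind can be sketched in a few lines. The command names and the grid-world model below are my own simplification, not Pattis's original syntax:

```python
# Minimal Karel-style interpreter: text commands become actions in a grid world.
DIRECTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # north, east, south, west

class KarelRobot:
    def __init__(self):
        self.x, self.y = 0, 0
        self.heading = 1  # start facing east

    def execute(self, program):
        """Parse a ';'-separated program and run each command."""
        for command in program.split(";"):
            command = command.strip()
            if command == "move":
                dx, dy = DIRECTIONS[self.heading]
                self.x += dx
                self.y += dy
            elif command == "turnleft":
                self.heading = (self.heading - 1) % 4
            else:
                raise SyntaxError(f"unknown command: {command}")

robot = KarelRobot()
robot.execute("move; move; turnleft; move")
print(robot.x, robot.y)  # 2 1
```

A loop feeding this interpreter can easily issue thousands of commands per second, which is exactly the throughput gap over a human operator described above.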

Here is the entire timeline sorted by year:

3300 BC,Cuneiform writing system in Mesopotamia
1500 BC,sundial showing the time of the day
600 BC,Latin alphabet available in Italy
322 BC,correspondence theory of truth by Aristotle
1386,Salisbury Cathedral tower clock with a bell
1440,printing press by Johannes Gutenberg
1505,Pomander Watch by Peter Henlein
1792,optical telegraph by Claude Chappe
1844,morse code by Samuel Morse
1870,Engine Order Telegraph by William Chadburn
1876,commercial typewriter by Remington
1878,chronophotography "The Horse in Motion" by Eadweard Muybridge
1903,Telekino remote controlled boat by Leonardo Torres Quevedo
1915,Therblig notation by Frank Gilbreth
1915,rotoscoping animation technique by Max Fleischer 
1920,AAC Communication board by F. Hall Roe
1928,Labanotation dance notation by Rudolf von Laban
1930,motion tracking by Nikolai Bernstein
1949,Turing test by Alan Turing
1959,Pandemonium architecture by Oliver Selfridge
1962,ANIMAC motion capture by Lee Harrison III
1963,ASCII code
1966,ELIZA chatbot by Joseph Weizenbaum
1968,SHRDLU natural language understanding by Terry Winograd
1971,Lexigram for communicating with apes by Ernst von Glasersfeld
1977,Zork I text adventure by Tim Anderson
1977,Tour model instruction following by Benjamin Kuipers MIT AI lab
1980,Chinese room argument by John Searle
1980,Commentator scene description by Bengt Sigurd
1980,Finite State machine in Pacman videogame by Tōru Iwatani
1981,Karel the robot programming language by Richard Pattis
1983,MIDI music protocol
1983,M.I.T. Graphical Marionette by Delle Maxwell
1984,Castle Adventure by Kevin Bales
1987,Maniac Mansion point&click adventure by Ron Gilbert
1987,Vitra visual translator by Wolfgang Wahlster
1990,Physical Grounding Hypothesis by Rodney Brooks
1990,paper "The symbol grounding problem" by Stevan Harnad
1993,AnimNL computer animation by Norman Badler
1993,conceptual spaces by Peter Gardenfors
1994,Abigail scene recognition by Jeffrey Siskind
1998,Rocco Robocup commentator by Dirk Voelz
1999,trec-8 Text REtrieval Conference
2003,M.I.T. Ripley robot by Deb Roy
2006,Marco route instruction following by Matt MacMahon
2007,Simbicon computer animation by Michiel van de Panne
2010,Motion grammar by Mike Stilman
2010,M.I.T. forklift by Stefanie Tellex
2011,IBM Watson Question answering by David Ferrucci
2013,Word2vec algorithm by Tomas Mikolov
2015,Poeticon++ trajectory recognition by Yiannis Aloimonos
2015,DAQUAR VQA dataset by Mateusz Malinowski
2020,Vision language model by different authors
2023,Wayve Lingo-1 self driving car

Perhaps it makes sense to focus on language itself. Language in its core meaning is natural language like English or French. It was invented long ago as a tool, comparable to a hammer or the steam engine, but not as a physical device: language acts as a mental tool. Languages are very old innovations; for example, the alphabet with 26 characters from A to Z has been known for over 2600 years.

The new thing known as the symbol grounding problem is a more technological perspective on language. Instead of only learning a language, which means memorizing the vocabulary, the task is to understand what the purpose of English is. Or, more specifically, how language allows humans to think. To this day, this question remains an unsolved problem. There are some signs that language is processed by the brain, and it is known that artificial neural networks simulated by a computer can imitate this behavior. This allows machines to be used to parse natural language, including its mapping to reality.