March 02, 2026

Teleoperation with joystick and natural language

In the past, teleoperation was realized with a joystick. The human operator navigates a robot by pushing the joystick forward and backward. This allows precise movement, and the robot can carry out very complex tasks. The same principle applies to construction cranes and joystick-controlled UAVs.

Even though joystick-based teleoperation works well, there is a bottleneck: a human operator is needed all the time. A single human can control a single robot; controlling two UAVs at the same time with a single operator is difficult or even impossible. From a technical perspective, a drone can receive signals at a higher frequency; the problem is that the human operator can't generate the signals fast enough. To address this bottleneck, a different sort of teleoperation is needed, located at a higher level of abstraction.

A slight improvement over joystick-based teleop is waypoint navigation. The human operator selects waypoints on a map and the robot moves along the resulting trajectory. This reduces the operator's workload: if the robot knows the next waypoint, it is able to navigate to the target by itself.
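The waypoint-following idea can be sketched in a few lines of Python. This is a minimal geometric illustration, not the control loop of any particular robot; the step size and the snap-to-waypoint tolerance are arbitrary assumptions:

```python
import math

def step_towards(pos, waypoint, step=0.5):
    """Move `pos` one step towards `waypoint`; return the new position."""
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist <= step:          # close enough: snap to the waypoint
        return waypoint
    return (pos[0] + step * dx / dist, pos[1] + step * dy / dist)

def follow(pos, waypoints, step=0.5):
    """Visit all waypoints in order; return the whole trajectory."""
    trajectory = [pos]
    for wp in waypoints:
        while trajectory[-1] != wp:
            trajectory.append(step_towards(trajectory[-1], wp, step))
    return trajectory
```

Once the operator has clicked the waypoint list together on the map, the robot only needs this loop to reach the target on its own.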

The next logical step after waypoint navigation is "grounded language control". The human operator communicates with the robot in natural language and gives a command like "move ahead, then rotate left, then move ahead for 10 meters, then stop". This kind of language-based communication reduces the workload for the human operator further. On the other hand, it is a demanding task to program such an interface in software.
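A minimal sketch of such an interface could map chained command phrases to primitive actions. The vocabulary and the action names below are invented for illustration and would look different in a real system:

```python
def parse_command(sentence):
    """Split a chained natural language command into primitive actions.

    Returns a list of (action, argument) tuples, where the argument
    is a distance in meters or None. Hypothetical mini-vocabulary:
    "move", "rotate", "stop".
    """
    actions = []
    for phrase in sentence.lower().split(","):
        words = phrase.replace("then", "").split()
        if not words:
            continue
        if words[0] == "move":
            # e.g. "move ahead for 10 meter" -> ("move_ahead", 10.0)
            numbers = [w for w in words if w.replace(".", "").isdigit()]
            dist = float(numbers[0]) if numbers else None
            actions.append(("move_ahead", dist))
        elif words[0] == "rotate":
            actions.append(("rotate_" + words[1], None))
        elif words[0] == "stop":
            actions.append(("stop", None))
    return actions
```

Even this toy parser shows why the task is demanding: every new verb, unit or sentence structure needs explicit handling, which is exactly the gap that modern language models try to close.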

Language-based communication with robots is the answer to the teleoperation problem. It allows robots to be controlled remotely with a reduced mental workload. Language has a higher abstraction level than joystick control. This higher abstraction level must be translated into low-level servo commands for the robot, which is known as "symbol grounding". Let me explain it from a different perspective.

In classical joystick-based teleoperation there is no grounding problem. The robot doesn't know terms like obstacle, shelf, move_ahead or stop. The robot understands only voltage signals transmitted from a remote control device. Such a robot can't parse natural language; it is a classical analog receiver. Of course, the human operator knows the words; he is aware that the robot enters a room and moves towards a shelf with a box. But this information is not relevant for the robot. It is enough to move the joystick forward to navigate in a warehouse.

In contrast, language-based teleoperation requires that the robot understands natural language. The robot parses natural language commands and also gives feedback in English.

The first electric RC toy cars became available in the 1960s. To build and operate such a car, a certain amount of knowledge in mechanics and electronics is needed. What isn't required is linguistic knowledge, because an RC car is not an English dictionary. It is a technical machine working with a battery and analog circuits. It took many decades until more advanced language-controlled machines became available. One landmark project was the Ripley project at M.I.T. in 2003, as was the voice-controlled forklift from the same M.I.T. in 2010. Since the advent of vision language models in 2023, humanoid robots can be controlled with natural language.

March 01, 2026

Timeline of the symbol grounding problem

The term "symbol grounding" itself was coined in 1990 by Stevan Harnad, but the subject was researched much earlier. From a very abstract perspective, "symbol grounding" describes the relationship between language and reality, so it basically asks "what is language?".

Before the advent of computers, symbol grounding was treated as linguistics and philosophy; for example, Aristotle asked in his correspondence theory of truth about the mapping from language to reality. Let me give an example: Suppose somebody says "The apple is located on the table". This sentence describes the physical properties of a food item in the kitchen. It communicates an observation to someone else who speaks the same natural language.

With the advent of the microcomputer in the 1980s, the "symbol grounding problem" was researched as part of artificial intelligence. The goal was to use computers to process language. Notable examples are the SHRDLU project (text to action) and the Abigail scene recognition project from 1994 (scene to text). The most advanced example available today is the Wayve Lingo-1 software for controlling a self-driving car. This software was designed as a neural network and can understand English in the context of car driving.

A closer look at the timeline will show that symbol grounding isn't a single theory or a single algorithm; rather, there are different approaches, initiated in different decades of research. The shared similarity is the objective to understand language. Language is important for human-to-human communication, but it is also important for human-to-machine communication. It seems that language is the "ghost in the machine" which allows a computer to think and make its own decisions.

The main difference between humans and machines is that machines can process language much faster. In the "Karel the Robot" project from 1981, it is possible to submit dozens of commands per second to the parser, which translates the commands into actions in the simulated environment. This kind of fast processing can only be realized by a computer, not by human individuals. A human might understand and react to a command in the same way, but at a rate of 1 command per 5 seconds and sometimes slower.

Here is the entire timeline sorted by year:

3300 BC,Cuneiform writing system in Mesopotamia
1500 BC,sundial showing the time of the day
600 BC,Latin alphabet available in Italy
322 BC,correspondence theory of truth by Aristotle
1386,Salisbury Cathedral tower clock with a bell
1440,printing press by Johannes Gutenberg
1505,Pomander Watch by Peter Henlein
1792,optical telegraph by Claude Chappe
1844,morse code by Samuel Morse
1870,Engine Order Telegraph by William Chadburn
1876,commercial typewriter by Remington
1878,chronophotography "The Horse in Motion" by Eadweard Muybridge
1903,Telekino remote controlled boat by Leonardo Torres Quevedo
1915,Therblig notation by Frank Gilbreth
1915,rotoscoping animation technique by Max Fleischer 
1920,AAC Communication board by F. Hall Roe
1928,Labanotation dance notation by Rudolf von Laban
1930,motion tracking by Nikolai Bernstein
1949,Turing test by Alan Turing
1959,Pandemonium architecture by Oliver Selfridge
1962,ANIMAC motion capture by Lee Harrison III
1963,ASCII code
1966,ELIZA chatbot by Joseph Weizenbaum
1968,SHRDLU natural language understanding by Terry Winograd
1971,Lexigram for communicating with apes by Ernst von Glasersfeld
1977,Zork I text adventure by Tim Anderson
1977,Tour model instruction following by Benjamin Kuipers MIT AI lab
1980,Chinese room argument by John Searle
1980,Commentator scene description by Bengt Sigurd
1980,Finite State machine in Pacman videogame by Tōru Iwatani
1981,Karel the robot programming language by Richard Pattis
1983,MIDI music protocol
1983,M.I.T. Graphical Marionette by Delle Maxwell
1984,Castle Adventure by Kevin Bales
1987,Maniac Mansion point&click adventure by Ron Gilbert
1987,Vitra visual translator by Wolfgang Wahlster
1990,Physical Grounding Hypothesis by Rodney Brooks
1990,paper "The symbol grounding problem" by Stevan Harnad
1993,AnimNL computeranimation by Norman Badler
1993,conceptual spaces by Peter Gardenfors
1994,Abigail scene recognition by Jeffrey Siskind
1998,Rocco Robocup commentator by Dirk Voelz
1999,TREC-8 Text REtrieval Conference
2003,M.I.T. Ripley robot by Deb Roy
2006,Marco route instruction following by Matt MacMahon
2007,Simbicon computer animation by Michiel van de Panne
2010,Motion grammar by Mike Stilman
2010,M.I.T. forklift by Stefanie Tellex
2011,IBM Watson Question answering by David Ferrucci
2013,Word2vec algorithm by Tomas Mikolov
2015,Poeticon++ trajectory recognition by Yiannis Aloimonos
2015,DAQUAR VQA dataset by Mateusz Malinowski
2020,Vision language model by different authors
2023,Wayve Lingo-1 self driving car

Perhaps it makes sense to focus on language itself. Language in its core meaning is natural language like English or French. It was invented long ago as a tool, similar to a hammer or the steam engine, but not as a physical device: language acts as a mental tool. Languages are very old innovations; for example, the alphabet with 26 characters from A to Z has been known for over 2600 years.

What is new about the symbol grounding problem is a more technological perspective on language. Instead of only learning a language, which means memorizing the vocabulary, the task is to understand what the purpose of English is. Or to be more specific, how language allows humans to think. This question is to date an unsolved problem. There are some signs that language is processed by the brain; it is also known that artificial neural networks simulated by a computer can imitate this behavior. This allows machines to be used to parse natural language, including its mapping to reality.

February 28, 2026

Writing an academic term paper with the help of large language models on the topic of Halle 54 and automation in the 1980s

__Introduction__

Large language models such as ChatGPT and Google Gemini are known to support smaller research tasks and are technically able to take over the spell checking of an academic paper. It was unclear, however, whether large language models can also write a complete term paper. Such a task usually requires a human effort of one month or more and therefore lies outside the capabilities of today's AI systems. At least this is what the https://metr.org/ benchmark claims. According to it, the currently most capable neural networks can carry out programming tasks that take humans around 10 hours, e.g. implementing a network protocol.

If longer, complex tasks are to be handled with the help of LLMs, a special reward function, a multi-agent system or similar tools are needed, because otherwise there is a danger that the AI gets caught in an endless loop, i.e. it edits already created source code or existing texts over and over without any recognizable progress.

In the following case a different concept was used, known as the Luhmann Zettelkasten method. This method is used in the humanities to organize a term paper, and it also helps to structure the interaction with a large language model.

As the topic of the term paper, "Halle 54 automation in the 1980s" was chosen, because it is easy to delimit and, with some literature research, can easily be turned into an academic text. First, a prompt is needed to describe the problem to an LLM:

__Prompt__

title: Halle 54 at VW as a failed automation project in the 1980s

Task: Create 8 Luhmann index cards for this title. Each index card contains a Luhmann ID, a title, and bullet-point notes, which may well be chaotic. Make sure that further index cards can be appended later. Output language is German.

Content: Around the year 1983, the car manufacturer VW ran a robotics automation project in Halle 54. At that time, computer-controlled robots were used to pursue the goal of fully automating vehicle production. Later it turned out that the intended high degree of automation was technically not feasible. The hardware and software of the time fell short of the high expectations.
-----
Both; create a total of 8 further index cards.
Yes, and create further cards about the software used in the Halle 54 project (if information about it exists).
Create a structure note for the existing index cards as an outline for an academic term paper.
No, start instead with writing the full text for the chapter "1. Introduction: The dream of the workerless factory" based on the existing index cards. The full text should contain around 800 words.
-----


As requested in the prompt, the AI first created index cards, 24 of them. Then a structure note was created, i.e. an index card that refers to other index cards. These index cards were then turned into a continuous text, which is printed here in full.

Scattered throughout the text are references to the Luhmann index cards, e.g. "(ID 3.5)". The text is therefore just the written-out form of the existing notes. Via the intermediate step of "index cards" it is possible to cover even very extensive topics.

__Critique__

For the present experiment, only 24 index cards plus 1 structure note were created by an LLM. For a real academic term paper, more index cards are needed, roughly 100 or more.

__Full text__

halle54.pdf

Scene annotation with tags

The picture shows a very simple landscape with a house, a lake and a sun. The problem for a robot is that the information shown in the picture can't be parsed. The reason is that the picture is technically an 800x600 bitmap file and doesn't provide semantic meaning.

What a robot needs to understand a picture is a sequence of tags. For the concrete picture the tags would be: [sun], [house], [tree], [lake]. In a more elaborate setup the tags would be enhanced with additional information about the position and the color, e.g.
- sun: top right, yellow
- house: center, red
- tree: left, green
- lake: bottom left, blue

This information can be stored in a database, for example a JSON file. Such information can be parsed by a robot. So the missing part is a scene-to-tag converter, which is the core element of a symbol grounding system.
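As a sketch, the annotated scene above could be serialized as JSON and queried by the robot. The schema (keys `tag`, `position`, `color`) is an assumption made for illustration, not a standard format:

```python
import json

# Hypothetical scene annotation; the schema is an illustrative assumption.
scene_json = """
[
  {"tag": "sun",   "position": "top right",   "color": "yellow"},
  {"tag": "house", "position": "center",      "color": "red"},
  {"tag": "tree",  "position": "left",        "color": "green"},
  {"tag": "lake",  "position": "bottom left", "color": "blue"}
]
"""

def find_tag(scene, tag):
    """Return the annotation record for a given tag, or None."""
    for record in scene:
        if record["tag"] == tag:
            return record
    return None

scene = json.loads(scene_json)
```

The point is that once a scene-to-tag converter has produced this file, questions like "where is the lake?" become simple lookups instead of pixel analysis.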

February 25, 2026

The technology of the Stanley self-driving car in 2005

In addition to the previous blog post, which describes the DARPA challenge in 2007, the technology of the competition held in 2005 should be described next. All the information is taken from the document [1].

First it should be mentioned that the winning car Stanley was the most advanced robot car at that time. It was designed by a team of experienced engineers with a university background and proved its superiority over alternative concepts in an official competition. What makes the situation interesting from today's perspective (the year 2026) is that the entire technology stack can be called outdated. Stanley, including all the mentioned hardware and software, only has value for a museum and is very different from the technology used today.

Even for the timeline of computer science, which is known for its rapid development, such a fast aging process is a surprise. The usual assumption is that at least some of the technology remains valid, in modified form, for later robotics projects. But this was not the case. It seems that artificial intelligence has drastically reinvented itself since 2005. But let us take a closer look at the year 2005.

The physical hardware of the robot was a Volkswagen Touareg R5 TDI with a diesel engine [1, page 2]. The engine was powered by ordinary fuel from a gas station. The track was provided as a waypoint text file in the RDDF format [1, page 3]. The vehicle was equipped with multiple sensors such as SICK lasers, GPS, camera and compass [1, page 4]. The computers were located in the trunk of the car: six machines running the Linux operating system, built around the Intel Pentium M, a CPU also used in laptops [1, page 5].

The software stack was divided into 30 modules for sensory perception, path planning, logging and steering. The perhaps most advanced module was the self-localization module, a probabilistic state estimator in the Kalman filter family [1, page 8]. Multiple incoming sensor streams were fused with probabilistic estimation. The vision module was responsible for detecting the drivable area in the map [1, page 12]. Handcrafted computer vision algorithms were utilized. The steering controller was realized as a PID control equation [1, page 24].
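The PID steering idea can be sketched as follows. This is a generic textbook PID loop, not the actual gains or code from the Stanley project; the error would be the lateral cross-track distance to the planned trajectory and the output a steering angle:

```python
class PID:
    """Generic PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        """Advance the controller by one time step of length dt."""
        self.integral += error * dt
        derivative = 0.0
        if self.prev_error is not None:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

In the car, this update would run inside the fast sensory loop mentioned above, once per control cycle.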

In summary, the technology used in the Stanley self-driving car was a classical combination of a diesel vehicle, a computer cluster in the trunk and a large amount of software implementing algorithms for vision and steering. In other words, existing and well-known software engineering principles were adapted to robotics development. The idea was that a self-driving car is some sort of open source software project with additional mathematical algorithms for road navigation. Typical problems during the project were:

- how to connect the computer in the trunk with the CAN bus of the car
- how to write all the software modules
- how to make the sensory loop fast enough with C/C++ code

The same principle was used 2 years later during the 2007 DARPA Urban Challenge, and it was applied by all of the teams during this time.

Software engineering has a name for such projects: rapid prototyping, better known as throwaway software. It is a software system that is written in a short amount of time and has a limited lifespan. None of the hardware and software developed for Stanley was reused in later projects. In other words, even though Stanley won the challenge, the technology was obsolete a few weeks after the race was over.

sources:  

[1] Thrun, Sebastian, et al. "Stanley: The robot that won the DARPA Grand Challenge." Journal of field Robotics 23.9 (2006): 661-692.

February 24, 2026

Darpa urban challenge 2007 -- the last great robot project

Before the year 2010, artificial intelligence was mostly a niche discipline within computer science without any impact on society. The reason was that most of the projects were in an early stage and lots of technical obstacles were visible. Because of this limitation, it is interesting from a science history perspective to take a closer look at the self-understanding of AI in the past.

The goal of the DARPA Urban Challenge 2007 was to program self-driving cars for an urban environment. These normal-size cars were able to stop at a junction and do some parking maneuvers. From the large amount of documentation and some of the talks given by the teams, it is possible to extract some general principles of how the cars were realized. Building self-driving cars was at that time recognized as a hardware and software challenge. One problem was to squeeze high-performance server racks into the car's trunk. A second and more serious problem was to write all the software.

One team wrote software with 100k lines of code; the next one even created software with 500k lines of code. The idea in 2007 was to treat self-driving car software as a large-scale software project similar to the Linux kernel. Therefore the existing toolchain was used, namely a C/C++ compiler and modern version control systems including bug trackers. The logic of the car was encoded in an endless amount of path planning algorithms, C++ classes and dedicated particle filters for self-localization.

It should be mentioned that the outcome of these large-scale projects was poor. Despite the fact that teams of experienced programmers had written all the code, the resulting autonomous cars struggled to navigate on the street. Simple tasks like waypoint following were successfully demonstrated, but more complex problems like a road block or an unexpected situation overwhelmed the cars' AI software.

On the one hand, the self-driving cars shown were more powerful than every past attempt to build such vehicles. At the same time, it was obvious that these cars were not ready for real-world traffic. One disappointing detail was that all the written C/C++ software only worked for the original car and couldn't be adapted to other cars or other sorts of robotic vehicles. In technical terms, the software didn't scale to slightly different problems, which is a sign of bad software design.

In 2007 it was unclear how to write better software which fits the needs of self-driving cars. The reason was a certain bias about the project, which was: a) the car is designed as an autonomous vehicle, b) the decision-making process is implemented in software and planning algorithms. So it was an autonomous computational vehicle, which from a modern AI perspective is a dead end. In 2007 nobody was able to see the limitations of these constraints; it was simply imagined that AI had to be realized this way.

Let us take a step back and describe the motivation for the DARPA Urban Challenge. The self-understanding about robotics in 2007 was that robotics is a hardware and software problem located within computer science. The first goal was to make sure that the hardware of a self-driving car is working, which means that the lidar rotates fast enough and the powerful server built into the car gets enough electricity. The second goal was to program the software, which means to utilize C/C++ and implement powerful algorithms in a robot control system. The hope was that the combination of hardware and software would enable a robot car to make its own decisions.

February 23, 2026

Symbol grounding with a DIKW pyramid


A possible model to explain the symbol grounding problem is the DIKW pyramid. Grounding means translating a higher layer in the pyramid into a lower layer. The layers represent the same reality in different formats. The perhaps most important transition is from the numerical data layer into the labeled data layer. For a warehouse robot, a GPS sensor reading like (40,10) gets translated into [roomB]. So the low-level sensor data gets annotated with a tag.

The next layer is the knowledge graph, which encodes the tagging information into a semantic network. The relations between the tags are explained, synonyms are introduced, and the information is stored in a JSON file. If all the layers in the DIKW pyramid are established, and if an automatic parser can translate upward and downward in the pyramid, it is possible to communicate with a robot in natural language. A voice command like "go to roomB and bring me the yellow box" is understood by the robot and executed in reality.
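The data-to-label transition can be sketched as a lookup from raw sensor coordinates to room tags. The room boundaries below are invented for illustration; a real warehouse robot would get them from its map:

```python
# Hypothetical room map: each room is an axis-aligned rectangle
# given as (xmin, ymin, xmax, ymax). The coordinates are invented.
ROOMS = {
    "roomA": (0, 0, 30, 50),
    "roomB": (30, 0, 60, 50),
}

def ground_position(x, y):
    """Translate a raw sensor reading into a symbolic room tag."""
    for tag, (xmin, ymin, xmax, ymax) in ROOMS.items():
        if xmin <= x < xmax and ymin <= y < ymax:
            return tag
    return "unknown"
```

With this mapping, the sensor reading (40,10) from the example above becomes the tag "roomB", which the upper layers of the pyramid can reason about.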

February 22, 2026

Minsky frames as communication tool

 

A Minsky frame is a list of key/value pairs shown as a text overlay in a GUI window. It is not intended as an internal data structure within a robot, but as a GUI gadget which displays information about the game on the screen.

For the example of an intersection simulator, the Minsky frame was realized with the pygame command:

screen.blit(txt, (35, y))

which draws a text string on the screen, for example the information "exit_target: WEST". A Minsky frame is some sort of form which determines which aspects of reality are important. The computer determines the value for each item and shows the result on the screen. This helps to solve the symbol grounding problem because the shown text overlay translates the data layer of the DIKW pyramid into the information layer of the DIKW pyramid.
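A minimal sketch of such a frame is a dictionary whose slots are filled by the simulator and then rendered line by line; in pygame, each line would be turned into a surface with `font.render` and drawn with `screen.blit`. The slot names here are illustrative assumptions, not taken from the actual simulator:

```python
# Hypothetical Minsky frame for an intersection simulator.
# Each slot names one aspect of the scene the overlay reports.
frame = {
    "vehicle": "car1",
    "position": "north lane",
    "speed": "slow",
    "exit_target": "WEST",
}

def render_lines(frame):
    """Format the frame as text lines for a screen overlay.

    With pygame, each line would be drawn via
    txt = font.render(line, True, color); screen.blit(txt, (35, y)).
    """
    return [f"{key}: {value}" for key, value in frame.items()]
```

The frame thus acts as a fixed form: the slots decide which aspects of the scene matter, and the simulator only fills in the values each tick.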