Since 2023, large language models (LLMs) have been available, which are a sort of advanced chatbot. An LLM can answer questions, write computer code and paint an image. Even if these systems look powerful, there is a much more advanced technology that has not been widely released yet: the VLA model.
VLA stands for vision-language-action model. It handles text in combination with robotic actions, which is needed to control both biped robots and drones. The user interface looks similar to that of an LLM because there is a text box in which the user enters a prompt. The difference is that the AI software converts the prompt into action. Example prompts are "walk in a circle" or "bring me the red ball".
Similar to an LLM, a VLA model works with natural language. The AI won't do anything on its own; it is a text-based interaction between human and machine. The innovation is that the output of the AI isn't restricted to a text window on the monitor: the AI has access to servo motors in the real world or can control characters in a video game. This kind of AI exists in research prototypes and has been described in academic papers, but it is not available as a commercial product for everyone.
Current LLMs can already simulate parts of this behavior. It is possible to upload a JPEG image and the AI describes the picture in words. Such picture-to-text annotation seems a bit useless on its own, because a human can see what is shown in the picture, so the feature is seldom used in practice. Annotating pictures only makes sense in combination with the actuator control of a robot, because the robot needs to transform the camera signal into text and then make decisions in response to that information.
Robotics and Artificial Intelligence
June 12, 2026
VLA models -- the upcoming revolution in AI
AI the big picture
AI isn't new; it has been researched for decades by many researchers. They have investigated an endless number of theories and algorithms for different subjects. To get a better picture of what the AI community has researched in the past, the working thesis is that there was a transition from closed systems in the past to open systems in the present. This working thesis is explained briefly in the following paragraphs.
A closed system is the natural understanding in computing. It assumes that software runs on a computer and that the programmer has to write down the source code, including the algorithm. A typical example is a model predictive control algorithm, which uses a physics engine to predict future states, or a path planning algorithm like RRT, which searches for a feasible path to the goal. These approaches follow the classical computer science paradigm, which works with the same technique.
The idea of a closed AI system is to grasp reality in mathematical terms and to write a computer program which solves a mathematical optimization problem. This approach was common in AI history until the 1990s. The only debate was about which algorithm was preferred, for example a neural network or an alpha-beta pruning algorithm.
It should be mentioned that closed systems are not powerful enough to tackle advanced problems. Especially in the domain of robot control, the paradigm fails because of the state space explosion. There is no algorithm available which can efficiently search the millions of joint configurations of a biped robot. That is why some pessimistic AI researchers in the past assumed that it is not possible to solve NP-hard problems in AI.
A more powerful paradigm is the open system. Early examples are motion capture systems from the 1980s, which record the position of markers in real time. Such a system is open because it captures data from the environment, here mocap data. Other early examples are text adventures like Zork I, which also put a high priority on human-to-machine interaction. Modern open systems developed after the year 2000 use advanced interfaces based on text and sensory data. These systems are open because the input sent to the computer is the most important information. A human operator might say "Move north and grasp the blue box", or another operator might demonstrate a walking pattern in a motion capture suit and the robot has to repeat the trajectory. In open systems, human-to-machine interaction is the center of attention. Possible technologies, such as a certain algorithm, a certain neural network or a database, are grouped around this principle. For example, a neural network might be used to detect the mocap markers, while a SQL database stores the real-time data, and a rendering algorithm then queries the database and paints the human pose on the screen.
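The database part of such a mocap pipeline can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the table layout, the marker names and the helper functions `open_store`, `store_frame` and `latest_pose` are hypothetical, and the marker detection and the rendering step are left out.

```python
import sqlite3

def open_store():
    """Create an in-memory SQL database for marker positions (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE markers (frame INTEGER, name TEXT, x REAL, y REAL)")
    return db

def store_frame(db, frame, markers):
    """Store one captured frame; markers maps a marker name to its (x, y) position."""
    db.executemany(
        "INSERT INTO markers VALUES (?, ?, ?, ?)",
        [(frame, name, x, y) for name, (x, y) in markers.items()],
    )

def latest_pose(db):
    """Fetch the most recent frame, as a renderer would do before drawing the pose."""
    rows = db.execute(
        "SELECT name, x, y FROM markers "
        "WHERE frame = (SELECT MAX(frame) FROM markers) ORDER BY name"
    ).fetchall()
    return {name: (x, y) for name, x, y in rows}

db = open_store()
store_frame(db, 0, {"head": (0.0, 1.7)})
store_frame(db, 1, {"head": (0.1, 1.7), "hand": (0.4, 1.0)})
print(latest_pose(db))
```

The code itself contains no planning or search; it only moves captured data from the sensor side to the display side.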
From a technical perspective, these algorithms are trivial and most of them were available before the 1990s. The innovation is the context in which they are used, which is human-to-machine interaction. The existing software libraries are not used to build closed systems, e.g. a genetic algorithm which tries to improve itself; they are used to parse textual input or to annotate sensor data with textual tags.
June 10, 2026
Matching game in python
The font name needs to be adjusted to the operating system; otherwise only a question mark is shown in the window.
import pygame
import sys
import time
# Initialize pygame
pygame.init()
# Window size
WIDTH, HEIGHT = 640, 480
screen = pygame.display.set_mode((WIDTH, HEIGHT))
pygame.display.set_caption("Emoji-Text-Matching")
# Colors
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
BLUE = (0, 0, 255)
# Fonts (with Unicode support)
# font_large = pygame.font.SysFont("Segoe UI Emoji", 120) # emoji font on Windows
font_large = pygame.font.SysFont("Noto Color Emoji", 150) # emoji font on Linux
font_small = pygame.font.SysFont("Arial", 30) # font for the text labels
# Emoji-text pairs (German names of animals and objects)
pairs = [
("🐶", "Hund"),
("🐱", "Katze"),
("🐭", "Maus"),
("🐹", "Hamster"),
("🐰", "Hase"),
("🦊", "Fuchs"),
("🐻", "Bär"),
("🐼", "Panda"),
("🐨", "Koala"),
("🐯", "Tiger"),
("🦁", "Löwe"),
("🐮", "Kuh"),
("🐷", "Schwein"),
("🐸", "Frosch"),
("🐵", "Affe"),
("🐒", "Affe2"),
("🐺", "Wolf"),
("🐗", "Wildschwein"),
("🐝", "Biene"),
("🐛", "Raupe"),
("🔪", "Messer"),
("🔦", "Taschenlampe"),
]
# Position of emoji and text (centered)
emoji_x, emoji_y = WIDTH // 2, HEIGHT // 3
text_x, text_y = WIDTH // 2, emoji_y + 150
# Main game loop
def main():
    clock = pygame.time.Clock()
    running = True
    current_pair_index = 0
    last_switch = time.monotonic()  # moment at which the current pair appeared
    # Create the font for the closing message once, not on every frame
    font_done = pygame.font.SysFont("Arial", 40)
    while running:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False
        # Background
        screen.fill(WHITE)
        # Show the current pair
        if current_pair_index < len(pairs):
            emoji, text = pairs[current_pair_index]
            # Draw the emoji in a large font
            emoji_surface = font_large.render(emoji, True, BLACK)
            emoji_rect = emoji_surface.get_rect(center=(emoji_x, emoji_y))
            screen.blit(emoji_surface, emoji_rect)
            # Draw the word below the emoji
            text_surface = font_small.render(text, True, BLUE)
            text_rect = text_surface.get_rect(center=(text_x, text_y))
            screen.blit(text_surface, text_rect)
            # Switch to the next pair after 1 second without blocking the event loop
            if time.monotonic() - last_switch >= 1.0:
                current_pair_index += 1
                last_switch = time.monotonic()
        else:
            # All pairs have been shown: display a closing message
            done_text = font_done.render("Alle Paare gezeigt!", True, BLACK)
            done_rect = done_text.get_rect(center=(WIDTH // 2, HEIGHT // 2))
            screen.blit(done_text, done_rect)
        # Update the display
        pygame.display.flip()
        clock.tick(30)
    pygame.quit()
    sys.exit()

if __name__ == "__main__":
    main()
June 07, 2026
What is Artificial Intelligence?
Contrary to a famous myth, there is an answer to this question, because researchers have investigated the subject for decades. The most famous and easiest to understand introduction to the subject is a computer chess player. The computer is able to decide on the next move on the board, and a modern chess program can beat even a grandmaster.
At the same time, computer chess shows what current Artificial Intelligence can't provide yet. There is a difference between a program like gnuchess and a robot: gnuchess is only able to play chess, while a robot has to perform more complex tasks. AI research since the 1980s has been devoted to the goal of improving the skills of a computer.
A promising approach is a reward function based on grounded language. In contrast to the fixed reward function used in computer chess, a parametric reward function based on natural language can be modified on the fly. This allows a computer to understand instructions like "move to the blue box and grasp it". The command is translated into a reward signal and the computer plans a trajectory which maximizes the reward.
Let us compare computer chess with instruction following in robotics. Computer chess is based on a single fixed evaluation function which converts the current board into a reward signal, e.g. 0.4. This numerical information is used by the alpha-beta pruning algorithm to find the optimal action. The planner traverses the game tree up to 10 steps into the future and decides on the action which maximizes the reward, which ultimately means winning the game.
In contrast, instruction following in robotics offloads the reward signal to a speaker located outside of the robot. With each command, the speaker determines what the current subgoal in the game is. Possible commands are:
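The interplay of a fixed evaluation function and alpha-beta pruning can be sketched on a toy game tree. This is a simplified illustration and not a chess engine: the tree `TREE`, its leaf values and the helper `best_move` are made up, and the depth limit of a real planner is omitted because the toy tree is only two moves deep.

```python
import math

# Toy game tree: inner nodes are lists, leaves are the numbers which a fixed
# evaluation function would assign to the resulting board (made-up values).
TREE = [[0.4, 0.1], [0.3, 0.9], [-0.2, 0.5]]

def alphabeta(node, alpha, beta, maximizing):
    """Return the minimax value of node, skipping branches that cannot matter."""
    if not isinstance(node, list):
        return node  # leaf: value of the evaluation function
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the opponent will never allow this branch
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return value

def best_move(tree):
    """Pick the root move whose subtree has the highest guaranteed value."""
    scores = [alphabeta(child, -math.inf, math.inf, False) for child in tree]
    return scores.index(max(scores)), max(scores)

print(best_move(TREE))
```

The evaluation function is fixed at programming time. There is no way to tell this planner during the game that the goal has changed.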
1. "if the battery is empty search for the charging station"
2. "grasp the red box"
3. "bring the red box into room C".
In contrast to the game of chess, which has a single goal that remains the same, a warehouse robot can have multiple goals which are activated in sequence. The AI makes sure that the robot understands a goal in a mathematical sense. Understanding means that the robot determines the numerical reward for a textual command. For example, if the goal is "grasp the red box", the robot receives a reward if the gripper moves towards the box and another reward for closing the gripper around the box.
The problem for programmers and AI engineers is to encode the reward function, including the natural language parser, in software. A robot which understands a few dozen commands comes close to the goal of building an intelligent machine.
The purpose of a command-based reward function is to transform a closed system into an open system. Open means that the robot communicates with its environment. This is needed because the robot itself has insufficient knowledge about the task, while the human operator has much more knowledge. It therefore makes sense to offload the planning task to the human operator.
In the chess-playing AI systems of the past, which had a fixed evaluation function, it was not possible to interact with the system during runtime. The only way to modify the reward was to stop the program, modify the source code and restart the software.
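A command-based reward function of this kind might look as follows. It is a minimal sketch under stated assumptions: the state layout, the distance threshold of 0.05 and the function names `grasp_reward` and `reward_for` are hypothetical, and only commands of the form "grasp the ..." are understood.

```python
import math

def grasp_reward(state, target):
    """Reward for approaching the target, plus a bonus for closing the gripper on it."""
    distance = math.dist(state["gripper_pos"], state["objects"][target])
    reward = 1.0 / (1.0 + distance)  # grows as the gripper moves towards the box
    if state["gripper_closed"] and distance < 0.05:
        reward += 1.0  # second reward: gripper closed around the box
    return reward

def reward_for(command, state):
    """Translate a textual command into a numerical reward (tiny assumed vocabulary)."""
    prefix = "grasp the "
    if command.startswith(prefix):
        return grasp_reward(state, command[len(prefix):])
    raise ValueError(f"unknown command: {command!r}")

far = {"gripper_pos": (0.0, 0.0), "gripper_closed": False,
       "objects": {"red box": (1.0, 0.0)}}
done = {"gripper_pos": (1.0, 0.0), "gripper_closed": True,
        "objects": {"red box": (1.0, 0.0)}}
print(reward_for("grasp the red box", far))
print(reward_for("grasp the red box", done))
```

Because the command is an argument of the function, the operator can change the goal at runtime without touching the source code.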
June 02, 2026
Grounding mechanism 101
A DIKW pyramid consists of the abstraction layers data, information, knowledge and wisdom. A grounding mechanism maps the items of one layer to the items of another. In an example warehouse robot, the data layer consists of sensor readings like GPS coordinates, lidar distance and battery capacity, while the information layer consists of tags like "battery_full, north, obstacle_ahead".
The grounding mechanism generates the links between the entries. For example, a lidar distance of 10 cm is mapped to "obstacle_ahead", while a battery level of 10% is mapped to "battery_empty".
In general, a grounding mechanism is a sort of matching game. It answers the question of which situation is mapped to which description. Such a mapping is the core element of an advanced artificial intelligence.
To demonstrate why a matching game enables artificial intelligence, let us look at an example. Suppose the human operator submits the following command to the warehouse robot: "move to the green area, grasp the small box on the left side, bring the box to the blue area, drop it into the shelf, then recharge your battery".
If the grounding mechanism is missing or has been deactivated, the command is interpreted as a string of 144 characters. It isn't formulated in the C/C++ programming language, so it cannot be executed; it can only be stored in main memory.
Suppose the robot has a built-in grounding mechanism; then it is possible to parse the sentence word by word. The word "green" matches a certain RGB value, the word "box" is mapped to a certain shape in the camera image, the word "shelf" is mapped to a picture of the shelf, and so on. The parsing algorithm fetches a word from the sentence and looks it up in the database to identify the item from the data layer of the DIKW pyramid. From a robot's perspective, understanding a sentence means matching items from the information layer to the data layer.
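The mapping from the data layer to the information layer can be sketched as a small function. The sensor field names and the thresholds are assumptions chosen to reproduce the examples above; a real robot would need calibrated values.

```python
def ground(sensors):
    """Map raw sensor readings (data layer) to tags (information layer)."""
    tags = []
    if sensors["lidar_cm"] <= 10:            # assumed threshold
        tags.append("obstacle_ahead")
    if sensors["battery_percent"] <= 10:     # assumed threshold
        tags.append("battery_empty")
    elif sensors["battery_percent"] >= 90:   # assumed threshold
        tags.append("battery_full")
    return tags

print(ground({"lidar_cm": 10, "battery_percent": 10}))
print(ground({"lidar_cm": 250, "battery_percent": 95}))
```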
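The word-by-word lookup can be sketched with a dictionary that plays the role of the database. The entries in `LEXICON`, including the RGB values and the template file names, are hypothetical placeholders; words without an entry are simply skipped.

```python
import re

# Hypothetical lexicon: each known word points to an item of the data layer.
LEXICON = {
    "green": {"kind": "color", "rgb": (0, 255, 0)},
    "blue": {"kind": "color", "rgb": (0, 0, 255)},
    "box": {"kind": "shape", "template": "box_template.png"},
    "shelf": {"kind": "picture", "template": "shelf_template.png"},
}

def parse(command):
    """Fetch each word of the command and look it up in the lexicon."""
    words = re.findall(r"[a-z]+", command.lower())
    return [(word, LEXICON[word]) for word in words if word in LEXICON]

command = ("move to the green area, grasp the small box on the left side, "
           "bring the box to the blue area, drop it into the shelf, "
           "then recharge your battery")
for word, item in parse(command):
    print(word, "->", item)
```

Without the lexicon the same command is only a sequence of characters. With it, every known word points to something the sensors can measure.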
June 01, 2026
Symbol grounding as an answer to NP-hard problems
Before it is possible to describe grounded language, there is a need to explain how artificial intelligence was imagined until the year 1990. It was treated like computer programming, in the sense that there is a CPU which executes a program and it is up to the programmer to make the algorithm as intelligent as possible. Artificial intelligence was thought of as a very advanced program which is executed by a computer.
In other words, the computer was seen as a problem-solving machine, and the only open question was which sort of algorithm is needed to solve a certain problem. For example, motion planning in robotics was solved with motion planning algorithms, while computer chess was solved with alpha-beta pruning. Most of these AI-related algorithms were designed as search algorithms: the computer traverses the state space of the domain, and this allows it to find the optimal action.
The symbol grounding problem formulated by Stevan Harnad questions this algorithm-oriented paradigm. This might explain why grounded language is a niche topic within computer science even today. Because computer science and algorithms were often treated as the same thing, the question of how to program a computer without an algorithm was out of scope.
Let us listen closely to how Harnad, Brooks and Steels argue about grounded language. The core element is the sensory perception of a robot. The assumption is that the perception is transmitted to the computer. There is no need to calculate something; the focus is on the data transfer. A light sensor detects light and the information from the sensor is sent over a cable to the computer. The symbol grounding problem doesn't focus on the computer itself but on the cable between a sensor and a computer, very similar to a computer network. In this view, a computer network differs from a Turing machine: its purpose is not to run an algorithm but to communicate data, often organized in protocol layers.
The paradigm shift from algorithm-centric computers towards protocol-oriented data transmission is the core element of the symbol grounding problem. Artificial Intelligence isn't explained as processing or program execution; instead, it is imagined as the gap between two hosts.
Let us compare the hardware. In classical algorithm-oriented AI, the basic building block is a central processing unit, for example a 32-bit CPU. The CPU is built from transistors on a chip and is programmed in assembly language. In contrast, the symbol grounding problem assumes that there is a Cat5 copper cable which delivers packets. It is up to the network engineer to define the protocol of the packets.
The paradigm shift can be explained with NP-hard problems. NP-hard is a category of problems, many of them related to artificial intelligence, for which no efficient algorithm is known. Many motion planning problems in robotics, like the piano mover's problem or model predictive control, are NP-hard. The term NP-hard refers to the runtime of an algorithm executed on a CPU. In other words, even a modern 64-bit CPU can't solve large instances of these problems in a reasonable time, because the runtime grows too fast with the problem size.
The holy grail in computer science is how to deal with NP-hard problems. The answer proposed here goes back to Stevan Harnad's famous 1990 paper. He didn't mention NP-hard problems, but grounded language makes it possible to sidestep them. Instead of using a CPU to calculate a mathematical problem, a copper cable is used to solve a data transmission problem. This new perspective is powerful enough to address motion planning problems in robotics.
May 30, 2026
The transition from closed to open robotics systems
The last AI winter lasted until the late 1990s. In this period, some robots were built by engineers and some AI algorithms were designed, but most of them failed in practice. The only things working reliably were simple CNC machines, which were used in static factory setups to cut a piece of metal. Even a pick-and-place robot that adapts to a changing assembly line was beyond the capabilities of 1990s technology.
Today's robotics in the 2020s is much more powerful, and this improvement can be explained by a paradigm shift. Robotics until the 1990s was organized around a closed system assumption. The idea was to treat a robot as a microcontroller which runs software in batch mode. It was a mathematical and computer science artifact, controlled by deterministic algorithms implemented in a programming language like C/C++. The assumption in the 1990s was that such a paradigm is powerful enough to create artificial intelligence, and that the existing tools like a 16-bit microcontroller, a PID controller, a Kalman filter or a C compiler are sufficient to build robots.
What the engineers didn't know was that these tools lead to a dead end. Even with today's knowledge, it is not possible to build a robot with such equipment alone. What is needed are different tools, located outside of computer science, which make it possible to build open systems. These advanced tools are:
- motion capture: a human actor demonstrates a movement for a camera
- grounded language: a vocabulary to communicate with a robot
- a multimodal dataset which stores mocap data and semantic annotation in a database
These tools were missing in the late 1990s, not because of technical constraints but because the difference between open and closed systems was not understood. A robot can be built according to only one of the principles: either the robot understands natural language or it doesn't. Either the robot can play back motion capture data or it can't.
The dominant reason why these advanced tools were missing in the 1990s is that they are located outside of mathematics and computer science. Motion capture has its roots in biomechanics and in animated movies; its precursor is rotoscoping, a technique for drawing cartoons from filmed footage. Grounded language has its roots in linguistics, which belongs to the humanities, the opposite of mathematics.
In the 2020s, computer science has redefined its own boundaries, because the former restriction to mathematics and algorithm theory was not able to solve robotics problems. No matter which mathematical theory was applied to robot control, all of them failed. The dominant problem in robot control is the state space explosion. A robot has many degrees of freedom, and planning in the configuration space of such a kinematic chain needs too many CPU cycles. The obstacle is not the lack of an algorithm which searches the state space faster; the mathematical perspective itself is the obstacle.
The inner workings of a state-of-the-art robot from the 2020s can be explained as a machine which understands English commands and has access to a motion capture database. Combined, these tools allow the robot to solve complex problems like biped walking and grasping objects. From an AI perspective, the intelligence of the robot isn't encoded in a computer program; it has its origin outside of the robot, namely in motion capture data and verbal commands. The robot is reduced to a minimal device which converts a command into action by executing an existing trajectory with its servo motors. For example, the human operator may say "move with trajectory #12"; after fetching the trajectory from the database, the robot activates its servo motors. Strictly speaking, the intelligence doesn't originate in the robot but comes from the environment, namely the human operator.
Robots constructed as open systems can be seen as communication devices instead of computing devices. They do not run a program like a Turing machine; they parse a message like a fax machine.
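The trajectory example can be sketched as a lookup followed by playback. Everything here is an assumption made for illustration: the contents of `TRAJECTORIES`, the joint angle values and the `send_to_servo` callback stand in for a real motion capture database and a real motor interface.

```python
import re

# Hypothetical motion capture database: trajectory number -> list of joint angles.
TRAJECTORIES = {
    12: [(0.0, 0.0), (0.1, 0.2), (0.2, 0.4)],
}

def execute(command, send_to_servo):
    """Parse the command, fetch the trajectory and replay it pose by pose."""
    match = re.fullmatch(r"move with trajectory #(\d+)", command.strip())
    if match is None:
        raise ValueError(f"command not understood: {command!r}")
    trajectory = TRAJECTORIES[int(match.group(1))]
    for pose in trajectory:
        send_to_servo(pose)  # in a real robot this would drive the motors
    return len(trajectory)

sent = []
execute("move with trajectory #12", sent.append)
print(sent)
```

The function contains no planner. The knowledge about how to move sits entirely in the stored trajectory and in the command of the operator.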
