March 01, 2022

4 From Teleoperation towards autonomous control

 


The baseline in robotics is manual control with a joystick. Such a control technique always works, even if the Artificial Intelligence module isn't working or wasn't created yet. What AI researchers are usually trying to achieve is to replace teleoperated control with a more advanced autonomous system. Creating such an interaction mode amounts to creating an Artificial Intelligence, or a game AI for short.
From a technical perspective this goal can be reached by converting sensor signals [4a] into a cost function. A sensor signal is something which can be measured, for example with a light sensor or a tilt sensor. It is usually displayed on a dashboard, which is comparable to the instrument panel in a car. The dashboard shows all the sensor readings in a single place. In the second step this information is used for a judgment in terms of a good or a bad situation. For example, if the robot has reached the goal position this is desired, but if it has collided with an obstacle this is a mistake. The judgment is usually expressed on a numerical scale: low costs are equal to a value around 0, while high costs have a value of 1.
In the literature this principle is referred to as the grounding problem. Grounding means to feed the sensor signals into an instrument panel and to use this instrument panel to generate the cost information. The cost information can then be used to determine the optimal action for the robot. [4b Grounding with an integer array]
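A minimal sketch of this idea in python, assuming two hypothetical sensor readings (the distance to the goal in centimeters and a collision flag) and an arbitrary normalization of 500 cm, could look like this:

def cost(distance_to_goal, collision):
    # a collision is the worst case and gets the maximum cost
    if collision:
        return 1.0
    # otherwise the remaining distance is scaled into the range 0..0.9
    return min(distance_to_goal / 500.0, 0.9)

print(cost(distance_to_goal=40, collision=False))  # 0.08, a good situation
print(cost(distance_to_goal=40, collision=True))   # 1.0, a bad situation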
> 4a Sensor readings as python list
 

4a Sensor readings as python list

A default datastructure for storing sensor readings is a python dictionary.
dictionary = {"speed": "30mph", "throttle": 100, "temperature": "20 degree Celsius", "direction": "+14 degree north"}
From the perspective of an expert system this information represents the facts delivered as input values. The task for the expert system is to transform the facts into numerical cost information. This transformation makes it possible to control the system.
There are two main problems. The first one is that most teleoperated robots are based on a camera signal as the input feature. That means there is no discrete information available which can be stored in a python dictionary; instead the input is a camera picture with a million pixels. The second problem is that it is unclear how exactly the sensor readings should be converted into a cost value.
Without answering these problems, the robot domain remains ungrounded. That means autonomous control can't be established.
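As a rough sketch of the first problem, a camera frame can be reduced to a handful of discrete facts before it is stored in such a dictionary. The thresholds, region sizes and feature names below are made-up assumptions, not a proven recipe:

import numpy as np

def extract_features(frame):
    # frame: a grayscale camera picture, e.g. a 1000x1000 numpy array
    # the million pixels are reduced to a handful of discrete facts
    left = float(frame[:, :500].mean())
    right = float(frame[:, 500:].mean())
    return {
        "brightness_left": round(left, 1),
        "brightness_right": round(right, 1),
        "obstacle_ahead": bool(frame[400:600, 400:600].mean() < 50),
    }

print(extract_features(np.zeros((1000, 1000))))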
> 4b Grounding with an integer array
 
4a1 Sensor based grounding
Grounding tries to bridge the gap between teleoperated robots and reinforcement learning environments. The interesting situation is that both can be realized from a technical perspective, but what is missing is the combination of the two, that is, creating the model for a certain teleoperated task.
Suppose there is an RC controlled car: what does the RL model have to look like? Suppose there is a teleoperated robot hand: which model fits this problem? All these questions are called the grounding problem, and it is the core of Artificial Intelligence. In the previous sections of this post some attempts to solve the grounding problem were introduced, but it remains unclear which of them make sense. A sensor based strategy is introduced next.
What all robotics systems have in common is that a certain amount of sensory information is available. Grounding means to interpret these signals with a model. Instead of programming an entire RL model, the idea is to focus only on the sensor model.
From a technical perspective a sensor signal seems to be a trivial case. For example, there is a proximity sensor which produces a signal from 0 to 100, and the information is stored in a variable. According to the python prototype in section 4b3 the sensor information is equal to the game state of the system and is stored in an array. But perhaps there is a need to process this information further?
What we can say for sure is that without sensor information a system model won't work. The task of an RL model is to predict future states of the system, and a state is equal to the sensor array which describes the system. Not the agent stands in the focus, but the perceived sensory input of this agent.
Let us try to understand how sensor information is processed by an RL environment. Raw data from the simulation is measured, for example the distance from the robot to an obstacle, and then this feature is stored in an integer array as the game state. Then it remains unclear what the next step is. Can this feature be processed further?
One possible attempt is to convert an integer array into natural language. But it remains unclear how to do so exactly.
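A minimal sketch of such a conversion, assuming a hypothetical three-element game state [room, haskey, dooropen], could render the array into sentences like this:

def render(gamestate):
    # gamestate: [room, haskey, dooropen] -- a hypothetical layout
    rooms = ["room A", "room B", "room C"]
    text = "You are standing in " + rooms[gamestate[0]] + "."
    if gamestate[1] == 1:
        text += " You are carrying a key."
    if gamestate[2] == 1:
        text += " The door is open."
    return text

print(render([1, 0, 1]))  # You are standing in room B. The door is open.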

4b Grounding with an integer array

A possible pipeline for creating a cost function works as follows: the sensor information is fed into an integer array, the array is transformed into a text adventure, and then a cost function is created for the text adventure.

Let us focus on the integer array because it is the layer in between. An integer array can be created in all existing programming languages and it needs a very small amount of storage. A typical example is a=[43,0,32,4,0]. The cells in the array represent values from the sensor readings. For example, the first value 43 stands for the distance in centimeters from the robot to the wall in front of it.

The information is not only stored in the array, it is also needed by the text adventure engine as the current state. That means the plot is told with the integer array as the source of facts.

Unfortunately it remains an unsolved problem how to transform integer values into a game; this depends strongly on the individual domain. But the surprising fact is that the complete state of such an adventure game can be stored in an integer array. For example, if a text adventure prints out the sentence "you're standing in room B", then the room information, which is A, B or C, is certainly stored in the integer array. And if the character executes an action like "move forward", the game engine will change the world state in the integer array. Even if the user assumes that the game understands him because of the natural language parser, the only thing the game engine is capable of is modifying the integer array and printing out certain sentences in response to the values in the array.

> 4b1 Game state
4b3

4b1 Game state

A minimalist text adventure consists of two rooms which are connected with a door, a robot, some objects in the rooms and a goal. For example, the robot has to find the key which is in the desk, then open the door to the other room, find another key which is behind the picture on the wall, and then the robot can leave the room, which is equal to winning the game.

To formalize this text adventure in a computer readable format, it has to be grounded in an integer array. Or to be more specific, the current game state is stored as a vector. On this vector, possible actions numbered from 0 to 8 can be executed. For example:

[roomposition, haskey1, haskey2, deskopen, pictureonwall]

A concrete situation for the vector would be [0,0,0,0,0], and then action #1 is executed. After executing the action, the vector has a different value. This kind of domain transformation sounds a bit unusual, because for the human user the game is about two rooms, a key and lots of possible actions, while for the computer the game is equal to a numerical vector plus numerical actions which have no meaning at all. What the user will read on the screen is only the rendered frontend. For example, in the vector the room position is 0 and this is translated by the GUI into "you're located in the first room which is 4x3 meters and you see a desk in the middle". The natural language description allows the human operator to get a visual impression of the scene. The sad reality is that in the computer there is no such thing as a room, but only the number 0 stored in the first position of the game state vector.
>4b2  >4b1a

4b1a A simplified model for a text adventure

A model is a simulation for playing a game. In the easiest case the game state is stored in an integer array:
gamestate=[
  0, # position
  0, # haskey
  0, # dooropen
]
In addition, some actions are available, like moveto(), opendoor(), takekey(). After executing an action the game state is changed according to the game rules. For example, the precondition for opening the door is that the robot has the key in its inventory.
Let us try to utilize this model in an AI situation. Suppose the robot has to win the game. In the "escape the room" challenge, the robot has to do some tasks in the room and at the end it can open the door with the key. With a model it is possible to analyze the game tree in a systematic way. That means a graph is created for all the possible action sequences, and traversing this graph is equal to solving the quest.
The interesting situation is that from a computational perspective the graph is not very large. It can be created and traversed on a standard computer in less than a second. The only bottleneck is that the model is needed beforehand. Without a model the robot isn't able to predict future game states.
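A minimal sketch of such a graph search, assuming the three-element game state from above and made-up action rules, could look like this:

from collections import deque

def step(state, action):
    # state = (position, haskey, dooropen); returns the successor state
    position, haskey, dooropen = state
    if action == "takekey":
        haskey = 1
    elif action == "opendoor" and haskey == 1:
        dooropen = 1
    elif action == "moveto" and dooropen == 1:
        position = 1
    return (position, haskey, dooropen)

def solve(start, goal):
    # breadth first search over all action sequences
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for action in ["takekey", "opendoor", "moveto"]:
            nxt = step(state, action)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, plan + [action]))
    return None

print(solve((0, 0, 0), (1, 1, 1)))  # ['takekey', 'opendoor', 'moveto']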
The described model can be seen as a high level example. Instead of describing the scene with pixel level coordinates or as a video stream, the game consists of three simple features which are stored in an integer array.

 

4b2 Cost function design with game states

A game consists of game states over a time axis. A game state is stored in an integer array which is equal to a mathematical vector. After executing an action the game state changes into a new value. This is equal to playing the game.
From the perspective of Artificial Intelligence it is important to know what the cost value of a certain game state is. For example, the game state [0.2,2,0,1,0,1] is assigned the cost value 0.9.
> 4b3
 
4b2a Teleoperation with cost function
Teleoperation is the default method for controlling a robot. It gives the robot human level capabilities; that means the robot is able to solve any problem. The disadvantage of teleoperation is that it is not in itself an Artificial Intelligence. There is no software in the loop which generates the actions; the human operator is doing so.
So the question is how to overcome teleoperation in favor of autonomous systems. The first thing to do is to describe what enhanced teleoperation looks like. Semi autonomous teleoperation works with a joystick controlled robot, but in addition there is a real time cost function in the background. That means a computer judges each situation and prints out a score indicating whether the current situation is good or bad.
Such a cost function helps the human operator to control the system more easily, and it is also the first step towards autonomous robots. The open question is how to design such a cost function. A cost function works with a model, and the model is a machine readable description of reality.
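A minimal sketch of such a semi autonomous loop, where read_joystick, read_sensors, send_command and cost are hypothetical callbacks provided elsewhere, could look like this:

import time

def teleoperation_loop(read_joystick, read_sensors, send_command, cost):
    # joystick commands are passed through unchanged,
    # but every situation is scored by the cost function in the background
    while True:
        command = read_joystick()
        send_command(command)
        state = read_sensors()
        score = cost(state)
        if score > 0.8:
            print("warning: bad situation, cost =", score)
        time.sleep(0.05)  # 20 Hz control loop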
There are many strategies for creating a model. Instead of programming it in Python, the more efficient way would be to use rapid prototyping tools. The idea is that there is some sort of software which has parameters and predefined models for all sorts of robot problems. For example, the human operator specifies whether the domain is a maze problem or a walking robot, and then the software generates the cost function. The human operator can modify the domain with the GUI so that it fits the needs.
The overall system can be labeled as a rapid prototyping tool for creating cost functions. The idea is that the user can create the cost models in under 3 minutes with a few mouse clicks. So the overall workflow is not based on neural networks or similar strategies but works with the rapid prototyping paradigm. It is only a detail question what exactly the rapid prototyping tool looks like and which sorts of domains are supported out of the box.
 
 

4b3 The grounding problem in detail

Grounding is some sort of bridging discipline between manual control on the left side and reinforcement learning on the right side. Manual control is equal to teleoperated robotics and motion capture; both subjects are well understood. The right side consists of model based reinforcement learning, which is also well understood.
That means there are two subjects which can be realized with today's technology, and the only thing missing is the bridge between them. In a single sentence, grounding is about converting motion capture data into an RL model.
If an RL model is given, the problem can be solved autonomously. The problem is that nobody knows how to bridge these disciplines. That means the grounding problem remains unsolved, and this prevents autonomous robots.
To make the missing bridge clearer, let us define what an RL model is. It is equal to an RL environment which consists of the following pseudo code:
class Simulation:
    def __init__(self):
        self.state = [0, 1, 0.4, 0]  # game state as a numerical vector

    def step(self, actionid):
        # each action id modifies the game state in a domain specific way
        if actionid == 0:
            ...
        elif actionid == 1:
            ...
        reward = 0.0  # the reward function judges the new state
        return reward
This Python template contains all the important elements of a model: there is a game state, a list of actions and a reward function. Such a model can be solved with existing algorithms, like Q-learning or the A* search algorithm. The problem is to create such an environment for a certain domain. That means there is an RC car which can be controlled with a joystick, and what is missing is the RL model for this car.
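As a rough usage sketch, assuming the Simulation class above has been filled in for a concrete domain, a first test could run a random rollout:

import random

sim = Simulation()
total_reward = 0.0
for t in range(100):
    action = random.choice([0, 1])   # a real solver would pick actions systematically
    total_reward += sim.step(action)
print("reward of the random rollout:", total_reward)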
> 4b4

4b4 Can the grounding problem be solved with algorithms?

The assumption is that, similar to all computer science problems, the grounding problem has an algorithmic solution: a certain piece of software or a program is needed and then the issue is solved. A closer look into the existing literature shows that this attempt is difficult or nearly impossible. This makes it likely that the grounding problem can't be solved algorithmically at all. The only way of dealing with it is a manual software development workflow which consists of online repositories and issue tracking systems.
Let us first mention some existing approaches to bridging the gap between teleoperated control and fully automated control. Techniques frequently mentioned as one-size-fits-all solutions are case based reasoning, keyframes, learning from demonstration, cluster analysis, evolutionary algorithms and model induction with tensor fields.
What all these techniques have in common is that they are difficult to explain and that they have failed to create models. A model is basically a simulation of a certain aspect of reality, and the only instance able to create one is a human programmer. What programmers use for this task are very basic techniques like object oriented programming. This allows them to convert a domain into a reinforcement learning environment.
The mentioned advanced techniques like case based reasoning or cluster analysis may help to support the creation of models, but they are not powerful enough to manage the task alone. In case of doubt they can be left out and the programmer creates the model some other way.
Let us describe which elements are needed for a model: a game state, a list of actions and a reward function. These elements are needed for any model. How the programmer creates the model in detail depends heavily on the domain. As with programming in general, it is not possible to automate this step further.
The working hypothesis is that the grounding problem is not a computer science problem but, similar to programming in general, a creative process which is done by hand. Reusing code simply means that programmer 1 has created the model already and programmer 2 uses the same code again. This is what is done in reality: somebody has programmed a grasping model and somebody else uses this model for a planning problem.
Sure, from a philosophical perspective it might be interesting whether a computer is able to program itself and whether a robot can generate its own model on the fly. But this is not possible with today's technology, especially not with the mentioned techniques.
The takeaway from the grounding problem is that every robot needs a model in the loop, otherwise the system won't operate. The creation of the model can't be formalized further. In most cases it has to do with writing lots of lines of code and benchmarking them in a robot competition.
> 4b5
 
4b4a RPG Maker
From an academic perspective, grounding is equal to system identification. For a certain domain, a model is needed to simulate this domain. There is a slightly different perspective on the same issue: instead of calling it a system identification problem, it can be labeled as game design. The interesting situation is that many rapid prototyping tools are available to support game design.
The most popular is perhaps the RPG Maker. It is a piece of software written in Delphi 5 and was first published around the year 2000. The software is basically a map editor which is enhanced with a collection of png images. This allows a beginner to write a role playing game in under 4 minutes. RPG Maker avoids object oriented programming entirely; game creation is done with a GUI.
What makes this perspective interesting is that after such a role playing game has been created the domain is grounded. That means the robot, aka the character in the game, can move on a map, and there are other entities, for example objects, which allow basic interaction. Such a game is an accurate simulation of the domain, so it is a model.
Let us try to describe rapid prototyping tools from a more abstract perspective. The shared understanding is that game design is a programming task which is realized with manpower. That means a human programmer has to invest a certain amount of hours until the game is available. Rapid prototyping tools like the python programming language or the mentioned all-inclusive game maker tool help to reduce this time. In this view, grounding has nothing to do with learning from demonstration, cost functions or neural networks; it has to do with rapid prototyping. It has to do with measuring the lines of code and determining how long it will take until a certain domain is converted into a game.
 
4b4b Solving the grounding problem with rapid prototyping
Grounding means to convert a domain into a model. Grounding is often described as a very complicated challenge because the necessary tools are missing. A typical tool chain for solving the grounding problem is a neural network used for system identification; at least, this is the understanding in some of the literature.
But there is an alternative approach which assumes that grounding is equal to rapid prototyping. Rapid prototyping tools were first introduced in the gaming industry to speed up software engineering, but the same principle can be adapted to robotics problems as well. Popular rapid prototyping tools are SolidWorks, Matlab, COSMOSMotion and of course Simulink. What these programs have in common is that, similar to RPG Maker, they were programmed for the Windows operating system and allow the creation of complex new software with a few mouse clicks.
So let us describe how a robot model is created with rapid prototyping software. The user starts the program, uses an existing template, presses some buttons, defines the geometry and basic physical parameters, and then the model can be executed. Sure, in reality the workflow is a bit more complicated, but in general this is how rapid prototyping works.
So it seems that grounding has nothing to do with cognitive architectures, natural language understanding or expert systems; it has to do with reducing the amount of time. Instead of creating the source code for a model in a programming language, the idea is to click in a rapid prototyping tool and generate the same result. It is only a detail question how such tools behave.
Let us go into the details. The most basic robot consists of the inverse kinematics problem and a walking pattern. Writing the model for both domains from scratch will take an endless amount of time, even if modern scripting languages like python are utilized. But in a rapid prototyping tool the same task can be realized much faster. The assumption is that such a tool has a built-in generator for IK domains. That means the user selects in the menu that he needs a kinematic chain with 3 links, and then the resulting robot arm is shown on the screen. The underlying model is generated on the fly and can be imported into other programs.
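A rough sketch of what such a generated model could provide, assuming a planar kinematic chain with made-up link lengths, is a forward kinematics function like this:

import math

def make_chain(link_lengths):
    # returns a forward kinematics function for a planar chain
    def forward(joint_angles):
        x, y, angle = 0.0, 0.0, 0.0
        points = [(x, y)]
        for length, theta in zip(link_lengths, joint_angles):
            angle += theta
            x += length * math.cos(angle)
            y += length * math.sin(angle)
            points.append((x, y))
        return points
    return forward

arm = make_chain([10.0, 8.0, 5.0])   # a 3 link arm generated from a parameter
print(arm([0.3, -0.2, 0.5]))         # joint positions of the arm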

4b4b1 Grounding as rapid prototyping
Before an artificial intelligence is able to control a robot, a model of the domain is needed. The model consists of a game state, possible actions and, very important, a cost function. The cost function allows the solver to find the correct action. A possible interaction with the solver would be that the robot should pick&place some objects and the solver determines which low level actions are needed.
Unfortunately, the requirement for such a robot control system is that the model of the domain is available. A model is some sort of simulator which is difficult to program from scratch. Creating a model for an arbitrary domain is called the grounding problem. Grounding means connecting the internal simulation to reality.
A possible answer to the grounding problem is rapid prototyping software similar to game design software. The idea has nothing to do with artificial intelligence itself; the focus is on the programming workflow. A good rapid prototyping tool reduces the amount of programming effort drastically. In the best case a newbie is able to generate his first model in under 3 minutes. So we have to ask what exactly such a prototyping tool has to look like. In most cases it is some sort of map editor or robot editor which has some predefined widgets. Instead of programming the simulation in an object oriented language like C++, the idea is that the user can click in a GUI and then the tool will generate the runtime simulation.
Let us make a practical example. The desired model for a robot domain is given in the following python source code:
class Environment:
    def __init__(self):
        self.gamestate = [0.1, 0, 0, 3]  # game state as a numerical vector

    def step(self, actionid):
        # each action id modifies the game state in a domain specific way
        if actionid == 0:
            ...
        elif actionid == 1:
            ...
        cost = 0.0  # the cost function judges the new state
        return cost
This layout is similar to what is used in the OpenAI Gym framework. But this environment isn't programmed manually by the programmer; the code is generated with a prototyping tool. The tool interacts with the user through a GUI. That means the user clicks on some buttons and then the python code gets generated. This python code (=model) can be utilized by a solver to determine the low level actions for the robot.
 
4b4b2 The pinball construction set as an early example for rapid prototyping
In the mid 1980s the Pinball Construction Set was published. It was not only a video game but also a graphical development tool. From a user's perspective it was some sort of level editor. The player / designer was asked to drag and drop icons onto a 2d map, and then the resulting pinball machine could be started.
What makes the software interesting is that the construction set was basically a rapid prototyping system for creating pinball models. The idea is that, with the engine in the background, the human designer can create a simulation of any pinball table in the world. It can be modified by adjusting the position of the elements in the game.
Normal games work with a predefined scoring system; for example, the player has to avoid obstacles or control a pinball machine. In the case of the rapid prototyping system there is an additional layer: the social roles of the player and the programmer of the game are blurred.
The most obvious reason why the mentioned software was a great success was the fast interaction. It takes around 3 minutes until the player has created his first pinball-like game. This duration is much shorter than anything a programming language can offer. Also it is much easier to arrange the elements with drag and drop.
To understand what is unique about such a level editor we have to explain what the term system identification means within control theory. System identification is about creating a simulation of a domain. The real world gets converted into a computer program, and with such a program certain interactions are possible. System identification is known as a hard challenge within computer science. In contrast, the Pinball Construction Set mastered the challenge easily. The only precondition is that the domain has something to do with a pinball machine; then the user is able to create the model within minutes.
A more recent tool is "Mario Maker 2". The interaction has much in common with the older pinball software. The user can drag and drop icons onto the main map, which consists of tiles, and then the game can be restarted. This makes it possible to modify existing levels or create new Mario levels from scratch. As with all rapid prototyping tools there is no need to use an object oriented programming language; the user interacts with the software only with a mouse. This reduces the development time to 3 minutes or less until the first game is created.
 
4b4b3 Learning from demonstration vs rapid prototyping
There are two major techniques available for solving the grounding problem, which is about creating a model for a domain. Learning from demonstration is the first idea. Here, an expert demonstrates the desired actions and this allows the model to be created in the background. The problem with LfD is that no concrete algorithm or software is available; it works only in theory. That means the expert does the task, then the game log is recorded, and this is supposed to produce the model somehow.
In contrast, the opposite idea is to utilize rapid prototyping software. Here the workflow is more concrete. Rapid prototyping means that there is an app which has much in common with a game design tool. In the app the human specifies that he would like to create a traffic simulator. He specifies whether the game contains a 3d map or an isometric map, and he selects one of the lane layouts. After pressing the create button, the entire traffic simulation is created by the software, including the cost function for the car.
Such rapid prototyping tools can be realized more easily than learning from demonstration systems, because in the case of a prototyping tool it is clearly defined what the aim of the tool is. It is only a detail question how powerful the software will be. If the program is only basic, the user can decide only between predefined traffic games without many parameters; that means all the simulations will look the same.
Perhaps it makes sense to define what exactly a rapid prototyping tool is. It is a replacement for handwritten code. Instead of writing 2000 lines of code in C++, the user clicks some buttons in a GUI menu and this creates a game or a simulation. The resulting model can be used to play the game, or a solver can use it to find the optimal actions within the game.
4b4b4 Rapid prototyping in detail
It is a bit complicated to define what rapid prototyping is. In the easiest case it is a level editor for an existing game. For example, the famous "Incredible Machine" game has a built-in level editor which allows the user to create new levels from scratch.
From a technical perspective such a tool is not very advanced. It is written in mainstream programming languages like C/C++ and the only interesting feature is that it uses a GUI. What makes rapid prototyping interesting is that the workflow for potential users is different. The idea is that the user can create his own game or level within minutes. That means, if the user is able to create an entire level in under 5 minutes without typing in source code, then the application is for sure a rapid prototyping tool.
The most advanced examples of rapid prototyping are available in the video game domain. There is an endless amount of game construction kits and level editors. In the easiest case the user can drag and drop icons onto the map, which makes it possible to create levels for jump'n'run games, racing games and point&click games. In most cases a level editor is easier to use than entry level programming languages like python. Python is more beginner friendly than C/C++, but it stays within the programming paradigm. In contrast, level editors speed up the design process drastically.
The typical level editor has two important elements: first, the program provides a GUI, and second, the user can create his own game in less than 5 minutes.
 
4b4b5 Rapid prototyping of animation
In the literature around learning from demonstration and neural networks there is a large misconception. The assumption is always that autonomous robots have something to do with mathematical and/or physical terms. The subject is located, at least this is the assumption, in the domain of natural science, and therefore the established terms like differential equations and system theory have to be utilized to talk about robots.
The alternative perspective is established from a design standpoint. The question is how to create a robot system in less than 5 minutes. What is needed is some sort of software which has some predefined models. Let us go into the details.
The user starts the software and opens a model for a walking animation. The model specifies a stick man figure and also provides the keyframes of a motion graph. What the user has to define is only the sequence. For example, the user decides that the keyframes #2, #5 and #3 should be played back in serial order. After clicking the animate button the movie gets rendered. The overall interaction of the user with the software takes less than 5 minutes and the result is an animated humanoid robot.
It is only a detail question how exactly to program such a powerful tool. The more interesting challenge is to define such software from a user's perspective. The requirement is that the user can interact graphically with the software and, very important, that he needs less than 5 minutes to create a longer, complicated animation.
The idea of storing keyframes in a motion graph is technically easy to realize. The new thing is that such a storage makes sense for creating virtual humans. To develop the idea further: not the user himself has to create the model, but somebody else has done so in the past. That means the animation model is already there, and what the human user is asked to do is select certain keyframes he would like to animate one after the other. Then the software creates the in-between pictures and generates the entire .AVI file which can be played back on the screen.
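A minimal sketch of the in-between step, assuming each keyframe is just a list of 2d joint positions and using plain linear interpolation, could look like this:

def inbetween(pose_a, pose_b, steps):
    # pose_a and pose_b are lists of (x, y) joint positions
    frames = []
    for i in range(steps + 1):
        t = i / steps
        frame = [(ax + t * (bx - ax), ay + t * (by - ay))
                 for (ax, ay), (bx, by) in zip(pose_a, pose_b)]
        frames.append(frame)
    return frames

keyframes = {2: [(0, 0), (1, 2)], 5: [(0, 0), (2, 1)], 3: [(0, 0), (1, 0)]}
sequence = [2, 5, 3]                 # the order selected by the user
movie = []
for a, b in zip(sequence, sequence[1:]):
    movie += inbetween(keyframes[a], keyframes[b], steps=10)
print(len(movie), "frames rendered")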
Perhaps the most interesting aspect is that such a pipeline has nothing to do with natural science or mathematics. The workflow doesn't fit into the established terms; it has to do with a certain kind of computer interaction. It is similar to visual programming adapted to computer animation.


4b5 Animation model for grounding

In section 4b3 it was explained what a reinforcement learning environment looks like. It is a python class which consists of a game state, actions and a reward. The question which remained unanswered was how to create such a model for a certain domain.
Sure, from a technical perspective such a model has to be programmed or learned somehow. In the case of programming, the underlying technique is called object oriented programming. But OOP itself is trivial; especially if it is used to create python programs there is little need for advice. OOP basically means writing computer code which consists of classes and methods. OOP doesn't explain how to do so for a certain domain.
In section 4b4 it was already questioned whether the grounding problem can be solved with any sort of algorithm like case based reasoning or evolutionary algorithms. The answer was no, because these techniques are not powerful enough for creating realistic models.
A possible alternative to an algorithm is to focus on a single domain, which is computer animation. The working thesis is that grounding is equal to creating an animation model. This understanding opens up the perspective to a new and very large domain, which is animation in general. Animation basically means producing movies from still images. The interesting situation is that animation, and also computer animation, has a long history and lots of books have been published about the subject.
Within the literature there is an interesting subdomain which is behavioral animation. This is equal to creating physical and kinematic models. Exactly this is needed to ground a domain in a model.[Tu21994]
Let us take a step backward. The challenge is how to solve the grounding problem. The answer is that computer animation is responsible for answering the question. That means computer animation experts suggest how to create models for a certain domain.
 
> 4b6
 
4b5a Memory based animation models
Models are used by computer scientists to convert a domain into a machine readable simulation. The idea is to predict and ground parts of the real world. The model is an abstraction mechanism which can be converted into a computer program.
There are many concepts available for creating models; object oriented programming and animation languages are two of them. The problem with both paradigms is that they are focused on the needs of computers but are hard to realize for a certain domain. The idea of using an animation language to create a robot control system sounds great at first, but it remains unclear what exactly the language has to look like.
A possible alternative is to use a taxonomy of key poses.[De2017] A key pose taxonomy works differently from an animation language and is more oriented towards the domain. Such a taxonomy can be created from a motion capture recording. It is not a programming language but some sort of database or table. In the table the key poses are sorted into groups.
It seems that data oriented modeling is much easier to realize than programming oriented approaches, namely finite state machines. A data oriented model is equal to collecting information from a domain, storing the data in a database and then clustering the information into a taxonomy.
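A minimal sketch of this clustering step, assuming scikit-learn is available and using random placeholder data instead of a real capture recording, could look like this:

import numpy as np
from sklearn.cluster import KMeans

# mocap recording: each row is one frame, each column a joint coordinate
frames = np.random.rand(5000, 30)    # placeholder for real capture data

# cluster the frames into 24 key poses; the cluster centers form the taxonomy
kmeans = KMeans(n_clusters=24, n_init=10).fit(frames)
key_poses = kmeans.cluster_centers_
print(key_poses.shape)               # (24, 30)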
Unfortunately a model needs to be executable computer code, because the model is used as a simulator which is similar to a physics engine. That means the model has to be created in python or C/C++ for sure. So data can't be executed and is therefore a poor choice for a model? Not exactly, because a data oriented model helps to divide the modeling task into subproblems: 1. create the data model, 2. create the model in source code.
[De2017]
 

4b6 Shadowing movements of a robot

Shadowing is a new idea which has to do with learning from demonstration. Before this grounding strategy can be introduced, let us first explain what simple teleoperation means. Teleoperation is equal to real time control of a robot hand. The human operator has a motion capture glove or a joystick, and this device controls the robot hand.
The open question within robotics is how to convert a teleoperated task into an autonomous one. A possible idea is to increase the delay between the human demonstration and the robot's movements. In a teleoperation setup the delay is under 10 msec; that means the human presses the joystick and the robot executes the motion. Shadowing means that there is a time delay: the human operator does something and 2 seconds later the robot repeats the actions.
The movements in between are stored in a buffer, for example a hierarchical trajectory database. These buffer oriented movements work with the learning from demonstration paradigm. There is more than one option for using the information from the buffer to control a robot. Let me give an example.
Suppose the delay is 2 seconds. The result is that the buffer contains the trajectory of these 2 seconds. And now the question is how to use this information to control the robot hand.
In the extreme case the human doesn't need to demonstrate anything, and all the movements of the robot are created from the buffer. This results in an autonomous robot. Realizing such a system is much harder than a robot arm which operates with a buffer of 2 seconds.
4b6a Delay for remote control
The default situation in teleoperated robotics is to minimize the latency. The goal is that the robot arm acts immediately, and all the technical effort serves this goal.
The surprising situation is that with a longer delay the resulting robotic system becomes more interesting. The delay doesn't solve any problem, but it creates a challenge. The challenge can be described the following way: there is a human operator who controls the joystick, but the robot arm moves 5 seconds behind the demonstration. Such a situation generates lots of questions. For example, it is possible to compress the information in the buffer; another idea is to use a database with similar demonstrations. What all these techniques have in common is that the delay forces the robot control system to use a buffer. In the simplest case the buffer is a queue datastructure which stores the trajectory for a certain amount of time. New waypoints are added on the left side, while the robot takes the next waypoint from the right side.
The interesting situation is that the duration in seconds allows the system to be scaled up. In the trivial case the buffer has a length of 0, which means there is no latency; such a system is equal to a teleoperated robot. In the advanced case the buffer has a duration of 1 hour, which is equal to autonomous control. That means the robot movements are no longer connected directly to the human operator.
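A minimal sketch of such a buffer, assuming waypoints arrive at 20 Hz and the delay is a free parameter, could look like this:

from collections import deque

class DelayBuffer:
    def __init__(self, delay_seconds, rate_hz=20):
        # queue holding the waypoints of the last delay_seconds
        self.queue = deque()
        self.capacity = int(delay_seconds * rate_hz)

    def push(self, waypoint):
        # new waypoints from the human operator enter on the left
        self.queue.appendleft(waypoint)

    def pop(self):
        # the robot takes the oldest waypoint from the right,
        # but only once the buffer covers the full delay
        if len(self.queue) > self.capacity:
            return self.queue.pop()
        return None

buffer = DelayBuffer(delay_seconds=2)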
There is a subproblem within shadowing robots, which is to measure the error between two demonstrations. The idea is that two human operators do the same task and the robot has to judge how close the actions are. In such a case the robot itself has no obligation; only the humans execute actions. So the only technical problem has to do with perception of the environment.
Similar to delayed teleoperation, such a technique doesn't solve any existing problem, but it introduces a new AI problem which is interesting to analyze further. That means a certain algorithm is required which is able to measure the difference between two trajectories.
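A minimal sketch of such a comparison, assuming both demonstrations have the same number of 2d waypoints (a real system might use dynamic time warping instead), could look like this:

import math

def trajectory_error(traj_a, traj_b):
    # mean euclidean distance between corresponding waypoints
    distances = [math.dist(a, b) for a, b in zip(traj_a, traj_b)]
    return sum(distances) / len(distances)

demo1 = [(0, 0), (1, 0), (2, 1)]
demo2 = [(0, 0), (1, 1), (2, 2)]
print(trajectory_error(demo1, demo2))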

4c A general look towards a Robot control system
A modern robot control system isn't oriented towards a mathematical understanding of the world; it is located in the arts. The purpose is to create an animation [4b5], so a keyframe generator is the core of the system.[Mi2021] [Th1989]
A common issue is how to generate the animation for a variety of domains. The motion needed for a self driving car looks different from the motion of a pick&place robot. The answer is to use rapid prototyping GUI tools.[4b4b] The idea is that the user clicks on buttons in a GUI and in under 5 minutes the animation model is created from scratch.[Do2003] [Ch2012]
To plan inside a model, a cost function specifies the transitions between the keyframes.[4b2a] It also takes care of obstacles in the environment. Cost constraints can be specified in natural language commands.[Ko2011] [Ya2021] [Ko2014]
 
4c1 Robotics automation 1x1
Understanding robotics means first and foremost knowing which sorts of problems can be solved easily with a computer and which ones are harder to realize. This understanding makes it possible to define priorities and discuss open problems in depth.
First we have to know two easy to automate tasks. The first is a teleoperated robot and the second is a model based planner. Creating a teleoperated robot means providing hardware and software which controls the robot with a joystick. Such a pipeline is documented very well all over the internet. The second technique (model based planning) is less common but can also be realized easily.
Model based planning means that for a certain domain, for example the 15 puzzle, a forward model is available which includes the cost function, and the planner determines how to solve the puzzle. The algorithm is mostly realized in C/C++ and will find the optimal trajectory very fast.
Even if both techniques work quite well, something is missing in the domain of robotics, and because of this missing part it remains a challenge to realize a robot. What is not available is a more general approach in which a robot is able to solve a new problem which has no model yet.
For example, there is a robot arm and the robot arm should grasp a bottle. Such a problem is difficult to realize because it has nothing to do with teleoperation and also nothing to do with model based planning. Or to be more specific, the robot should work autonomously, and no model is available for the grasping problem.
Because of this combination it is difficult or even impossible to write an algorithm for this issue. So we have found a hard problem which lies outside of today's capabilities.
After this introduction it is clear what the challenge in modern robotics is, and now we can try to solve the issue. The reason why the grasping problem is difficult is the contradiction between what today's AI can solve and what the grasping challenge is about. Today's engineers can create teleoperated robots for unknown domains and they can write solvers for well defined models. But the grasping challenge is located within autonomous robots and has no model at all.
Let us take a step backward to understand the situation in depth. Creating a model for a task is called system identification. A more programming oriented term for the same issue is game design. Game design means creating a model, and a model is some sort of simulation. So what is the model for grasping a bottle? Right, nobody knows. Even the more advanced literature has no answer to this problem. It is located in advanced robotics and hasn't been researched yet.
What is available are some techniques which go in the direction of system identification. One powerful technique is motion capture. Motion capture means recording an animation. This is usually done with markers which have a position in 3d space; the location is recorded over time. In theory such a log file can be converted into a model later.
The most recent approach in motion capture is model based motion capture tracking. Explaining this technique in detail is very complicated. It tries to solve a chicken-and-egg problem: even if no model is available, the assumption is that there is one, and this model makes it possible to parse the motion capture data.
Let us describe the paradoxical situation briefly. The hope is to convert motion capture data into a model. To do so we first need a model; then we can track the information. But we have no model yet, so we can't parse the information, right?
It was mentioned before that the issue is highly complicated. One possible idea is to take a look at some existing technique which may help to understand the situation. Suppose the idea is to create an animated film on the computer. Then the open source software Synfig may help to create such an animation. Synfig makes it possible to create a stop motion animation: the user defines some keyframes, and the software renders the resulting movie.
The Synfig software was never developed for the purpose of robotic grasping, but it might be utilized for this purpose. A Synfig file is basically a model. The model consists of keyframes which are located on a time axis. Question: is it possible to create a short grasping animation in Synfig? Oh yes, this is an easy task. If such an animation has been created, the keyframes can be converted into a model, and this model may help to parse a motion capture demonstration.


The screenshot shows a simple model for a grasping task. It consists of 4 pictures which are provided as index cards. Such a model can be converted into a computer program and then it helps to parse a motion capture demonstration. Then a more elaborate model can be created.
 
4c1a Why modeling is important
Computer engineers have developed a variety of tools for solving all sorts of problems. Neural networks, databases and compiled programming languages can be used to create software which runs efficiently on recent multi core processors. In addition, lots of input and output devices, for example sensors, high resolution monitors and actuators, are available. From a purely computational perspective, lots of progress has been made in the last decades.
The interesting situation is that all these gadgets are useless or have only minor priority for solving robotics problems. Basically, it is possible to use the mentioned tools correctly and yet the robot won't do anything useful. This mismatch is caused by the grounding problem. If the domain hasn't been grounded yet, there is a gap between the available computing tools and the robotics task.
The grounding problem can be solved easily by modeling. Modeling means creating a simulation for a certain problem. In most cases this is equal to a short term motion model and a long term task model. Once these models have been created, they can run on existing computer hardware and they can be combined with databases and input devices.
Because the topic of modeling is important for robotics, let us describe the issue in detail. From a technical perspective a model can be realized in software. It is source code which solves a certain issue, or to be more specific, the code formulates the challenge in a machine readable way. The most basic form of a model is a constraint. For example, the inequality x+2<5 is such a constraint. It formalizes a mathematical problem, and then it is possible to solve it.
The interesting situation is that solving the inequality is easy, because a short look will show that x is smaller than 3. This can be determined by a human mathematician or by a computer program. The more demanding task is to define the constraints in the first place. That means someone has to invent the mathematical problem, and solving it is the easier part.
In contrast, computer programming works the other way around. The self-understanding of computer engineers is to find an algorithm which solves a problem. Lots of books and libraries have been created for implementing all sorts of algorithms. The problem is that if no problem is there, any algorithm is useless. And because of this situation, robotics is a complicated domain.
Let us take a look back at how robotics has evolved over the decades. The most important breakthrough in robotics was not the invention of neural networks, and it wasn't the advent of the RRT sampling based algorithm. The revolutionary milestone was the invention of the micromouse challenge. The micromouse challenge specifies a robot problem which has to be solved by different participants. It explains to the audience that there is a maze and a robot and that the robot should do something which is measured in seconds. So we can say that the micromouse competition is some sort of model or constraint, and then the challenge has to be solved. Solving the challenge, which includes building the hardware and writing the software for the robot, is a trivial task; anybody can do so and lots of examples from the past are available. The harder to grasp problem is to invent competitions similar to micromouse, and this is the real grounding problem.
The task of creating a model for a domain is a bit esoteric, but it is not completely unknown in traditional mathematics. There is a large number of tutorials available for the "game design" problem and also for the problem of converting a textual problem into a mathematical formula.[Br1983]
Word problems are usually treated with a mathematical model (https://en.wikipedia.org/wiki/Mathematical_model). A mathematical model doesn't answer the problem itself; it converts natural language into a mathematical equation, and then the problem is solved with mathematical tools.
This principle can be utilized for the symbol grounding problem too. Or at least it is related to it.
The interesting situation is that there is no single strategy for creating a mathematical model. The reason is that grounding is located outside of mathematics. What might help to understand the situation is the case of a missing model. Suppose a robot should do something but there is no model available for the domain. That means it hasn't been specified yet what the goal is, what the subgoals are and which possible states are available in the robot game. How can the robot be programmed to solve the task? Right, there is no way of doing so. Without a precise model, the optimal trajectory can't be found. That means the automation attempt will fail for sure.
According to the more recent robotics literature from the last 20 years there are some strategies available which partly solve the grounding problem, namely motion capture, teleoperation and manual game design. All these strategies can produce a model. They are not located in mathematics or computer science itself; they have more in common with a workflow which is executed by humans. It is not the robot that has to do something, but the human, who has to run a motion capture demonstration and manually convert the raw data into a model.

4c1b Well defined problems
The main reason why robotics is hard to realize is that the problems aren't specified well enough. Let me give an example. Suppose there is a graph which consists of some nodes and edge costs. The goal is to find a path in the graph from the current node to the goal node while minimizing the costs. Such a task is well defined and can be solved easily with a computer, as the sketch below illustrates.
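For example, a minimal sketch of such a well defined problem, assuming a small made-up graph stored as a dictionary of edge costs, can be solved with Dijkstra's algorithm:

import heapq

graph = {"A": {"B": 2, "C": 5}, "B": {"C": 1, "D": 4}, "C": {"D": 1}, "D": {}}

def shortest_path(start, goal):
    # classic Dijkstra search over the cost graph
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph[node].items():
            heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return None

print(shortest_path("A", "D"))  # (4, ['A', 'B', 'C', 'D'])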
The problem can be formulated in Matlab or in python, and if the programmer struggles with some detail he will surely find help in online forums. Unfortunately, most robotics problems are not of this category. A typical robot problem looks the following way:
1. there is no graph and no other datastructure available
2. what is available is a robot and a task given in natural language
3. after pressing the start button the robot should clean up the kitchen or drive on a road in real traffic
Solving such a problem with a computer, or even formulating the solving algorithm, is not possible. In contrast to the previously mentioned problem, it is located outside of well defined problems. It is an AI problem which is not specified well enough. If the programmer tries to solve it, or even asks for help in an online forum, he won't be successful. That means other programmers don't know how to solve it either.
The reason why is a bit complicated to explain. It is not about solving an existing mathematical problem; the issue is that the problem wasn't formulated precisely. A precisely defined problem is usually equal to a model. If a model is available which includes constraints, then the problem can be solved with mathematical tools, and in addition it is possible to ask other programmers how to implement the solver in detail.
What is needed for most robotics is to define the problem as a model. A model is always an invented situation. It is not available by default; modeling is located on a higher layer.
4c1c From a problem to the model
The main reason why it is hard to program robots is a missing understanding of what the objective is. Today's engineers have access to the latest hardware and software, which includes modern operating systems, programming languages and an endless amount of disk storage, but they have no idea how to apply this technology to robotics problems.
To describe the AI problem in detail we have to take a more abstract look at the problem of robotics. What all robotics problems have in common is that a domain has to be mapped to a model. The domain can be a self driving car or a soccer playing robot. The interesting situation is that a domain can't be programmed or solved directly. The missing connection between the domain and the model is called the grounding problem or the abstraction mechanism. If this link is not there or broken, then computer engineers struggle to program the robot, which is the default situation today.
In contrast, a model for a domain can be treated easily in software. Models have the advantage that they can be simulated, future states can be predicted and, very important, they can be solved. A good and easy example of a model is the 15 puzzle. The model for this problem fits in under 30 lines of code: it consists of the possible movements in the puzzle and the objective, i.e. what the goal is.
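A minimal sketch of such a model, assuming the board is stored as a flat tuple of 16 numbers with 0 for the blank tile, stays well under 30 lines:

GOAL = tuple(list(range(1, 16)) + [0])

def moves(state):
    # returns all successor states reachable by sliding one tile
    blank = state.index(0)
    row, col = divmod(blank, 4)
    successors = []
    for drow, dcol in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        r, c = row + drow, col + dcol
        if 0 <= r < 4 and 0 <= c < 4:
            other = r * 4 + c
            board = list(state)
            board[blank], board[other] = board[other], board[blank]
            successors.append(tuple(board))
    return successors

def is_goal(state):
    return state == GOAL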
In general, models are located within computer science, while raw domains are located outside of computing. So the question is: how to map a domain to a model? Nobody knows, because the grounding problem remains unsolved.
But even if it is impossible to create models automatically, the previous description gives an idea of why robotics has failed in the past. The explanation why a certain robot collides with the wall in a maze is that something is wrong with the model. From a support perspective it is possible to advise any robotics engineer to improve his model.
Most techniques in robotics have to do with modeling, though it is mostly not described this way. For example, the engineers talk about motion capture, neural networks or natural language, but what the debate is really about is how to map a domain to a model. This debate is highly complex and it is much harder than solving an ordinary puzzle. A normal puzzle has the advantage that the model has already been defined. For example, in a chess puzzle there is a board which consists of 8x8 squares and then a certain task has to be fulfilled.
In contrast, in model building and game design no such puzzle is available. That means the problem is located on a higher abstraction level which remains invisible to the researcher.
[Erkut2000] [Gervautz1994]
 
4c1d Abstraction mechanism 1x1
The main reason why it is hard to program robots is the lack of words for describing the issue. The existing computing terms, which were created with the advent of microelectronics, don't fit the AI domain. Terms like operating systems, transistors and 8-bit CPUs, and even theoretical terms like hash tables, sorting and search algorithms, don't fit the needs of robotics programming well. Without a language for describing AI it is not possible to figure out possible answers, and so AI remains unexplored.
So let us take a step backward and ask the bold question: what are we talking about? Robotics programming is mostly about finding an abstraction mechanism. An abstraction mechanism means converting problems into models. If the model is available, then it is possible to discuss it in computational terms which have to do with programming, fast compilers and algorithms. The problem is that what is called an abstraction mechanism is a very vague description of what AI is about. It is some sort of mapping, and the amount of literature on the subject is small.
So basically the topic hasn't been invented yet, and for this reason the terms are not available. Let us search for an academic subject which comes close to the abstraction idea. Under the term "Model predictive control" lots of books are available. MPC has to do with control theory, and the idea is to use a model in the loop to improve the control of a system. What remains open is how to create such a model. Creating a model has much in common with game design; that means it is not available in nature, an artist has to create it from scratch.
What model predictive control discusses in depth is how to control a system if the model is already there; without a model no MPC controller can be created. Because of this limitation, MPC is not the core of an abstraction mechanism but something which is understood already.
Another existing topic which comes close to the abstraction idea is the differential equation used in mathematics. Differential equations are used by physics students to describe real world systems. With such an equation it is possible to model the dynamics of a car or describe the weather. A differential equation and a model are the same thing, but modeling has more options than only a mathematical description of the world.

4c2 Mapping motion capture to a pose taxonomy
A motion capture recording produces numerical information about the markers. The position of each marker is tracked in realtime and can be stored in a machine-readable CSV file. Even if the information is accurate on the millimeter level, something important is missing: there is no meaning attached to the poses, and the trajectory remains ungrounded.
To overcome this obstacle the mocap data has to be matched against a model, which is mostly a pose taxonomy. In the pose taxonomy a number of keyframes are stored, for example 24 of them. With this model in the background, each mocap frame can be assigned to one pose in the taxonomy.
This makes it possible to convert the raw marker data into meaningful, grounded information. The capture data is turned into a pose sequence over the time axis, for example:
[pose #2 “standing”, pose #18 “sitting”, pose #15 “sitting”]
Instead of analyzing a state space which can contain millions of possible poses, the state space is reduced to 24 possible poses given by a template database. In addition, the taxonomy is of course structured in a hierarchical fashion, which allows the categories to be reduced further. In each situation the mocap data can be labeled with one of the four main categories, and then subcategories can be determined.
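To illustrate how ungrounded marker data can be matched against a pose taxonomy, the following Python sketch assigns each mocap frame to its nearest template pose by comparing joint coordinates. The template poses, the joint layout and the distance measure are simplified assumptions; a real taxonomy such as the one in [Bo2015] contains far richer pose descriptions.

import math

# Hypothetical pose taxonomy: pose ID -> (label, list of (x, y) joint positions)
taxonomy = {
    2:  ("standing", [(0.0, 1.7), (0.0, 1.0), (0.0, 0.0)]),
    15: ("sitting",  [(0.0, 1.2), (0.3, 0.6), (0.3, 0.0)]),
    18: ("sitting",  [(0.0, 1.1), (0.4, 0.5), (0.4, 0.0)]),
}

def distance(frame, template):
    """Sum of Euclidean distances between corresponding joints."""
    return sum(math.dist(a, b) for a, b in zip(frame, template))

def label_frame(frame):
    """Assign the mocap frame to the closest pose in the taxonomy."""
    return min(taxonomy, key=lambda pid: distance(frame, taxonomy[pid][1]))

# Raw mocap frames (simplified to three joints each) become a pose sequence.
recording = [
    [(0.0, 1.69), (0.01, 1.0), (0.0, 0.0)],
    [(0.0, 1.12), (0.38, 0.52), (0.41, 0.0)],
]
sequence = [(pid, taxonomy[pid][0]) for pid in map(label_frame, recording)]
print(sequence)  # e.g. [(2, 'standing'), (18, 'sitting')]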
see also:
- [4b3] The grounding problem in detail
- [4b6] Shadowing movements of a robot
[Bo2015]
[Ma2013]

4c2a From a body pose taxonomy to a keyframe generator
Instead of analyzing what a taxonomy is in its own right, let us show how to use it in a concrete situation. Suppose the user would like to create an animation for a biped walking robot. The goal is to create a sequence of 4 single pictures. Such a task can easily be realized with a body pose taxonomy.
What the user has to do is execute a method in the program such as “generationanimation(8,9,4,5)”. The numbers given as parameters are the keyframe IDs from the pose taxonomy; behind each number a concrete pose is stored. After executing the method, the program delivers the pictures, each stored as a list of points. These point lists are then rendered into a graphical representation in a video.
The pose taxonomy is basically a shorthand notation for specifying an animation. The idea is that the user enters only a list of IDs, and these IDs are converted automatically into full-blown keyframes which consist of many single points. These points are located in 2D or 3D space and represent the pose.
So the main task of a taxonomy is to assign a pose number to a concrete pose. In addition, the pose numbers are ordered in a hierarchical table of contents. In advanced taxonomies there is even a transition matrix available; for example, the sequence from #8 to #9 is allowed while the sequence from #8 to #1 is forbidden. If the user tries to render such a forbidden sequence, he gets an error in return.
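A minimal sketch of such a keyframe generator is shown below. The function name, the stored poses and the transition matrix are made up for illustration; the text above only specifies that a call like generationanimation(8,9,4,5) should expand the IDs into point lists and reject forbidden transitions.

# Hypothetical pose taxonomy: pose ID -> keyframe as a list of (x, y) points.
poses = {
    4: [(0.0, 0.0), (0.2, 0.9), (0.1, 1.7)],
    5: [(0.1, 0.0), (0.3, 0.9), (0.2, 1.7)],
    8: [(0.0, 0.0), (0.0, 1.0), (0.0, 1.8)],
    9: [(0.2, 0.0), (0.1, 1.0), (0.1, 1.8)],
}

# Allowed transitions between pose IDs (a tiny transition matrix).
allowed = {(8, 9), (9, 4), (4, 5)}

def generate_animation(*ids):
    """Expand pose IDs into full keyframes, rejecting forbidden transitions."""
    for a, b in zip(ids, ids[1:]):
        if (a, b) not in allowed:
            raise ValueError(f"transition {a} -> {b} is forbidden")
    return [poses[i] for i in ids]

keyframes = generate_animation(8, 9, 4, 5)  # four keyframes, ready for rendering
print(len(keyframes), "keyframes")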
 
4c2b Pose database as an abstraction mechanism
The term abstraction mechanism describes a workflow in which a problem is converted into a model. For example, there is a line-following robot in the real world, and the goal is to simulate this robot in a computer program.
Creating such models is one of the hardest problems within Artificial Intelligence, and missing models are the main reason why today's robots struggle to solve tasks. In contrast, an existing, well-grounded model results in a working robot control system, similar to a 15-puzzle solver program which is able to solve the task much faster than any human can.
So the basic question is: how to create a model? A model, aka a simulation, is equal to a physics engine, which is the part of the software that can predict future system states. Unfortunately, physics engines are hard to program. Most existing physics engines like Box2D work with differential equations and lots of mathematics in general. A lot of experience is needed to create a forward model for a certain domain.
A possible way to overcome the obstacles is to create only a data-driven model first. Data-driven means that the model can't be executed because it doesn't consist of source code; instead, the model is equal to a database file. The idea is that in a second step the database is converted into an executable physics engine.
Creating a database file for a domain is much easier than programming a full-blown physics engine in C/C++. A database is mostly stored in a plain text format or as a table. The information is taken from motion capture devices and game logger tools. A typical example is a body pose database: the idea is to create a list of 24 predefined poses and store this list in a database file.
Then a ranking algorithm is needed to retrieve the information. For example, the user defines a certain pose, and the 3 most similar entries from the database are returned in an ordered fashion. Such a system is a basic example of a data-driven model.
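The following sketch shows what such a data-driven model could look like: a small pose database stored as plain records, plus a ranking function which returns the three entries most similar to a query pose. The pose encoding (a handful of joint angles) and the similarity measure are assumptions made only for the example; a real database would hold the full 24 predefined poses.

# Hypothetical pose database: each entry is a name plus a few joint angles.
pose_db = [
    ("standing",  [0, 5, 0, 0]),
    ("sitting",   [90, 90, 10, 0]),
    ("kneeling",  [90, 10, 80, 0]),
    ("lying",     [0, 0, 0, 90]),
    ("crouching", [100, 100, 30, 0]),
]

def similarity_rank(query, k=3):
    """Rank database entries by distance to the query pose, closest first."""
    def dist(entry):
        return sum(abs(a - b) for a, b in zip(query, entry[1]))
    return sorted(pose_db, key=dist)[:k]

# Querying the data-driven model: the three most similar stored poses.
print([name for name, _ in similarity_rank([85, 95, 15, 0])])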
The original idea was to create an abstraction mechanism. A database can be seen as a model of reality. That means everything stored in the database is part of the domain. Instead of analyzing the domain itself, the interaction takes place with the database. From the computer's perspective the world consists of the database, which stores 24 body poses. The main advantage is that this understanding of the world reduces the complexity dramatically. It helps to decide which sort of software has to be programmed next. The software doesn't need to become intelligent in the sense of Artificial Intelligence; the software is simply a database query algorithm. [Ch2012]

References

[Bo2015] Borras, Júlia, and Tamim Asfour. "A whole-body pose taxonomy for loco-manipulation tasks." 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015.
[Br1983] Braun, Martin, et al., eds. Differential equation models. Vol. 1. Springer-Verlag, 1983.
[Ch2012] Choudhury, Safwan, Derek Wight, and Dana Kulič. "Rapid prototyping toolchain for humanoid robotics applications." 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012). IEEE, 2012.  
[Ch2012] Chen, Tao, et al. "Poseshop: Human image database construction and personalized content synthesis." IEEE Transactions on Visualization and Computer Graphics 19.5 (2012): 824-837.
[De2017] De Smedt, Quentin, et al. "Shrec'17 track: 3d hand gesture recognition using a depth and skeletal dataset." 3DOR-10th Eurographics Workshop on 3D Object Retrieval. 2017.
[Do2003] Dontcheva, Mira, Gary Yngve, and Zoran Popović. "Layered acting for character animation." ACM SIGGRAPH 2003 Papers. 2003. 409-416. 
[Erkut2000] Erkut, Cumhur. "Abstraction Mechanisms in Computer Art." Seminar on Content Creation. 2000.
[Gervautz1994] Gervautz, Michael, and Dieter Schmalstieg. "Integrating a scripting language into an interactive animation system." Proceedings of Computer Animation'94. IEEE, 1994.
[Ko2011] Kollar, Thomas, et al. "Towards Understanding Hierarchical Natural Language Commands for Robotic Navigation and Manipulation." (2011).
[Ko2014] Kollar, Thomas, et al. "Grounding verbs of motion in natural language commands to robots." Experimental robotics. Springer, Berlin, Heidelberg, 2014.
[Ma2013] Marcos-Ramiro, Alvaro, et al. "Body communicative cue extraction for conversational analysis." 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 2013.
[Mi2021] Mittal, Mayank, et al. "Articulated object interaction in unknown scenes with whole-body mobile manipulation." arXiv preprint arXiv:2103.10534 (2021). 
[Th1989] Thalmann, Daniel. "Motion control: From keyframe to task-level animation." State-of-the-art in Computer Animation. Springer, Tokyo, 1989. 3-17.
[Tu1994] Tu, Xiaoyuan, and Demetri Terzopoulos. "Perceptual modeling for the behavioral animation of fishes." Fundamentals of Computer Graphics. 1994. 185-200. 
[Ya2021] Yang, Tsung-Yen, et al. "Safe reinforcement learning with natural language constraints." Advances in Neural Information Processing Systems 34 (2021).

Index

Animation [4b5]
Grounding [4b]
Game state [4b1]
Instrument panel [4]
Micromouse [3b]
Petrinet [3a1c]
Production line [3a1e]
Reward function [3a1]
Sensor reading [4a]
Tetris [3a1g]
Text adventure [3e] [4b1a]