An abstraction mechanism is a middle layer between a robot and the domain. It helps to simplify the state space. From a technical perspective, such a layer can be realized as a database. The structure is organized as a taxonomy, and the underlying json file holds no source code but only textual information. This model is used to retrieve body poses, waypoints and maps.
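To make this concrete, here is a minimal sketch of what such a taxonomy file could look like. All node names and values are invented for illustration; a real system would use its own domain vocabulary.

    import json

    # Hypothetical data-only taxonomy: no source code, only information.
    model = {
        "bodyposes": {"1": {"name": "stand", "hip": 0, "knee": 0},
                      "2": {"name": "walk", "hip": 30, "knee": 15}},
        "waypoints": {"1": {"x": 0, "y": 0}, "2": {"x": 4, "y": 2}},
        "map": {"width": 10, "height": 10, "walls": [[3, 3], [3, 4]]},
    }

    with open("model.json", "w") as f:
        json.dump(model, f, indent=2)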
1 1 Artificial Intelligence in the past
Instead of explaining how to create robot control systems which work with motion retrieval and GUI interaction, let us take a look back into the past at how AI was conceived 30 years ago. From today's perspective this understanding does not look very powerful, but in the 1990s the AI pioneers were convinced that this is how intelligence works.
In the past there were two major assumptions about what artificial intelligence is. The first was that the system works autonomously, and the second was that AI is equal to a certain algorithm. Let us take a closer look into both biases.
An autonomous robot is the opposite of a teleoperated robot and the opposite of interactive control. It simply means that no human is in the loop; the system processes all the information by itself, without intervention from the outside. Not a human or a human organization, but the computer or the robot alone, with hardware and software, does the work.
The second idea was that the exact way decisions are made is determined by a computer program, an algorithm. The algorithm contains what is known as artificial intelligence; it is a certain way of making decisions. Until the 1990s this paradigm was seen as the only way to realize artificial intelligence. Instead of creating robots in general, the idea was to create algorithm-controlled autonomous machines.
The idea was to invent some sort of robot which obeys nobody; instead, the robot is equipped with an advanced algorithm, and this algorithm is executed on the main CPU. This algorithm decides what the robot does next and produces artificial intelligence.
From today's perspective this development idea looks funny. If the robot acts on its own it cannot profit from the higher intelligence which is located in the environment, and if all decisions are made by algorithms the CPU is utilized heavily. Creating a robot in such a way produces an NP-hard problem which has no external constraints. That means it will occupy all the CPU resources without doing anything useful. But around the 1990s nobody thought this way, because imagining autonomous, algorithmically controlled robots is the easiest conception of artificial intelligence.
The reason is that with this definition in mind, the system works similarly to a computer with software, and in addition there is no need for a human operator because the AI controls itself.
I've explained the past understanding in detail because it helps to recognize why today's AI is different from its counterpart in the past. More recent AI projects use exactly the opposite paradigm. Current AI projects are interactive and they are data driven. That means autonomy and algorithmic control are no longer goals; the idea is to realize robots in some other way.
But let us go back to the idea of autonomous, algorithmically controlled machines. The idea has a lot to do with programming a computer. The program is stored in the computer's main memory. The program can be realized in C/C++ or assembly language, and the program does something. This is equal to running an algorithm. The inner working of the program depends mostly on the computer and the operating system. There are subroutines, jump labels and a mathematical processor to calculate something. The overall system can be described as a Turing machine. And if someone has understood how such a virtual machine operates, he has grasped the secret of artificial intelligence too. At least, this was the understanding 30 years ago.
Perhaps the comparison with a Turing machine makes sense to explain how AI was imagined in the past. Suppose a Turing machine is running a sorting program. The inner working is based on two principles. First, it is an algorithm which is executed on the Turing machine, and secondly, this algorithm is completely autonomous. That means, after starting it, the algorithm will sort all the 1000 elements in the array and then it stops. This understanding was seen as a valid description of what artificial intelligence is about. The idea was that AI is some sort of sorting algorithm which is running on a Turing machine.
The reason why this understanding persisted in the past is that it wasn't verified. That means robots were not built with this approach; it was a theoretical idea. At first glance it looks realistic to imagine AI this way: someone has to program an algorithm, then the algorithm runs autonomously, and at the end it will allow a robot to walk forward and take decisions.
What the people 30 years ago didn't know was that this system design has a high failure probability. If someone programs a robot this way, the chance is high that the programmer doesn't know exactly how to program the robot, and if the project is realized the robot won't walk a single centimeter. This knowledge about failed robotics projects wasn't available in the past, so the understanding at that time was naive.
1.1 1a The grounding problem
Artificial intelligence in the past worked with some assumptions. To summarize the ideology (Section 1), we can say that in the past the grounding problem wasn't recognized as a challenge and therefore it wasn't solved. The grounding problem is about connecting an artificial intelligence with the environment and embedding it into reality, for example by controlling a robot semi-autonomously or by giving the robot access to a body pose taxonomy with prerecorded motions.
Grounding basically means to ignore the computer science perspective. The quest is not how to program a Turing-machine-like algorithm which is able to do number crunching; grounding means that AI and robotics have to do with natural language, computer animation and models.
The reason why grounding is hard to realize, and wasn't recognized as a challenge over decades, is that all these things are hard to formalize with a computer and they are located outside of computer programming. Or let me explain it the other way around. In the entire C reference manual, and also in the description of the x86 CPU which consists of 6 large books, there is not a single chapter about how to process natural language or even how to program a video game. The reason is that these subjects belong to practical computing. That means somebody is using the C language and a 32-bit CPU for doing a concrete task.
1.1.1 1a1 Environment design
The grounding problem can be roughly translated as "missing environment". An environment is a problem definition in which a computer algorithm is able to solve the problem. For example, the 15 puzzle is such an environment.
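As a minimal sketch, such an environment can be written down as a state plus a move rule. The encoding below is one possible choice, not a standard; together, the board and the move function are enough problem definition for any off-the-shelf search algorithm to work with.

    # The 15 puzzle as an environment: a state and a legal-move rule.
    # 0 marks the empty cell; a board is a tuple of 16 numbers.
    START = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 15)
    GOAL = tuple(list(range(1, 16)) + [0])

    def moves(state):
        """Return all states reachable by sliding one tile."""
        result = []
        hole = state.index(0)
        row, col = divmod(hole, 4)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < 4 and 0 <= c < 4:
                other = r * 4 + c
                board = list(state)
                board[hole], board[other] = board[other], board[hole]
                result.append(tuple(board))
        return result

    print(len(moves(START)))  # 3 legal moves from the start state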
A possible strategy to create environments from scratch is game design and learning from demonstration.[Kirk2016] Both techniques are not located in computer science itself, and there is no single algorithm or library available for this purpose; it has to do with problem invention. It is the same sort of task as inventing a new board game. Instead of trying to play or win an existing game, the idea is to imagine rules and maps which allow other players to interact.
1.1.2 1a2 Body pose database
Explaining a body pose database is, from a technical perspective, surprisingly easy. It is an SQL database or a json file which stores keyframes of a character, for example "stand", "walk", "sit" and so on. It is a table which holds some integer values, and these values are the positions and angles for the skeleton.
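A minimal sketch in Python could look like the following. The pose names and angle values are made up, and a real skeleton would of course have more joints.

    # A body pose database as a plain table: id -> name plus joint angles.
    poses = {
        1: {"name": "stand", "hip": 0,  "knee": 0,  "ankle": 0},
        2: {"name": "walk",  "hip": 30, "knee": 15, "ankle": 5},
        3: {"name": "sit",   "hip": 90, "knee": 90, "ankle": 0},
    }

    print(poses[2]["name"], poses[2]["hip"])  # walk 30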
Programming such a body pose database in Python, Java or any other programming language is trivial. The more demanding question is why exactly such a database is needed in a robot control system. The underlying idea has much in common with a map in a video game. A map specifies the environment in which the game takes place.[Watkins2016] Before the robot in the game can do something, it needs constraints, for example a wall, or an interesting location in the map which needs to be discovered. Then the robot takes decisions and plans the path.
A body pose database and a 2d map for a video game have much in common. Both are an environment in which actions can take place. The idea of a map is that the player can be at one of the locations, while the principle behind a pose database is to select one of the cases. It is some sort of abstraction mechanism which works with addresses in the map or identifiers in a database.
In video game programming these elements are sometimes called resources. There are many data-only files for a single game, for example maps, sprite sheets (Section 3), music, MIDI sheets, story lines, text files which contain dialogues and so on. What these resources have in common is the absence of programming code (see also Section 2.2).
1.1.3 1a3 Mental Models
In the history of artificial intelligence the idea of mental models and the agent's internal representation was introduced early. It was clear that there is a need to create a middle layer between a domain and a robot. The problem was, and is, what exactly such a model has to look like.
The answer to the problem is surprisingly easy. A mental model and a game are the same thing, and creating a model from scratch is similar to game design. The existing tools for creating video games can be adapted. Especially prototyping tools for creating levels, sprite sheets and story boards are the best practice method to create models in general.
It is important to know that in the AI research of the past (Section 1), the similarity between mental models and video games wasn't known. Until the 1990s the disciplines of video game programming and artificial intelligence were strictly separated. It took until the year 2000 for AI researchers to discover the domain of video games, because such restricted domains make it easier to create artificial intelligence. What all the projects about computer chess, Tetris-playing bots and 15-puzzle-solving neural networks have in common is that they successfully demonstrated that an AI is able to play these games.
This success can be seen as a clue that video games are a good starting point for understanding what AI in general is. The only unsolved problem is how to model a certain game, or how to model a lot of games with the same data structure.
1.1.4 1a4 Abstraction mechanism
The grounding problem is complicated to describe and even more complicated to solve (Section 1.1). In a short sentence, it is about converting a domain into a simplified model which can be understood by a computer program.
One reason for the confusion is the normal assumption that a programming language already provides an abstraction mechanism. Instead of writing down assembly code instructions, the programmer can formulate the domain in a high level language like C or Python. Unfortunately, even high level scripting languages are not powerful enough to create models. There is a need to formalize a domain in some other way.
An early attempt was made for the subtopic of computer animation.[Webber1990] Around the 1990s the first projects were started to develop text-to-animation systems in the tradition of SHRDLU. The idea is that the abstraction mechanism is natural language: the programmer types in high level actions like "run" or "walk", and this allows him to animate a character. The technical details of text-to-animation systems are only a minor problem; they can be realized in many programming languages. The more interesting insight is the system itself: the grounding problem is solved with a certain sort of computer program.
Because computer animation is easy to understand compared to expert systems and neural networks, let us take a closer look at how a textual interface works. The idea is that there is a vocabulary of motion words, and each of the words will execute the desired animation. It is some sort of lookup table with words on the left side and animations on the right side. This mapping can be interpreted as a model. Instead of talking about computer animation in general, the idea is to focus only on the motion words. That means the computer program has around 8 different words, and all the animations are created within this vocabulary space.
If the user would like to create an animation which goes beyond the predefined vocabulary he can't, because the model is limited to exactly these words.
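A minimal sketch of such a lookup table is shown below. The eight motion words and the keyframe ids they point to are invented for illustration; the point is only that the whole animation space is spanned by this small vocabulary.

    # Text-to-animation as a lookup table: motion word -> keyframe sequence.
    vocabulary = {
        "stand":  [1],
        "walk":   [2, 3, 4, 3],
        "run":    [5, 6, 7, 6],
        "sit":    [8],
        "jump":   [9, 10, 9],
        "wave":   [11, 12],
        "turn":   [13],
        "crouch": [14],
    }

    def animate(word):
        """Return the keyframe ids for a motion word, or reject it."""
        if word not in vocabulary:
            raise ValueError("outside the predefined vocabulary: " + word)
        return vocabulary[word]

    print(animate("walk"))  # [2, 3, 4, 3]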
1.1.5 1a5 Symbolic Grounding
Grounding is a concept from symbolic AI which was famous in the 1980s. Unfortunately, it is hard to define what the idea is about. A rough estimation is that it has to do with mapping a domain to natural language. This is mostly identical to an abstraction mechanism.
A more practical approach to defining the grounding problem is by annotating a motion capture recording.[Guerra-Filho2006] The recorded motion is numerical data in a .csv file, and the goal is to label the data with natural language. This annotation is usually realized with a model. The model defines which integer values are mapped to which words and categories.
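A minimal sketch, assuming a single knee-angle channel, could map value ranges to words like this. The thresholds are invented; a real model would be learned or hand-tuned per joint.

    # Annotating mocap values with words: a model maps ranges to labels.
    def label_knee(angle_deg):
        """Map a knee angle (degrees) to a natural language category."""
        if angle_deg < 10:
            return "straight"
        elif angle_deg < 60:
            return "walking bend"
        else:
            return "sitting bend"

    recording = [2, 25, 85, 40]  # one value per frame, from the .csv file
    print([label_knee(a) for a in recording])
    # ['straight', 'walking bend', 'sitting bend', 'walking bend']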
1.1.6 1a6 Grounding without natural language
At first glance, grounding is equal to natural language processing (Sections 2.2.1 and 3.3.1). The idea is that the model which describes a domain on an abstract level is equal to formulating the situation in a sentence with words. With this understanding, grounding is equal to natural language processing.
But perhaps it is possible to ignore natural language altogether? The working thesis is that the core element of grounding has to do with database indexing.[White1996] Indexing means to retrieve the ID for an entry, not to label the entry with natural language. A typical example is a body pose database.
Such a database (Section 2.4.3) contains a list of keyposes which are numbered from 1 upwards. In the database itself there is no textual description of what each pose is about; the information consists of numerical data for the body joints plus the mentioned pose id.
Indexing and database retrieval mean that the database will return the entry id for a certain search request. If the body pose is found in the table, then it is grounded. The idea is that grounding is equal to finding something in a database.
Sure, from a robotics perspective it makes sense to look up another database to find the natural language description for a certain id, for example that id=3 means the character is standing upright. But this natural language label is only additional information; it is not needed for the grounding process itself.
Let us try to summarize the strict definition. A data-only model for a domain is available and stored in a database. The user can search the database for an entry and gets the item ID back.[Grosky1994]
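In code, this strict definition is nothing more than a reverse lookup. The sketch below searches a small pose table for the entry nearest to an observation and returns its id; the table values and the distance measure are invented.

    # Grounding as indexing: search the database, return the entry id.
    poses = {1: (0, 0), 2: (30, 15), 3: (90, 90)}  # id -> (hip, knee)

    def ground(hip, knee):
        """Return the id of the closest pose in the database."""
        def distance(item):
            pose_id, (h, k) = item
            return abs(h - hip) + abs(k - knee)
        best_id, _ = min(poses.items(), key=distance)
        return best_id

    print(ground(28, 17))  # 2 -- the observation is grounded as pose #2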
1.1.7 1a7 Annotation vs indexing
The term indexing was first introduced in the context of database creation. The idea is that a search index is created to speed up full text search. Later the concept was extended to visual databases and, very importantly, to motion databases. The workflow has to do with adding new items to an existing database. Adding means to store the object and update the index.
Unfortunately, there is no concrete algorithm available for this process. Very general approaches like a B-tree index can't answer how to create visual databases. The logical consequence is to ask whether indexing is perhaps not so important at all. The next likely technique would be annotation. Annotation works very differently from indexing.
The idea is that the database is already there, and the database is queried for a certain task. During this task the input sequence is annotated with information from the database. For example, the input list is [2,1], which is a short notation for motion frames, and the database returns the full body pose for this request. The interesting situation is that annotation is a reverse abstraction mechanism. The short notation (2,1) is already there, and these symbols are converted into a longer description.
If indexing is focused on creating and updating the database, annotation is about retrieving from it. A retrieval request is equal to converting a short symbol into a long description. The short symbols depend on the concrete domain. The game of chess uses a different notation than a dance movement database, which is different again from the database for a self-driving car on a road.
In a previous chapter the grounding problem was introduced as an abstraction mechanism. Abstraction has to do with mapping a short notation into the full description, so annotation is the concrete task. What a programmer has to do is search for a short notation for a certain domain. For example, in a car driving situation possible symbols would be (l=left, a=ahead, r=right, s=slower, f=faster). The underlying data model has the obligation to annotate these symbols with full length descriptions. That means the action sequence [a,r,a,s] is converted into a detailed situation report.
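A minimal sketch of such a reverse abstraction could look like the following; the descriptions are invented placeholders for whatever the data model stores about each symbol.

    # Annotation as reverse abstraction: short symbols -> full descriptions.
    data_model = {
        "l": "steer the car to the left lane",
        "a": "keep the current heading and speed",
        "r": "steer the car to the right lane",
        "s": "reduce the speed of the car",
        "f": "increase the speed of the car",
    }

    def annotate(actions):
        """Convert a short action sequence into a situation report."""
        return [data_model[a] for a in actions]

    for line in annotate(["a", "r", "a", "s"]):
        print(line)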
1.1.8 1a8 Grounding with sensor fusion
Grounding is explained frequently in the AI literature, but newbies especially find it hard to understand the concept. It remains unclear how to ground a domain in software and what to do after a model has been grounded successfully.
From a bird's eye perspective, grounding can be seen as a mapping from labels to values. It is equal to a json file which contains a domain-specific taxonomy. From a low level perspective, grounding means to map sensor readings into a feature matrix. This is sometimes called sensor fusion.
But let us go a step backward and start from the beginning. A robot has sensors, say s1, s2 and s3. These sensors provide numerical data all the time. Grounding means to convert this information into a model. The model is a list of the sensors and their names. The resulting data model provides a perspective on the world.
The next question which is frequently asked is what to do with such a model. Suppose we have a table in which the sensor values are recorded. The next step is to calculate the cost function and predict future system states. For example, if the robot is near the wall, the costs are high. The information about what "near the wall" means is answered by the sensor model: there are one or many sensor values which are able to provide this information.
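The following sketch shows this pipeline under invented names and thresholds: three raw sensors are fused into a named model, and a cost function reads the model instead of the raw values.

    # Sensor fusion into a named model, plus a cost function on top.
    def fuse(s1, s2, s3):
        """Map raw readings into a named sensor model."""
        return {"wall_distance_cm": s1, "speed_cms": s2, "heading_deg": s3}

    def cost(model):
        """High cost when the robot is near the wall or too fast."""
        c = 0.0
        if model["wall_distance_cm"] < 20:  # 'near the wall'
            c += 10.0
        c += 0.1 * model["speed_cms"]
        return c

    state = fuse(15, 40, 90)
    print(cost(state))  # 14.0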
One possible reason why grounding is seldom realized for practical robots is that no concrete algorithm or software library is available which is able to handle the task. From a technical perspective, a grounded model is some sort of database, a json file or even an Excel sheet. This sheet collects information. If the table has one or more columns for storing the features, it is a grounded model which describes the domain. From a computer science perspective such a data structure is trivial, and perhaps this is the reason why it is sometimes ignored. Computer science is mostly about inventing algorithms and implementing them in source code, and such a challenge is not available here.
1.1.9 1a9 Sensor fusion with feature tree
From a technical perspective it is very easy to create an advanced grounded model. All that is needed is a json file or a Python dictionary. The tree has around 20 nodes which are structured hierarchically, and in each node the sensor data is logged. Such a file will need around 0.5 kb on the hard drive and can be created in any text editor. The surprising situation is that most robots don't contain such a model, and the programmers are not planning to add one.
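A minimal sketch of such a feature tree is given below, with fewer nodes than the 20 mentioned above; the grouping into body, environment and mission is one invented possibility.

    # A hierarchical feature tree: the grounded model as nested dictionary.
    feature_tree = {
        "body": {
            "joints": {"hip": 30, "knee": 15},
            "battery": {"voltage": 11.8},
        },
        "environment": {
            "wall_distance_cm": 15,
            "line_sensor": {"left": "dark", "right": "bright"},
        },
        "mission": {"waypoint_id": 3, "progress": 0.4},
    }

    # Logging means updating the leaves with fresh sensor readings.
    feature_tree["environment"]["wall_distance_cm"] = 14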
The reason for this absence is that the value of sensor fusion, and especially of storage in a feature tree, is unknown in mainstream robotics. That means 10 out of 10 programmers will say that such a tree is trivial and can be created easily, but nobody knows why such a tree makes sense. From a philosophical standpoint such a sensor taxonomy is equal to a grounded model: a domain is mapped into a data model and stored with concrete values in main memory. At the same time, it is hard to explain what the advantage is.
Perhaps it makes sense to explain the opposite situation. Suppose a line following robot doesn't store the sensor data in such a tree. At the same time, the robot is able to solve a problem, for example to stay on the line. The robot does so by direct programming. In the program some if-then statements ask the sensors for a value, and then an action is generated. That means the robot will reach the goal without using a grounded model.
Most robot programming tutorials read the following way: the robot has to react to the environment, and a statement like "if left sensor = dark, then right motor off" will do the task very well. The underlying assumption is that programming the AI has to do with creating program code. This code forms the algorithm, so it is a process oriented approach to control the robot.
In contrast, a grounded model puts the focus on the data model. The idea is that no programming at all is needed; the AI depends on a taxonomy which maps the domain to the robot. Both styles are contrasted in the sketch below.
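A minimal, invented comparison for the line follower: the first function is the process oriented rule from the tutorial, the second reads the same decision out of a data-only table.

    # Process oriented: the rule lives in the code.
    def control_direct(left_sensor):
        if left_sensor == "dark":
            return "right motor off"
        return "both motors on"

    # Data driven: the rule lives in a taxonomy, the code only looks it up.
    policy = {"dark": "right motor off", "bright": "both motors on"}

    def control_grounded(left_sensor):
        return policy[left_sensor]

    print(control_direct("dark"), "|", control_grounded("dark"))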
1.2 1b Interactive robotics
Interactive robotics is an unusual paradigm within artificial intelligence. The reason is that it contradicts the goal of creating fully autonomous machines. Interactive usually means that the robot is controlled by a human in the loop. The human uses a remote control, which can be a smartphone or an XBOX gamepad. At first glance, calling such a machine a robot doesn't make sense, because the term is reserved only for real, autonomous robots.
A closer look into the phenomenon of interactive robotics shows that the concept is interesting even from an AI perspective. The reason is that there is a fluid transition from teleoperation towards autonomous control. For example, a human operator can control the movements of the robot directly, or he can point and click on a map to define waypoints. The latter case is nearer to autonomous control because the human operator interacts on a higher level.
Let us try to describe the in-between steps (see also Section 3.2). In the most basic approach the robot doesn't even have an electric motor; the human has to push the robot or the car forward manually. More advanced systems have an onboard motor and can be controlled with a remote control. The next step is sophisticated GUI based interactive systems. And the interesting question is what exactly such a GUI interface has to look like.
The underlying problem is about low level vs. high level control. The idea is that the human operator interacts on a high level and the robot does subtasks on its own. For example, the human paints a curve on the smartphone display and then the robot follows this line. These high level control systems are usually realized with a model in the background. The model stores domain specific information, and the human operator interacts with this model.
The easiest form of a robot model is perhaps inverse kinematics. The IK paradigm simplifies the interaction with a robot arm. In theory many other models like body pose databases, predefined storylines and motion graphs are possible. What these advanced models have in common is that they produce surprisingly powerful GUI interfaces. With a simple mouse click the human operator can activate a complex behavior.
From the AI perspective it is important to know what a model is. Unfortunately, there is no clear definition available. A working thesis might be that a model is equal to a map in which actions can take place. It has much in common with a mathematical question, for example "a+x=3, what is x?"[Braun1983]. Such a question is the model, and the player of the game has to determine the value for x. In the case of robot control a typical model based question would read like "There are 5 waypoints on a 2d map, find the shortest path to node #3".
The interesting situation is that all these questions can be solved. The answer is either trivial, or there are lots of algorithms and techniques available to solve the problem. The only hard element in the loop is how to create the quiz, the model itself.
1.3 1c From algorithm to data models
In the past, artificial intelligence worked with certain assumptions. AI was often treated as solving a problem, so it was the same as finding and implementing an algorithm. For example, there are 6 cities given on a map and the AI has to find the shortest path; a possible algorithm would be A*. Another typical example from the past would be: "there is a chess board and the task is to place queens on the board so that they do not attack each other". Similar to the example with the shortest path, the domain asks for an algorithm which is able to solve the issue. The term "to solve" is equal to an algorithm.
Such a perspective was common for AI until the year 2000, and the perspective was adapted to more complex problems, for example "there is a biped robot and the problem is to find an algorithm which makes the robot move". At first glance, such a biped walking problem has much in common with the 8 queens problem. The idea is that it can be solved with an algorithm, and then the algorithm gets implemented in C/C++.
But from today's perspective such a strategy won't work (see Section 3.1). Over decades, AI researchers have tried to understand why. And indeed it is a bit hard to recognize why robotics problems are not the same as optimization problems. The reason is that the problem was formulated the wrong way. The first thing to do is not to search for an algorithm but to ask back for a simpler problem.
Let us go back to the issue of the biped robot. Instead of searching for an algorithm which makes the robot move, the idea is to reformulate the domain first. So the answer to the problem is to reformulate the question? Exactly, and this is the reason why robotics is hard.
A more realistic domain to solve is the following task: "Formulate a robot competition which can be solved by an average programmer". That means the participant of the quiz doesn't have to create an algorithm; he has to invent a robot competition. For example, he can answer that Micromouse-like competitions are good, or he can invent a small graph based game in which the robot has to move from a to b.
Instead of searching for a powerful algorithm to solve a problem, the trick is to search for easy to solve robotics domains. If such a domain is found, it can be solved with an off-the-shelf algorithm like A*.
The idea of inventing AI problems and inventing robotics challenges can be seen as revolutionary. It took until the year 2000 before the first robotics competitions arrived in public awareness.
More recent robotics projects are mostly about inventing such challenges. This is often introduced as model based robotics. Model based means that someone invents an optimization problem, and then the same person or others can solve it. A typical example for a model would be a micromouse simulator written in Java; the OpenAI gym environment with the cartpole problem is also a model. In both cases the challenge is not to find the algorithm for solving these domains; the challenge was to program the environment itself (see also Section 1.1.1).
1.4 1d Data only models
Computer science works mostly with an algorithm and programming perspective (Section 1). The assumption is that the programmer first has to analyze a problem, and then a runnable program is created in a certain programming language. For example, the idea is to program a word processing application, and for doing so submodules are created in UML notation which are later converted into C++ software. The logical next step would be for the programmer to share the resulting code under an open source licence with others, so that future problem solving becomes easier.
Unfortunately, this strategy fails for artificial intelligence problems. In AI there is no need to program something. Sure, some robot control systems were created in Java and other languages, but the code is trivial. Even if the code is available in an open source format, other researchers won't profit from it. The more elaborate attempt to share information with others is to focus on data models. A data model is the absence of source code; it is a text file. Examples for data-only models are sprite sheets, motion graphs, body pose databases, story taxonomies, 2d occupancy maps and so on.
If another user has access to the same 2d obstacle map, he can program a solver to find the shortest path in the map. The interesting element is not the solver itself, which is mostly a standard A* algorithm, but the 2d map. This map is created in a tile based editor and is stored in a json format or as plain ASCII. The existence or the absence of the map determines whether the robot is able to solve the pathfinding problem.
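As a sketch, such a data-only map can literally be a few lines of ASCII; the layout below is invented, and parsing it takes a handful of lines which any solver can share.

    # A 2d obstacle map as plain ASCII: '#' is a wall, '.' is free space.
    ascii_map = [
        "#######",
        "#..#..#",
        "#..#..#",
        "#.....#",
        "#######",
    ]

    free = {(r, c) for r, row in enumerate(ascii_map)
            for c, cell in enumerate(row) if cell == "."}
    print(len(free), "walkable tiles")  # 13 -- the model any A* solver can use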
Suppose the idea is to create an advanced robot control system. Traditional programmers would ask in which language it was written. They ask whether it was programmed in C++, Java or Lisp. Others may ask which sort of algorithm or neural network architecture was utilized, for example a backpropagation algorithm combined with a four layer neural network implemented in highly efficient C/C++. But all this information is useless for understanding the robot control system. The more elaborate attempt is to focus on the non-algorithmic elements, which are the 2d maps, motion graphs, taxonomies and so on. These data models hold the logic of a robot control system (see Section 1.1.2).
The main obstacle to realizing data-only models is perhaps the contradiction with established computer science. In computing the idea is mostly to program something which can be executed on a computer; this is what programming is about. Creating models with a sprite editor or drawing motion graphs in a GUI doesn't fit this paradigm very well. Typing in text and drawing lines on the screen is treated as an artistic application of a computer, but not as computer science or artificial intelligence. This might explain why data-only models are not discussed frequently in the literature.
To overcome the contradiction we have to raise the abstraction level a bit. Before a programmer can implement an algorithm he needs a mathematical problem. For example, the problem is to search in a text for a string, and solving the problem can be done with Java plus a certain algorithm. Artificial intelligence is not about solving problems; AI is mostly about inventing problems. A data-only model helps to formulate such problems. For example, if the designer draws a map which contains walls, this map is equal to the problem. The map is the environment and defines the constraints (Section 1.1.2). It is then possible to solve the map with a robot which uses a pathfinding algorithm.
1.4.1 1d1 From features to a model
Features are low level descriptions of reality. Typical examples are the joint angles of a robot, or the words in a text. This information is aggregated in a data model. Such a model can be imagined as a taxonomy which sorts the information in a hierarchical way.
Model tracking means to match the sensor information from a domain against the already created data model.[Shah2009] For example, the mocap information is fed to the model, and the model recognizes that the situation looks similar to the existing entry #11.
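Tracking against a model can be sketched as a nearest-entry search; the entries and the distance measure below are invented for illustration.

    # Model tracking: match an observation against existing model entries.
    model = {
        10: [0.0, 0.0, 0.0],
        11: [0.9, 0.4, 0.1],   # the entry the observation should match
        12: [0.2, 0.8, 0.7],
    }

    def track(observation):
        """Return the id of the model entry nearest to the observation."""
        def dist(item):
            _, entry = item
            return sum((a - b) ** 2 for a, b in zip(entry, observation))
        entry_id, _ = min(model.items(), key=dist)
        return entry_id

    print(track([0.85, 0.45, 0.15]))  # 11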
Unfortunately, it is hard to adapt this general description to concrete robotics problems. The features in a motion capture recording are different from the feature model for a self driving car. What all the domains have in common is that the features are stored on a temporal axis. That means, at each time step a set of features is available.
Modeling a domain usually means to create a taxonomy with key poses of the system. In the case of computer animation this is equal to keyframes or body poses. In the case of document parsing it is equal to important sentences in a text.
1.4.2 1d2 Mocap models
Most existing research papers about grounded motion models have their origin in the same technique, called motion capture. Technically, motion capture means only to record the positions of the markers into a database. The reason why this is interesting is that after the data has been logged, some sort of model is needed to compress the information. The pros and cons of different clustering techniques are described in the literature.
But let us take a step back and ask first what mocap is about. Instead of writing an AI algorithm for controlling a robot, the idea is to let a human actor do an action, for example walk a meter forward. Tasks which are usually well understood, like standing up or running, become more interesting from the perspective of a mocap recording, because the recorded numerical information provides the input data which needs to be processed with advanced statistics tools.
There is an endless amount of strategies available for doing so. Neural networks, PCA and k-means clustering are among the easier to understand tools. The shared similarity is that most robotics engineers will agree that motion capture is a key technique for understanding what artificial intelligence is about. It generates two simple questions: how to store mocap data in a computer, and how to use an existing mocap model to generate new trajectories in the virtual world.
Of course, both questions remain unanswered here, but it sounds interesting to investigate the subject in detail. Motion capture is first and foremost a data capturing technique. It is similar to recording temperature information with a weather station or measuring the distance from a robot to an obstacle with a lidar sensor. It is some sort of low level sensor logging technique which can be utilized for a variety of domains.
1.4.3 1e Advantages of a grounded model
In the previous sections it was explained how to create a hierarchical data structure for storing the game state. Such a data structure can be mapped to an array in the computer memory, so that the computer can use the array for further processing. What was left open is the reason why such a model has to be created.
The model itself is indeed useless; it can't control the robot. But the abstraction mechanism is needed by further steps. What q-learning, feature map visualization, cost function learning and neural networks have in common is that they need a grounded dataset as input information. Such a dataset is usually prepared during the preprocessing step. It results in a vector of input neurons which is fed to the neural network or used as input parameters for the cost function.
A data taxonomy is exactly such a dataset. And without such input data it is not possible to run q-learning or any other algorithm. Basically spoken, a grounded model is needed for nearly every advanced AI algorithm available. And the quality of the algorithm depends heavily on the data quality. That means, if the array is normalized, has a hierarchical structure and has something to do with the domain, then the chance is high that one of the previously mentioned algorithms is able to control the robot. From the perspective of reinforcement learning, a grounded dataset is equal to providing a vector of input neurons. Each neuron has a normalized value. For example, neuron #1 is about the speed of the robot, neuron #2 about the direction and #3 about potential obstacles.
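A sketch of this preprocessing step, with invented sensor names and ranges: the grounded model is flattened into a normalized input vector with a fixed neuron order.

    # From grounded model to input vector: flatten and normalize.
    state = {"speed_cms": 40, "heading_deg": 90, "wall_distance_cm": 15}

    RANGES = {"speed_cms": 100, "heading_deg": 360, "wall_distance_cm": 200}

    def to_input_vector(state):
        """Return normalized values in a fixed neuron order."""
        order = ["speed_cms", "heading_deg", "wall_distance_cm"]
        return [state[key] / RANGES[key] for key in order]

    print(to_input_vector(state))  # [0.4, 0.25, 0.075]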
Or let us formulate the situation the other way around. Suppose we want to start the q-learning algorithm but no game state is available in the array. How likely is it that the algorithm will do something useful? Right, it is a rhetorical question.
2 2 High level task planning
Motion capture and motion retrieval work on the lower level. The idea is to store a pose in a database, including the angles of the joints and the velocity (Section 2.4.3).
In contrast, task planning tries to understand actions on a higher level. The main problem with task planning and interactive narrative is that it is hard or even impossible to visualize the state. A keyframe which is used in motion planning can be visualized by rendering the character to the screen. In a walk cycle the keyframe shows a concrete body pose which includes the arms and the legs. But a high level concept like "character visits a place" is difficult to animate, because the high level concept consists mostly of natural language: it is a text string, not a picture.
A possible authoring tool which works on a high level is a plot taxonomy. It contains persons, places and events. For example, there are persons #1, #2, #3, and each has a name and a certain behavior. In addition there are possible places #1, #2, #3, which each have a different location and a language. What the characters can do is attend the places, and then an event takes place.
To store this information a hierarchical taxonomy makes sense. It is some sort of graph or mind map which holds information about the entities and their relationships. The interesting situation is that this technique was first invented by book authors who write longer novels. The system helps them to get an overview of potential interactions. For example, the book author knows that there are only 20 possible characters available. That means it is not possible that a random character is created from scratch; the characters can interact only with existing characters which are already in the database.
This principle helps to reduce the overall state space. Either a new character is invented and inserted into the taxonomy, or an existing character interacts with another character.
Let us try to describe the picture in detail. What is given in the example is a graph which has 6 nodes. Each node has a unique id and belongs to a group like person, location or event. The tree can be used to tell stories. Instead of describing in detail how a character looks, it is enough to know the node id for this character; the details are already written in the database.
The idea is that longer stories will contain dozens of such nodes which are ordered in a certain way. Then it is possible to create the story itself, which is actions over the time axis. The story is not told from scratch; the plot references existing entries in the database.
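A minimal sketch of such a plot taxonomy and a story that references it; all names and relations are invented.

    # Plot taxonomy: 6 nodes grouped into persons, places and events.
    plot = {
        1: {"group": "person", "name": "Anna"},
        2: {"group": "person", "name": "Bob"},
        3: {"group": "place",  "name": "harbor"},
        4: {"group": "place",  "name": "library"},
        5: {"group": "event",  "name": "meeting"},
        6: {"group": "event",  "name": "farewell"},
    }

    # The story references node ids: (person, event, place) per time step.
    story = [(1, 5, 3), (2, 6, 4)]

    for person, event, place in story:
        print(plot[person]["name"], "-", plot[event]["name"],
              "at the", plot[place]["name"])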
2.1 2a Comparing motion retrieval with story plotting
Both techniques work with a model which is stored in a taxonomy. The model is a plain text file which contains information. In the case of a motion taxonomy, it holds information about body poses; in the case of a story, the database holds information about events and places.
The similarity is that in both cases the model consists of data but no program code, and secondly, the information is stored in nodes which have an id. For example, node #3 from the motion database stands for a certain pose, while id #5 from the plot taxonomy holds a certain character. So the format is a key-value store for storing and retrieving information from a database.
The overall idea is first to convert a domain into a database and then use the database to generate actions. The perhaps most interesting situation is that the model holds only textual information which can't be executed on a computer. That means a simple json file is equal to the model. This json file contains subsections and acts as an abstraction mechanism between the robot and the problem definition. That means the robot is not asked to solve random problems; the robot always operates with the json file as the model.
The problem with data driven models is that they are located outside of classical computer programming. If a model holds only textual information, equal to a database, then there is no need to program something in Python or any other programming language. The chance is high that model creation is not even a computer science problem. Perhaps this explains why model generation is hard for computer programmers to realize.
Let us try to take a more abstract look at the issue. There are examples available for textual models: sprite sheets, ingame maps, plot taxonomies and body pose databases. The preferred file formats to store this information are png, txt and json. From a computer perspective such a file has no meaning because it can't be executed by the operating system. On the other hand it is very important for modeling a problem.
2.2 2b Task Models
Similar to all models, a task model is equal to a video game. It can be executed on a computer and allows to simulate a domain. With this definition in mind, the preferred notation for a task model is executable source code, for example in the Python language or in C/C++.
Unfortunately, it is complicated to use a programming language to create a high level task language. The much easier strategy is to first create a text file which holds only a database, and then create the executable task model on top. The idea is to postpone the programming step and create a simple database first.
A task model in a database format has much in common with designing interactive fiction. There are locations, characters, objects, events and possible actions, and in each category multiple items are possible. The good news is that in the early design stage only a textual description in table format is needed, not computer code. That means the task model can be stored in a markdown file or a json file.
Let us analyze an example. Suppose there is a kitchen robot and the objective is to create a task model for this robot. Possible objects in the database are the table, a spoon and the refrigerator. Possible actions are take, grasp, pour, ungrasp, open and close. The task database, the taxonomy, contains these entries, and each entry is described briefly.
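Written down as a data-only file, the kitchen task model from this example might look as follows; the brief descriptions are invented placeholders.

    # A kitchen task model: data only, no executable logic.
    task_model = {
        "objects": {
            "table":        "flat surface, items can be placed here",
            "spoon":        "small tool, can be grasped with one hand",
            "refrigerator": "storage, has a door that opens and closes",
        },
        "actions": {
            "take":    "move the arm to an object",
            "grasp":   "close the gripper around an object",
            "pour":    "tilt a grasped container",
            "ungrasp": "open the gripper",
            "open":    "pull a door open",
            "close":   "push a door shut",
        },
    }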
Such a task model can't be executed directly by a computer, but a human can use this model to understand the domain. For example, a human will recognize that a certain recipe will need a sequence of actions, or that the refrigerator can be opened and closed. These insights can't be concluded from the task model itself; the human needs common sense knowledge. That means the task model is created with a certain realism in mind.
It is important to know that model design does not have to do with solving problems but with inventing a problem. A possible role model is to assume that a robot challenge is created from scratch. That means somebody has to invent the challenge in such a way that the participants have an easy to solve problem.
2.2.1 2b1 Model based robotics
In the past artificial intelligence was seen as a complicated problem. The engineers struggled mostly with building intelligent robots (Section 1). From the historic literature it is known that even 40 years ago it was understood that heuristics and models can help to improve the overall performance. But the problem was how to implement such techniques in detail.
In a search algorithm like A*, a heuristic may help to find the pathway faster. The idea is to calculate the distance to the goal and use this information to guide the search in the game tree. The problem with cost functions is that they are based on domain knowledge. For most applications this additional information is not available or not encoded in a machine readable format. The consequence is that, at least in the past, most search algorithms were created as blind search, which means without any heuristics.
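For completeness, a sketch of such a heuristic: the straight-line distance to the goal, which is the standard admissible estimate on a 2d map. The goal coordinates are invented.

    # A distance-to-goal heuristic for guiding A* on a 2d map.
    import math

    GOAL = (9, 4)  # invented goal node coordinates

    def heuristic(node):
        """Estimate the remaining cost: straight-line distance to goal."""
        return math.dist(node, GOAL)

    print(heuristic((0, 0)))  # about 9.85; a blind search would use 0 here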
The unsolved question is how to define a model in general so that it might help to make the search faster. There is no such thing as a general model; what is available in robotics are domain specific models. It is possible to imagine a model for a biped robot, a model for a path planning problem and one for a grasping robot. The model for a path planning algorithm assumes that the world consists of nodes which are arranged in a graph. And the idea behind a biped robot model is that there is a kinematic chain which is defined by angle rotations.
There are two possible techniques available to express a model in computer source code: a programming language and a database. Because the problem is highly complex, both techniques are needed. That means a good model consists of a data-only text file plus an executable program. The reason why this complicated approach is discussed in recent robotics is that it helps to solve more advanced problems. This allows to create novel robot control systems which run fast on slow hardware.
The difficulty of creating heuristics and models is often referred to as the grounding problem (Section 1.1). It can be explained as an intermediate layer between the robot and the domain, a middle layer which is often formulated in natural language. Solving the grounding problem is equal to realizing artificial intelligence, because then the gap between computers and robotics applications is closed.
2.3 2c Book summary as mindmap
Inventing a plot from scratch is a very hard task, even if index cards are used for this purpose. A simpler pre-step is to create a summary of an existing story. All that is needed is a well known book, and then a mind map is created which summarizes the book.
A certain structure becomes visible automatically. The mind map can be grouped into a character section, a places section and a plot section. Leaf nodes are drawn for all elements, and at the end the book is visualized on a single sheet of paper.
Mind maps and taxonomies have much in common. Both have a strong focus on the connections between nodes. A typical property of a mind map is that a node contains a small amount of text, in most cases one word. And this word is connected by arrows to other nodes.
The term mind map contains the subword "map". A map is equal to an environment (Section 1.1.2). It specifies the room for thoughts. Everything that is known is stored in the mind map, while everything else is unknown. A model which simulates reality is grounded in a map. If no map is there, the model remains empty.
2.4 2d json model
Textual information can be stored easily in the json format. Compared to the XML format [Lehtonen2006][Kopp2018], json needs fewer characters. Even the disadvantage that json can't store a picture can be overcome: all that is needed is to put the filename into the json file and store the picture on the hard drive itself.
The idea is that the model is given as a json file which contains sections, and nested nodes are provided for each section. The only rule is that no programming code should be stored in the json file. The problem of writing code which does something useful with the json model is postponed to a later development step.
A typical description for a robotics domain can read the following way: there is a json file which holds information about the kitchen, including objects and waypoints. The robot has to parse the file and should execute the commands which are provided in the action section of the json model. With this problem description an average programmer can create the robot control system. The idea is to separate the domain knowledge, which is given in a purely textual format, from the robotics software itself, which parses the json file.
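A sketch of this separation, with an invented file layout: the json model carries the domain, and the program below merely parses it and walks through the action section.

    import json

    # Invented domain file: everything the robot needs to know.
    kitchen_json = """
    {"objects": {"spoon": {"x": 2, "y": 1}},
     "waypoints": {"1": {"x": 0, "y": 0}, "2": {"x": 2, "y": 1}},
     "actions": ["goto waypoint 2", "grasp spoon"]}
    """

    model = json.loads(kitchen_json)
    for command in model["actions"]:
        # A real system would dispatch to motor control here.
        print("executing:", command)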
The advantage is that there is no need to think about what the robot should do, because the world model is already given. That means everything that is not provided in the json file can be ignored.
2.4.1 2d1 Creating a text adventure with json database
Most programmers assume that Python is a great way to program a text adventure. This is only partly true. Compared to C/C++, Python is a step forward, but nevertheless the productivity of writing Python code is low. Even in Python, the programmer has to write down if-then statements and create objects. He needs many edit-compile-run cycles before a game works fine.
The problem can be solved by focusing on only a certain part of a Python program: the dictionary which stores the text for the adventure game. Such a dictionary can be saved together with the source code in the same file, or in a separate json file. The advantage of the json format over normal Python code is obvious, because no programming at all is needed to create such a file. The idea is that programming is not the important task in game design; writing down the json text is what is crucial.
For a text adventure the json file contains perhaps information about objects, locations, characters and events. For example, a node consists of an id, a name and a description. This information is used by the game engine to display something on the screen.
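A minimal, invented example of such a node collection and the lookup a game engine would perform:

    # Data for a text adventure: nodes with id, name and description.
    world = {
        "locations": {
            1: {"name": "cellar", "description": "A dark, damp room."},
            2: {"name": "hall",   "description": "Sunlight falls in."},
        },
        "objects": {
            1: {"name": "key", "description": "A small rusty key."},
        },
    }

    # The engine only looks up and prints; the content lives in the data.
    node = world["locations"][1]
    print(node["name"], "-", node["description"])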
2.4.2 2d2 Database retrieval
Suppose a database has been created which holds the model for a domain. The most obvious application is to send a request to the database. In the easiest case the human user can enter "show me the item with id #3". The database is searched for the request, and then the answer is put onto the screen.
From a technical perspective such a task is trivial to realize, because it is equal to a simple search in the json file (Section 2.4). The Python language even has built-in functionality for handling such requests. The more demanding task is to understand why a database request makes sense in a robot system. The idea is that no particular algorithm holds the artificial intelligence; the robot uses an existing database for all its decisions. The database defines how the map around the robot looks, the database contains information about a certain object, and the database holds the waypoint list. All this information can be converted into intelligent behavior in the sense that the robot can navigate in a room, will reach the waypoints and is even able to grasp an object.
The json database acts as an abstraction mechanism between the robot program and the domain. The robot can ignore reality and focus only on the database. This layered architecture reduces the state space and converts AI problems into simple to realize programming tasks, like searching a json file for a certain id.
2.4.3 2d3 Motion retrieval
After the mostly theoretical introduction (Section 2.4.2), let us give an example of how database retrieval works in practice. The idea is that some data is stored in a taxonomy. The key is the motion id, which goes from 1 to 5, and the value is the body pose of a walking character. In the next step a request is sent to the database, for example "give me the body pose for id #3". The database takes a look into its content, recognizes that there is an entry with the id #3 and returns the angles for the legs back to the main program. The perhaps most interesting situation is that no dedicated algorithm is needed for this task. Either the programmer creates the search method on his own or he uses the built-in query capability of a database. In all cases the motion retrieval is trivial from a programming perspective. The logic of such a system is not located within the program code; it is hidden in the data itself. That means somebody has created the walk cycle for the character and has labeled each body pose with an id. All the domain knowledge is stored in this data model.
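A sketch of the walk cycle retrieval described above; the five poses and their angles are invented.

    # Motion retrieval: id -> body pose of a walk cycle.
    walk_cycle = {
        1: {"hip": 0,  "knee": 0},
        2: {"hip": 20, "knee": 10},
        3: {"hip": 30, "knee": 25},
        4: {"hip": 20, "knee": 40},
        5: {"hip": 5,  "knee": 15},
    }

    def retrieve(pose_id):
        """Give me the body pose for the requested id."""
        return walk_cycle[pose_id]

    print(retrieve(3))  # {'hip': 30, 'knee': 25}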
Let us explore some possible applications. In an interactive GUI the human operator can move the mouse cursor over one of the cells, and the animation engine will show the selected body pose on the screen. If the human user moves the mouse from left to right over the cells, he will see a short animation on the screen.
2.4.4 2d4 Modeling vs. search
Classical artificial intelligence can be summarized as search in the state space (Section 1). The idea is that a problem is given, for example the knapsack problem or the 8 queens problem (Section 1.3), and the task is to solve the problem with a computer. Not a human mathematician but the computer is used to find the answer. If the algorithm works accurately, the machine has replaced the human, and this is what artificial intelligence is about.
The untold assumption of this naive understanding is that the problem is already defined and that the problem can be solved. For simple mathematical problems this is the case; there are textbooks available in which mathematical quizzes are provided (Section 1.2). But for more advanced problems, especially from the domain of robotics, an exact problem definition is missing. How can we solve a problem which wasn't defined? Right, there is no way to write down an algorithm for an unknown problem, and for this reason robotics has failed.
The logical answer to the challenge is to flip the situation. Instead of searching for efficient algorithms, the assumption is that these algorithms are already there and they work great. Existing computer science provides more than a dozen powerful search algorithms like A*, RRT and breadth first search, which are universal in the sense that any well-defined problem can be solved with them. In addition, powerful hardware and even program libraries are there. The only thing which is missing is a problem, a model.
With this background knowledge it is easy to understand what more recent AI projects are trying to achieve. They invent problems and models which fit real problems. For controlling an autonomous car a traffic simulation model is needed, for controlling a biped robot a biped robot model is needed, and for figuring out the actions of a grasping arm a model with possible grasps is needed.
The shared similarity is that all these models are stored in a data structure or in a database, and they have nothing to do with algorithms or computer programming in the classical sense. The perhaps most impressive form of a model is the rule book for the micromouse challenge. This model consists of a simple US-letter sheet of paper on which it is written down what the competition is about. It defines that the robot has to reach the goal, it provides a map, and it defines that the robot has only two sensors on the front. Such a textual description can of course be seen as a model. It is the problem description. It is the environment in which the robot has to do something.
3 3 Slicing a sprite sheet
There is a less frequently documented technique used by game programmers. It is called "slicing", and the idea is to cut a sprite sheet into smaller parts. The goal is to animate a character, and slicing is usually realized by writing down the source code which is able to do so.
Most descriptions of this domain have their roots in the Stack Overflow forum [Stackoverflow2022], and the knowledge is not very well documented from a theoretical point of view. The reason why slicing is interesting is that the task has to do with converting data into executable program code. The input, a sprite sheet, is usually provided as a .png image file. Such a format can't be animated directly; it only stores the keyframes for the animation. It is up to the programmer to extract the correct sprite image from the .png file and show it at the right moment on the screen. This technique is called slicing and animation.
From a technical perspective the best practice method is to divide the .png file first into a grid of 32x32 pixel cells and then use a loop to extract the correct position. For example, the getsprite() function is called with id=4, and then the source code determines the pixel position of this sprite in the .png file.
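As a sketch, the arithmetic behind such a getsprite() function is a division and a remainder; the sheet width of 8 columns is an invented assumption, and the function name follows the example above.

    # Slicing: map a sprite id to its pixel rectangle on the sheet.
    TILE = 32      # cell size in pixels
    COLUMNS = 8    # sprites per row on the sheet (assumed)

    def getsprite(sprite_id):
        """Return (x, y, width, height) of the sprite in the .png file."""
        row, col = divmod(sprite_id, COLUMNS)
        return (col * TILE, row * TILE, TILE, TILE)

    print(getsprite(4))  # (128, 0, 32, 32)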
Even though slicing is not documented very well, it is used frequently. Nearly all video games with animated characters need to slice and animate the sprites, otherwise the user won't see anything on the screen.
3.1 3a Example: Walking robot
Suppose the idea is to program a robot which can walk from left to right, and the robot should also jump over an obstacle. What is the best technique for programming such a system? Before explaining how to do so, let us take a look into the past at how robotics programmers in the 1990s would have solved the task.
The assumption was to go from the problem towards the solving algorithm. The problem was described in the introduction: "create a walking robot". The appropriate algorithm would be a search in the state space. That means the algorithm has to make sure that the robot doesn't lose its balance, and it has to reach the goal on the right. From the technical side, such an algorithm would send random signals to the servo motors and, if certain goals are reached, apply the trajectories to the real robot.
At first glance such an approach looks reasonable, but it can't be implemented in reality. The reason is that no efficient search algorithms are known for this problem, the existing computing hardware is too slow, and in general there are many unsolved detail issues. The chance is high that the resulting robot won't walk forward but will lose its balance very quickly.
It is hard to recognize how to overcome these obstacles, because the answer has nothing to do with using a certain algorithm or programming the robot in a certain way. The answer is to describe the situation from a more abstract perspective.
The first thing to ask is whether the original task of programming a biped robot makes sense at all. Of course it doesn't. The new insight is that this sobering conclusion doesn't result in not programming such a robot; the conclusion is to reformulate the problem. Instead of searching for an algorithm within the existing problem description, the idea is to improve the original question first, so that it can be solved much more easily.
The original problem was about a walking robot. This description is very hard to realize. The more elaborate task would be to find a trajectory in a list of keyframes. The keyframes are given in advance: there are keyframes for walking forward and for jumping, and the robot has to search this database for a correct sequence. Such a new domain description can be solved much more easily, even with a hand-coded algorithm. And again, if the resulting robot doesn't work, the bottleneck is not the robot itself but a problem description that wasn't precise enough.
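A minimal sketch of the reformulated problem might look like this. The keyframe names, joint values and the lookup table are illustrative assumptions, not data from a real robot:

```python
# Instead of balancing a robot in state space, search a keyframe database
# for a trajectory. All values below are invented for illustration.
KEYFRAMES = {
    "walk1": [0.0, 0.4, -0.2],   # joint angles per keyframe (assumed)
    "walk2": [0.1, 0.2, -0.1],
    "jump1": [0.3, 0.8, -0.5],
    "jump2": [0.2, 0.9, -0.4],
}

def plan(actions):
    """Expand a high-level action list into a keyframe trajectory."""
    table = {"walk": ["walk1", "walk2"], "jump": ["jump1", "jump2"]}
    trajectory = []
    for action in actions:
        trajectory += table[action]
    return [(name, KEYFRAMES[name]) for name in trajectory]

print(plan(["walk", "walk", "jump", "walk"]))
```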
Before it is possible to find an algorithm or even implement it in a programming language, the problem has to be defined in a certain way. It doesn't make sense to try to solve problems which are obviously unsolvable. So the art is to invent problems which are easy to solve.
Let us try to describe in general what an easy-to-solve robotics problem is. If the problem description contains the words "teleoperation" and "database driven", then it can mostly be solved. Teleoperation means that no real AI is utilized but the human operator remains in the loop. And database driven means that the model is specified in a textual format, for example a map, a sprite sheet, or a body pose database.
Such problems are much easier because computer programmers know in advance how to create such systems. A teleoperated robot is mostly realized with GUI interfaces and joysticks, while database-driven models are created with SQLite and JSON databases. If a robot problem can be transformed into such a category, then it has a high probability of success.
3.2 3b Database-driven teleoperation
Let us give a concrete example of creating a robot control system. The idea is that there is a joystick for the human operator and also a database storing the body poses of the robot. The human operator selects one of the keyposes with the joystick, and this interface allows him to solve any task. Such a system lacks two things: there is no autonomous robot, and there is no AI algorithm. This sounds a bit unusual because, at least in the past, it was assumed that both are what artificial intelligence means. But robot control can be realized much more easily than with an algorithm.
The idea is to create some sort of interactive model. That means the model is used to simplify the teleoperation of the robot. From a user's perspective it is some sort of abstract GUI interface. The human operator can click on a handful of buttons, and this results in movements of the robot.
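A hedged sketch of such a system is shown below. The pose names, joint values and button assignments are illustrative assumptions; a real system would read actual joystick events and send the targets to servo controllers:

```python
# Database-driven teleoperation: joystick buttons select body poses
# from a small pose database. All values are invented for illustration.
POSES = {
    "stand": [0.0, 0.0, 0.0, 0.0],
    "sit":   [0.9, -1.2, 0.9, -1.2],
    "wave":  [0.0, 0.0, 1.5, 0.3],
}
BUTTON_MAP = {0: "stand", 1: "sit", 2: "wave"}  # joystick button -> pose

def on_button_press(button_id):
    """Translate a button press into servo targets for the robot."""
    pose_name = BUTTON_MAP[button_id]
    joint_targets = POSES[pose_name]
    print(f"send to servos: {pose_name} -> {joint_targets}")

on_button_press(2)  # operator presses button 2, robot waves
```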
3.3 3c Data-driven animation models
Implementing an abstraction mechanism can be explained with the example of computer animation. This domain has the advantage that the end user sees something on the screen, which proves whether the abstraction mechanism is working or not. In most cases the assumption is that model-based simulation is equal to writing a computer program, similar to a video game. But it is a very complicated task to invent such models from scratch.
The easier-to-implement approach is a data-driven model. [DeSmedt2017][Lee2002] The idea is to use motion capture or to draw the animation sheet by hand and avoid programming altogether. The resulting model consists of pictures or numbers, but it doesn't contain executable code.
Somebody may argue that such a model is useless because it doesn't allow running the simulation. It is not possible to execute a .txt file or a mocap recording from the command line, so the task of converting a domain into a machine-readable model seems to have failed. What is missing in this pessimistic understanding is that it is much easier to convert a data-driven model into a computer program than to create such a program from scratch.
To understand why such a task is easy, let us give an example. Suppose there is a database with 20 keyframes. The database is a simple .CSV file which has 20 lines of text. Suppose we ask a game programmer whether he can create an animation program for the CSV file. The answer is yes. What he will do, perhaps, is parse the CSV file with a programming language of his choice, and in the next step convert the numerical information into graphics which are rendered to the screen.
The overall task needs a bit of programming effort, but it can be realized with standard programming techniques. The resulting program is of course a real model, because it can be executed on a computer. So we can say that it is pretty easy to convert data models into executable simulations. And for this reason, there is no real difference between a data-only file and an executable model.
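A sketch of this conversion step might look as follows. The CSV content and column names are assumptions; a real program would open an actual file and render the frames instead of printing them:

```python
# Turning a data-only model into an executable one: a CSV of keyframes
# is parsed and "played back". The data below is invented for illustration.
import csv, io

CSV_DATA = """frame,hip,knee,ankle
0,0.0,0.1,0.0
1,0.2,0.4,-0.1
2,0.4,0.2,-0.2
"""

def load_keyframes(text):
    """Parse the CSV text into a list of numeric keyframes."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: float(v) for k, v in row.items()} for row in reader]

for frame in load_keyframes(CSV_DATA):
    # A real program would render graphics here; printing stands in.
    print(frame)
```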
3.3.1 3c1 Text to animation models
Most newbies assume that artificial intelligence is realized either in a programming library or in an algorithm which has to be implemented in software. The surprising thing is that artificial intelligence can be realized differently. Instead of asking how to solve a certain problem, the more interesting question is: which sort of problem is interesting from the perspective of artificial intelligence?
One famous example is the problem of converting text into an animation. Even though the problem was discussed already in two famous projects (SHRDLU and AnimNL), a short description of the idea should be given. First the user enters a string of text into the edit box, e.g. "A man walks from left to right"; after pressing the animate button, this text is converted into a movie clip which is shown on the screen. Somebody may argue that this problem is not very interesting, because computer animation is a well-known domain in computing and at least 100 different programs are able to create animations. What is new and exciting is the idea of creating animations with text as the input information.
Natural language is an example of an abstract perspective. So a text-to-animation engine is a program which converts an abstract input into a detailed output. This sort of problem has a lot to do with artificial intelligence. The SHRDLU and AnimNL projects were mentioned already; both use a very powerful but also complicated strategy to solve the challenge. Let us assume the source code for these projects was deleted and the original papers are gone as well. How would one write a text-to-animation system from scratch? In Section 2.4.3 a short outlook was given: the keyframes are stored in a database, each keyframe is linked with a number, and if the user enters the number the keyframe is shown again on the screen.
This is the basic strategy for creating text-to-animation systems; how to make the interaction more pleasant is only a detail problem. What every text-to-animation system has in common is that in the background the keyframes are given in a database, and the parser searches for the correct clips and combines them into a movie sequence.
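A hedged sketch of this basic strategy is given below. The vocabulary and clip names are illustrative assumptions; a real parser would be far more elaborate:

```python
# Basic text-to-animation strategy: keyframe clips live in a database,
# a trivial parser maps words to clips. All names are invented.
CLIPS = {
    "walks": ["walk1", "walk2", "walk1", "walk2"],
    "jumps": ["jump1", "jump2"],
}

def text_to_animation(sentence):
    """Return a keyframe sequence for the known words in the sentence."""
    sequence = []
    for word in sentence.lower().split():
        sequence += CLIPS.get(word, [])  # words without a clip are ignored
    return sequence

print(text_to_animation("A man walks from left to right and jumps"))
# -> ['walk1', 'walk2', 'walk1', 'walk2', 'jump1', 'jump2']
```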
3.4 3d Short notation for teleoperation control
To create a robot control system, the first thing to do is to invent an abstraction mechanism. For the domain of a self-driving car a possible taxonomy would be:
| | category | a | b | c |
|---|---|---|---|---|
| 1 | speed | stop | slow | medium |
| 2 | velocity | slower | unchanged | faster |
| 3 | lane keep | left | center | right |
| 4 | distance to front car | near | far | |
A possible example game state would be: [1c, 2b, 3b, 4b]. This short notation has a certain meaning which is provided in the table. All possible world states can be mapped to such a short notation.
The short notation provided in the table acts as a model. It can be used to annotate an existing car which is doing something, or it can be used to control a car via teleoperation. The idea is that the short notation acts as an abstraction mechanism between the domain and the computer program.
Creating such a notation and improving an existing one works in a hierarchical fashion. That means the programmer invents some category terms like speed and velocity, and these words are ordered in a graph. Then possible subcategories like "speed slow" are imagined, which allow real-world situations to be clustered.
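The taxonomy from the table can be made machine readable along the following lines. Storing it as JSON is the approach the text describes; the exact schema below is an assumption:

```python
# The self-driving-car taxonomy as a data-only model. The structure
# mirrors the table above; the schema is one possible choice.
import json

TAXONOMY = {
    "1": {"category": "speed",
          "a": "stop", "b": "slow", "c": "medium"},
    "2": {"category": "velocity",
          "a": "slower", "b": "unchanged", "c": "faster"},
    "3": {"category": "lane keep",
          "a": "left", "b": "center", "c": "right"},
    "4": {"category": "distance to front car",
          "a": "near", "b": "far"},
}

def expand(code):
    """Translate a short code like '1c' into its long description."""
    row, col = code[:-1], code[-1]
    entry = TAXONOMY[row]
    return f"{entry['category']} = {entry[col]}"

print([expand(c) for c in ["1c", "2b", "3b", "4b"]])
# -> ['speed = medium', 'velocity = unchanged', ...]
print(json.dumps(TAXONOMY, indent=2))  # what the .json file would hold
```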
Now let us describe how the teleoperation works. The human operator does not move the steering wheel, nor does he determine the desired speed in miles per hour; instead, he selects the button with the short notation. For example, he can choose whether the car should drive on the left of the lane, in the center, or on the right.
Somebody may argue that such a user interface is not natural, and indeed it is different from controlling the car directly. The reason is that a computer needs such a model in the loop to interpret human actions and to control a robot as well. Without a model (which can become complicated) it is not possible for a computer to grasp the meaning of a certain action.
The advantages of a short notation are obvious, so it makes sense to explore the idea further. In the case of autonomous driving it is possible to extend the notation a bit. For example, it is possible to label a map with the desired speed: on a straight line the car is allowed to drive at medium speed, while at a crossing only slow speed is allowed. Another possible improvement would be to take care of traffic lights or priority rules.
| | category | a | b | c |
|---|---|---|---|---|
| 5 | traffic light | green | yellow | red |
| 6 | can drive | yes | no | |
The resulting table is some sort of hierarchical feature map in which the domain of car driving is mapped to a short notation. An event like [6a] can be translated to "the car is allowed to drive with priority".
But let us try to describe the concept from a more abstract perspective. The idea is to create a model. The model is not formulated in a programming language like Python or C/C++ but is stored in a JSON table. So the model is some sort of database which holds the mapping from a short notation like [6a] to a long description. This mapping acts as a language between the domain and the computer. A possible statement from a computer's perspective would be:
The car moves with 1c. There is a 5a event, so the driver decides on 2b. The distance to the other car is 4b.
For somebody who is not familiar with the short notation, this sentence makes no sense. But if the long descriptions for 1c, 2b and so on are known, the sentence describes in detail what the current situation is. For a computer it is very easy to process short notations: they can be used in if-then statements and stored in lists. It is even possible to calculate statistics over the short notation, because the situation at every millisecond is known. That means after a few seconds there is a large amount of data available, formatted in the short notation.
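As a small illustration of such processing, the log entries below are invented; only the codes themselves come from the tables above:

```python
# Processing a short-notation log: if-then checks and simple statistics.
from collections import Counter

log = ["1c", "5a", "2b", "4b", "1c", "1b"]   # assumed observation log
stats = Counter(code[0] for code in log)     # events per category number
if "5a" in log:                              # if-then on the notation
    print("green light observed")
print(stats)                                 # e.g. Counter({'1': 3, ...})
```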
3.4.1 3d1 Taxonomy for short notations
A short notation is equal to a feature, and all the features combined are written down in a taxonomy. From a technical perspective this is a database stored in a JSON file. This JSON file is used as a model for describing a domain. The idea is that only the elements in the model can be seen in reality, so the model limits the perception of the robot.
It remains an open problem which features exactly are utilized to describe a certain domain. In the easiest case the hierarchical data model is drawn as a mind map by the user. The advantage is that no programming at all is needed to create such a taxonomy; the only task is to create a JSON file for a certain domain. The JSON file contains some features which are ordered hierarchically. Each of the features has a name, some elements and of course an ID. This ID is equal to the short notation. The short notation is some sort of language which can be utilized to talk about the domain.
The interesting point is that the long description formulated in natural language is only relevant for humans. In theory it can be written in a comment section of the JSON file. What the computer needs is only the short notation. That means log files and possible AI algorithms are all formulated in the short notation. For the computer it is the only language available.
More experienced robotics programmers will argue that a model which contains only data but no source code is useless for controlling a robot, because a robot needs a program written in Python or another programming language. This is only partly true, because around a data model it is possible to create algorithms more easily. Suppose there is a data model which has 6 features mapped to a short notation. On top of this data model it is possible to write Python statements with if-then commands: for example, if the robot is in state a, it has to do action b, where a and b were of course defined in the model first. It is also possible to search the database for a certain situation. Search requires an algorithm formulated in a programming language as well, but all these algorithms can be created easily, because it is known what the objective is.
In contrast, programming a robot without a short notation is very complicated because it remains unclear what exactly the AI algorithm has to do next.
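A sketch of such rules written on top of the data model could look like this. The states and actions are codes from the car taxonomy above; the rule table itself is an illustrative assumption:

```python
# If-then rules over the short notation. The codes reference the car
# taxonomy; which rule maps to which action is assumed for illustration.
RULES = {
    "5a": "2c",  # green light -> faster
    "5c": "1a",  # red light   -> stop
    "4a": "2a",  # car near    -> slower
}

def decide(observed_codes):
    """Return the first action triggered by the observed state codes."""
    for code in observed_codes:
        if code in RULES:
            return RULES[code]
    return "2b"  # default: keep velocity unchanged

print(decide(["3b", "4a"]))  # -> '2a' (slow down, front car is near)
```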
3.4.2 3d2 Inventing a notation from scratch
According to Wikipedia there are many existing notation formats, for example therbligs for describing working activities, Labanotation for dance movements, or the Aresti aerobatic symbols used to express flight maneuvers. Most of these notations have a long history and are poorly documented, so it makes sense to reinvent the idea from scratch.
The shared similarity between all these notations is that they are formulated as a taxonomy and that they contain textual information but no computer code. From a programmer's perspective a notation is an ASCII file, or more elegantly a JSON file, which holds domain-specific information. In a figure, the notation is equal to a hierarchical graph which consists of nodes.
The notation provides an abstraction mechanism for talking about the subject. It is a code which maps the important actions and events to a short notation. From a more computer-oriented perspective, each such notation can be stored in a 10 kB JSON file. No matter whether the domain is dancing, chess playing, finger movements or bricklaying, it is always equal to a small text-only JSON file which is called the model.
The good news is that there is no need to use a programming language such as Java in the model. There is also no need to create large datasets with millions of entries; the notation remains very compact.
The disadvantage is that such a notation is always created manually. That means somebody has to understand the domain very well and must be able to convert the important elements into a notation format.
A notation is some sort of communication tool. It allows actions to be perceived and it allows actions to be commanded. In the example of chess playing this interaction is used frequently by the players. Somebody may notice that a player has moved a piece from a2 to a3. Another sensible interaction would be to advise a player to make the "d2 to d4" move because this will improve the situation drastically.
The notation is used in a communication process about a domain which is understood by all the participants. The interesting situation is that sometimes the notation is not accurate enough. For example, if a body pose taxonomy contains only two keyframes, sitting and standing, then it is hard to describe an action like jumping, because jump is not available in the model. In reality such an action is valid, so we have to ask whether the model is accurate enough.
3.4.3 3d3 Domain specific grounding with feature trees
The grounding problem is about a mapping table from a label to its meaning. It has much in common with a codebook, which is a lookup table for coding and decoding information. [Shirahama2017] A codebook is some sort of dictionary which defines how to speak about a subject.
For inventing such a short notation from scratch it makes sense to use a hierarchical approach. A feature tree consists of nodes which are divided into subclasses. [Grosky1994][Reddy2009] In the end, all possible codes are assigned to a word or a phrase. Such a table acts as an intermediate layer between the robot and the domain. It is a grounded world model and provides an abstraction mechanism. Everything which is not given in the dictionary is not part of the reality.
The existence or absence of a feature tree has a great impact on how a robot is able to interact with the world. A robot which uses a grounded model will understand a command like "c2" easily: according to the lookup table, c2 stands for "go to waypoint 2". In contrast, a robot which doesn't have a model will interpret "c2" simply as a 2-byte string which has no meaning at all. Such a robot will print "syntax error" to the screen.
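This lookup behaviour is simple enough to sketch directly. The c2 entry is the one given in the text; the c1 entry is an assumed sibling:

```python
# Grounding via a codebook: known codes map to a meaning, everything
# else is rejected, exactly as described above.
CODEBOOK = {
    "c1": "go to waypoint 1",  # assumed sibling entry
    "c2": "go to waypoint 2",  # entry from the text
}

def interpret(command):
    """Ground a short code; unknown strings have no meaning."""
    if command in CODEBOOK:
        return CODEBOOK[command]
    return "syntax error"

print(interpret("c2"))  # -> 'go to waypoint 2'
print(interpret("xx"))  # -> 'syntax error'
```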
It is correct that robots and computers in general are not able to understand the meaning of words. From a computer's perspective, natural language is similar to a string which has no meaning at all. But a dictionary makes it possible to understand words, because its internal structure connects a word to a command.
3.4.4 3d4 Notation as an abstraction mechanism
An abstraction mechanism is an intermediate layer between a robot and a domain. It acts as a model and simplifies the interaction. A possible example of an abstraction mechanism is a notation. A notation is some sort of table which maps a short code to a meaning. For example, the code k1 stands for keyframe 1 and contains the numerical joint values for this keyframe.
The advantage of a notation is that it can be written in a script: the programmer or the user can type in a sequence like [k1,k3,k1], and this stands for a short animation. Thanks to the notation table, the computer knows how to expand each code into a full keyframe.
In general the importance of a notation remains unrecognized in the domain of artificial intelligence. One reason might be that from a technical perspective the principle is trivial. A notation can be visualized as a graph and implemented as a simple table, so compared to other programming techniques it is very easy to realize. The interesting situation is that with a notation in the loop many robotics problems can be solved elegantly. Instead of trying to program a robot which does something in reality, the idea is that the robot has to execute a notation script, and it is up to the programmer that the script makes sense.
A notation has much in common with a teleoperation interface. It doesn't think on its own, but it translates user input into robot movements.
3.4.5 3d5 Notation in music sequencers
Music has used notation for a long time. Musical notes are short symbols which reference a tone, while music sequencers use notation for sampling music. The idea is that a keypress triggers a note, and this note activates a recorded sound from the DRAM of the instrument.
Automatic drum machines have been available since the 1980s, and they have simplified the overall process dramatically. A drum machine has much in common with a computer program. There is no need to press the keys manually; the sequence is played back from memory, and each tone in the song activates the underlying signal. This allows the human operator to take an abstract point of view.
Drum machines are used today mostly for music production but not for creating dance movements in ballet. In theory the principle can be adapted: a MIDI signal can be converted into Labanotation which triggers an animation. But the idea is very new and not yet available in the mainstream.
Let us try to imagine what a drum computer for body movement would look like. The composer sits at a keyboard with 100 keys. Each key is assigned to a body pose. For creating a longer sequence the composer has to press the keys in the correct order. For example, a walking cycle is produced with the keys [10,11,12,13], while a jumping animation is located at the keys [50,51,52,53]. Suppose the idea is that the character walks forward for 3 seconds, then jumps, and then walks for another 2 seconds. This demand can be realized by pressing the keys in a certain sequence. Then the motion is recorded and played back.
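A very rough sketch of this imagined device follows. The key assignments come from the example above; assuming one walking cycle per second is a simplification, and real timing would need a clock:

```python
# Imagined "drum computer for body movement": each key triggers a body
# pose; a recorded key sequence is played back.
WALK = [10, 11, 12, 13]    # keys of the walking cycle (from the text)
JUMP = [50, 51, 52, 53]    # keys of the jumping animation (from the text)

def record(seconds_before, seconds_after):
    """Build a key sequence: walk, jump once, walk again.
    Assumes one walking cycle per second for simplicity."""
    return WALK * seconds_before + JUMP + WALK * seconds_after

recording = record(3, 2)   # 3 s walking, one jump, 2 s walking
for key in recording:      # playback: each key triggers a stored pose
    print(f"trigger pose for key {key}")
```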
4 4 Model based understanding of meaning
Grounding is about interpreting meaning. For example, the word "dog" has a certain meaning. The problem is that only humans are able to interpret natural language, while a computer understands only numerical values. There is a workaround available to program computers so that they will understand meaning: a model allows a notation to be parsed, and then a meaning becomes visible.
Let us give an example. There is a motion capture recording which contains dance movements. A model allows action recognition on the mocap data. The computer can recognize that at first the dancer is standing still, then he moves forward, and then he jumps. The meaning is recognized by translating the numerical positions of the mocap markers into the predefined strings "stand", "forward" and "jump".
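A minimal sketch of this translation step is given below. The thresholds and the sample marker data are illustrative assumptions; real action recognition is considerably more involved:

```python
# Action recognition on mocap data: raw marker values are mapped to the
# predefined labels from the model. Thresholds and data are invented.
def label_frame(x_velocity, height):
    """Translate raw marker values into one of the model's labels."""
    if height > 1.0:
        return "jump"
    if x_velocity > 0.1:
        return "forward"
    return "stand"

mocap = [(0.0, 0.0), (0.3, 0.0), (0.2, 1.4), (0.0, 0.0)]  # (vx, height)
print([label_frame(vx, h) for vx, h in mocap])
# -> ['stand', 'forward', 'jump', 'stand']
```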
The interpretation of meaning has to do with translating symbols into other symbols. Somebody might argue that it is useless to convert a symbol like "a2" into a symbol like "TT12". Both symbols are useless by themselves, because they are nothing but strings stored in the computer's memory. This is only partly true, because the translation is always the same and it follows the rules of a notation. The notation is the model which defines the purpose of the translation. For example, a dance taxonomy contains symbols and rules for dance movements, while a grasping taxonomy contains words and poses for grasping actions.
A well-established notation is the Unicode table. Thanks to the Unicode format it is possible to print any character on a computer screen. The Unicode system itself is trivial, because it is only a long table which stores the code for every character. The interesting point is that this translation can be utilized for improving the interaction with a computer. The idea is to extend the principle to other domains, especially robotics. Suppose there is a dance notation, a facial expression notation and a biped walking taxonomy. Then it is possible to generate the servo motor actions for a robot so that the machine can walk.
4.1 4a Abstraction with classification
Suppose there is a motion retrieval system which is aware of only 2 notations: [a1,a2]. "a1" simply stands for action 1 and means that the character on the screen is in the upward position, while a2 references "sitting down". Describing such a system from a programming perspective is not very hard: the user sends a command like "a1" to the system and then the robot on the screen does something. The harder task is to describe what the underlying principle is about.
A mapping mechanism is sometimes introduced as classification. It is equal to providing some classes in advance (here [a1,a2]); then it is possible to assign observations to them. Let us take a closer look at how such a system works. A parser first has to check whether a command is available in the list. A nonsense command like "c1" is rejected because it is invalid. If a statement is found, then the parser has to decide which class exactly was addressed. The classification principle is so common that it is not perceived as a problem. But before artificial intelligence can be realized, it is important to know how it works.
According to the figure "classifier", the system recognizes only a1 and a2 as symbols. All other possible actions have no meaning. Basically, grounding is equal to coloring some of the cells.
4.2 4b Hierarchical Topic Mining
The main problem with the grounding problem is that it remains unclear what exactly the problem is. Instead of trying to solve it, the first step is to define the challenge more precisely. A possible problem formulation is to annotate a given text with labels, where the labels reference an external model. The idea is visualized in the figure.
There is a sentence, "The quick brown fox jumps over the lazy dog.", and the idea is to provide the sentence with meaning by annotating it with references. Meaning in the sense of grounding is equal to creating the mapping from the input stream to an external model. The model is a very small taxonomy which contains only 3 entries; to be more precise, it is a list of categories. The grounding algorithm tries to create the matching. [Meng2020]
The surprising effect is that a grounded sentence can be treated differently from a normal string. Without the mapping to the model, the sentence can only be analyzed by itself. For example, an algorithm can search for a certain pattern like "brown" and print its exact position in the sentence.
In contrast, the grounded sentence can be analyzed with a statement like "show me all animals in the sentence". Even though the sentence doesn't contain the string "animal", it is possible to print out the correct answers, "fox" and "dog".
Let us try to describe the situation from the user's perspective. What is needed first is a category tree. A human operator has to create a taxonomy of all the important words. He has to make sure that the term dog is located in the animal category. With the help of such a tree the algorithm can parse any sentence automatically: it tries to match the input stream with the taxonomy and labels the found words.
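A hedged sketch of this grounding step, using the fox/dog example from above, could look like this. The tiny taxonomy is the assumption; a real one would be much larger:

```python
# Grounding via a category tree: known words of the sentence are mapped
# to their categories, then the annotation can be queried.
TAXONOMY = {
    "animal": ["fox", "dog"],
    "color":  ["brown"],
}
LOOKUP = {word: cat for cat, words in TAXONOMY.items() for word in words}

def annotate(sentence):
    """Map each known word of the sentence to its category."""
    words = sentence.lower().strip(".").split()
    return {w: LOOKUP[w] for w in words if w in LOOKUP}

grounded = annotate("The quick brown fox jumps over the lazy dog.")
print([w for w, cat in grounded.items() if cat == "animal"])
# -> ['fox', 'dog']  even though "animal" never occurs in the sentence
```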
Grounding is equal to finding a mapping between two different information sources. A sentence formulated in natural language has no meaning by itself. Only if the words reference another file or database can a meaning be extracted.
The meaning of the words depends on the model. In the figure with the "quick brown fox", the underlying model can categorize only by animal and color words. A different table could provide the translation into another language and even attach a picture to each word. In such a case more details of the meaning are provided.
4.2.1 4b1 Cognitive map for a maze robot
Every robot has to maintain its game state. A game state is used for many applications, for example for Q-learning. One attempt at doing so is a cognitive map in a taxonomy structure. The figure shows an example for a maze robot.
The overall game state is subdivided into sections from a to e. The feature c1=80% has the meaning "battery level is at 80%". An interesting sub-element of the game state is location b. It stores the position of the robot on a map, so it has much in common with the mapping problem.
It should be mentioned that the game state, aka the cognitive map, is a data-only structure. Similar to a struct record known from the C programming language, it stores some values in memory. The main reason why it is important to create such a structure is that it helps to abstract from the domain. From the robot's perspective, the world is encoded in the game state. And apart from the predefined categories from a to e, there are no additional facts available about the world. The entire knowledge base consists of only 10 features which are grouped into 5 categories.
Let us take a closer look at how the cognitive map works together with the sensors and actuators of the robot. From the outside perspective, the sensor information has to be stored somewhere, so there is an interrupt which sends the sensor data to the cognitive map of the robot. On the other hand, the decision-making system of the robot needs to know the current sensor state of the system, so the module can send a request to the taxonomy, for example "request(d1)", to get the current angle. That means the hierarchical data structure itself has no active role; it acts as a middle layer between the domain and the robot.
The entire cognitive map can be stored in a simple 2D feature matrix. From a technical perspective there is no need to label the categories; the rows can simply be called a, b, c and so on. The resulting 2D array is the game state of the robot. It stores every important piece of information, is updated all the time, and the decision-making system constantly requests its content. For example, a rule like "if battery < 50 then search for docking station" needs the value from the feature matrix as input.
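A sketch of the cognitive map as a plain data structure might look as follows. The category letters follow the description above; which feature holds which value, and the request() helper, are assumptions for illustration:

```python
# Cognitive map as a data-only structure with a query interface.
# The battery rule is the one quoted in the text; values are invented.
game_state = {
    "b": {"b1": (4, 7)},   # location: position on the map
    "c": {"c1": 80},       # energy: battery level in percent
    "d": {"d1": 12.5},     # pose: current angle in degrees
}

def request(feature):
    """Decision making queries the map, e.g. request('d1')."""
    category = feature[0]
    return game_state[category][feature]

if request("c1") < 50:     # rule from the text
    print("search for docking station")
else:
    print(f"battery ok at {request('c1')}%")
```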
4.2.2 4b2 Limits of Graph databases
The number of projects which use a graph database in the context of robotics is low or even zero. The reason is that graph databases are an esoteric concept for storing information, and the combination with artificial intelligence doesn't make sense, at least at first glance. The preferred assumption is that robots don't need a database but an algorithm which produces intelligence.
The cause of the misconception has to do with the opposition between in-game AI and the environment. The self-understanding of an AI programmer is to create an intelligent robot for an existing environment: the game is already available, and the robot has to win the game.
But let us try to flip the roles. The idea is that the AI is equal to the environment, and the robot acts with a random generator. So the question is: how does the environment have to look so that the robot will win?
Programming a simulation or a game works on a different principle than programming robots. The interesting situation is that for game programming there is a need to use databases. The databases store the dialogue, or the sprite sheet for the character.
4.3 4c From annotation to abstraction
From a technical side it is not very hard to annotate a text or even a game log with additional information. The parser recognizes that a certain word belongs to a category, or that a certain game state is stored in the database. The harder question to answer is why somebody should care about this.
Annotation is an abstraction mechanism: it allows a complex situation with millions of possible states to be converted into a simpler model. The model which is used for annotation reduces the state space drastically. On top of this model a cost function can be created much more easily, which reduces the state space further. [Fernandez-Madrigal2004][Guerra-Filho2006] Let me give an example.
Suppose there is a biped robot which has 40 joints. All the joints can change over the time axis, so an endless number of situations is possible. After matching the kinematic chain with a body pose database, the number of possible states is reduced to only 20. The model defines that only a small number of categories and subcategories is possible, and all the millions of possible joint parameters have to fit into this simplified taxonomy.
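The matching step can be sketched as a nearest-neighbor lookup against the pose database. The sizes are reduced here (3 joints, 4 poses instead of 40 and 20) and all values are invented, to keep the example short:

```python
# State-space reduction: a high-dimensional pose is matched to the
# nearest entry of a small pose database. All values are illustrative.
POSE_DB = {
    "stand": [0.0, 0.0, 0.0],
    "walk":  [0.2, 0.4, -0.1],
    "sit":   [0.9, -1.2, 0.9],
    "jump":  [0.3, 0.8, -0.5],
}

def nearest_pose(joints):
    """Reduce millions of joint states to one of the database labels."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(POSE_DB, key=lambda name: dist(POSE_DB[name], joints))

print(nearest_pose([0.25, 0.5, -0.15]))  # -> 'walk'
```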
Even a model is sometimes hard to understand, so there is a need to reduce the state space further to a single numerical value: the cost in the range [0;1]. This value gives feedback on whether a situation is good or not and reduces the state space further. [Boularias2015] Creating such a cost function on top of an annotated dataset is much easier than doing so without a model in the loop. The reason is that the model has a smaller number of possible states, which can be checked with some rules.
5 Bibliography
- Boularias, Abdeslam, et al. "Grounding spatial relations for outdoor robot navigation." 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015. https://ieeexplore.ieee.org/document/7139457
- Braun, Martin, et al., eds. Differential equation models. Vol. 1. Springer-Verlag, 1983. https://link.springer.com/book/10.1007/978-1-4612-5427-0?noAccess=true
- De Smedt, Quentin, et al. "Shrec'17 track: 3d hand gesture recognition using a depth and skeletal dataset." 3DOR-10th Eurographics Workshop on 3D Object Retrieval. 2017. https://hal.archives-ouvertes.fr/hal-01563505/document
- Fernandez-Madrigal, Juan-Antonio, Cipriano Galindo, and Javier Gonzalez. "Assistive navigation of a robotic wheelchair using a multihierarchical model of the environment." Integrated Computer-Aided Engineering 11.4 (2004): 309-322. https://content.iospress.com/articles/integrated-computer-aided-engineering/ica00184
- Grosky, William I., and Jiang Zhaowei. "Hierarchical approach to feature indexing." Image and Vision Computing 12.5 (1994): 275-283.
- Guerra-Filho, Gutemberg, and Yiannis Aloimonos. "Human activity language: Grounding concepts with a linguistic framework." International Conference on Semantic and Digital Media Technologies. Springer, Berlin, Heidelberg, 2006. https://link.springer.com/chapter/10.1007/11930334_7
- Kirk, James, Aaron Mininger, and John Laird. "Learning task goals interactively with visual demonstrations." Biologically Inspired Cognitive Architectures 18 (2016): 1-8. https://scholar.google.com/citations?user=51Wqo6UAAAAJ&hl=th
- Kopp, Oliver, Anita Armbruster, and Olaf Zimmermann. "Markdown Architectural Decision Records: Format and Tool Support." ZEUS. 2018. http://www2.informatik.uni-stuttgart.de/zdi/buecherei/NCSTRL_listings/projekt/IC4F.html.en
- Lee, Jehee, et al. "Interactive control of avatars animated with human motion data." Proceedings of the 29th annual conference on Computer graphics and interactive techniques. 2002. http://graphics.cs.cmu.edu/projects/Avatar/avatar.pdf
- Lehtonen, Miro, Nils Pharo, and Andrew Trotman. "A Taxonomy for XML retrieval use cases." International Workshop of the Initiative for the Evaluation of XML Retrieval. Springer, Berlin, Heidelberg, 2006. https://inex.mmci.uni-saarland.de/static/proceedings/INEX2006-preproceedings.pdf
- Meng, Yu, et al. "Hierarchical topic mining via joint spherical tree and text embedding." Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2020. https://dl.acm.org/doi/abs/10.1145/3394486.3403242
- Reddy, Kishore K., Jingen Liu, and Mubarak Shah. "Incremental action recognition using feature-tree." 2009 IEEE 12th international conference on computer vision. IEEE, 2009.
- Shah, Nigam H., et al. "Ontology-driven indexing of public datasets for translational bioinformatics." BMC bioinformatics. Vol. 10. No. 2. BioMed Central, 2009. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-S2-S1
- Shirahama, Kimiaki, and Marcin Grzegorzek. "On the generality of codebook approach for sensor-based human activity recognition." Electronics 6.2 (2017): 44. https://www.mdpi.com/2079-9292/6/2/44
- Stackoverflow2022, "slicing sprite sheet site:stackoverflow.com", requested on April 6, 2022
- Watkins, Ryan. Procedural content generation for unity game development. Packt Publishing Ltd, 2016. https://www.packtpub.com/product/procedural-content-generation-for-unity-game-development/9781785287473
- Webber, Bonnie, and Barbara Di Eugenio. "Free adjuncts in natural language instructions." COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics. 1990. https://aclanthology.org/C90-2068/
- White, David A., and Ramesh C. Jain. "Similarity indexing: Algorithms and performance." Storage and Retrieval for Still Image and Video Databases IV. Vol. 2670. International Society for Optics and Photonics, 1996. https://www.spiedigitallibrary.org/conference-proceedings-of-spie/2670/0000/Similarity-indexing-algorithms-and-performance/10.1117/12.234810.short?SSO=1