Robotics and Artificial Intelligence: October 2024

October 31, 2024

Chronological notetaking with problems

The easiest way to take notes is to append new information at the end. There is no need for a ring binder or note cards; a normal notebook works fine. A possible entry would be “date: March 10, headline, body of text”. The note is written down on paper and can be read again at any time.

There is a major problem with this note-taking technique: chronological note-taking will mess up the notes in the long run. Notes about mathematics will follow notes about history and language. The only sort order available is the date, which makes it hard to read the notes again. Let me give an example.

Suppose the user has written down multiple notes on different subjects over a period of two weeks. Now he would like to retrieve all the notes about math. The problem is that the math notes are scattered across different sections of the notebook. Some math notes were written down on March 10, while others are from March 16, and the notes in between have nothing to do with prime numbers but belong to other subjects.

Even if the notes are technically well written, it is hard to retrieve them because they are sorted only by date and not by topic. Sorting notes by topic requires a different principle than simply adding notes at the end. Some sort of ring binder or Zettelkasten is needed. With such a tool it is possible to add new math notes directly after the old math notes.

The main advantage is that it is much easier to read the notes again. All the math notes are collected together. The user doesn't have to read the entire notebook, only the math section. This makes it likely that the notes are consulted multiple times. The disadvantage of ring binders and card catalogs is that creating the notes is more demanding. Before new notes can be added to the system, the correct position has to be located, which slows down the writing process.
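The difference between the two filing principles can be sketched in a few lines of Python. This is only an illustration: the notebook entries, the topic labels, and the function names `find_by_scanning` and `file_by_topic` are made up for the example.

```python
from collections import defaultdict

# Hypothetical notebook entries in the order they were written:
# (date, topic, text).
notebook = [
    ("March 10", "math", "prime numbers"),
    ("March 11", "history", "Roman empire"),
    ("March 13", "language", "irregular verbs"),
    ("March 16", "math", "prime factorization"),
]

def find_by_scanning(entries, topic):
    """Chronological notebook: every entry must be read to find one topic."""
    return [entry for entry in entries if entry[1] == topic]

def file_by_topic(entries):
    """Ring binder: each note is filed behind the older notes of its topic."""
    binder = defaultdict(list)
    for date, topic, text in entries:
        binder[topic].append((date, text))
    return binder

binder = file_by_topic(notebook)
math_notes = binder["math"]  # the math section can be opened directly
```

The extra work happens in `file_by_topic`, which mirrors the slower writing process; reading the math section afterwards no longer requires going through the whole notebook.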

October 22, 2024

Pipeline for developing chatbot-based robotics

Since large language models became widely available around 2023, the term "artificial intelligence" has taken on a very clear meaning. AI simply means that the user interacts with a chatbot in natural language. The chatbot can answer questions, draw pictures, and also control a robot.

From the user's perspective, the software is surprisingly easy to explain. The user enters a sentence like "open left hand" and submits the text to the chatbot. The chatbot parses the sentence and executes the command, which means that the robot does indeed open its hand. More complex actions, like moving to a table and washing all the dishes, are the result of more advanced text prompts that contain a list of sub-actions.

The unsolved issue is how to program such advanced chatbots in software. Before a user can talk to a robot this way, somebody has to program the chatbot first, which can be done in C++, Java, and so on. The basic element of any chatbot isn't a particular programming library like nltk, and it isn't a particular operating system like Windows or Unix. The needed building block is a dataset or, to be more specific, a dataset that maps language to perception and language to action.

Such a mapping is realized with multiple columns because the dataset is always a table. In the simplest example, the table consists of images in the first column and nouns in the second column:

[picture1.jpg], apple
[picture2.jpg], banana
[picture3.jpg], table
[picture4.jpg], spoon

The chatbot software uses such a dataset to understand a text prompt from the user. For example, if the user types "take apple" into the text box, the word "apple" is converted into [picture1.jpg], and this picture makes it possible to find the apple with the camera sensor.
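A minimal sketch of this lookup in Python, assuming the dataset is small enough to hold in a dictionary. The names `NOUN_TO_IMAGE` and `ground_nouns` are invented for the illustration, and the actual image matching with the camera is left out.

```python
# Hypothetical perception dataset: noun -> reference image,
# taken from the four-row table above.
NOUN_TO_IMAGE = {
    "apple": "picture1.jpg",
    "banana": "picture2.jpg",
    "table": "picture3.jpg",
    "spoon": "picture4.jpg",
}

def ground_nouns(prompt):
    """Return (word, image) pairs for every known noun in the prompt."""
    words = prompt.lower().split()
    return [(word, NOUN_TO_IMAGE[word]) for word in words if word in NOUN_TO_IMAGE]

result = ground_nouns("take apple")
```

Here "take" is not in the noun table and is simply skipped, while "apple" resolves to its reference picture.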

More complex interactions, like generating entire motion data, are handled the same way. A verb like "grasp" is converted into a motion capture trajectory with the following table:

[trajectory1.traj], open
[trajectory2.traj], grasp
[trajectory3.traj], standup
[trajectory4.traj], sitdown
[trajectory5.traj], moveto

Let me give a longer example to make the point clear. Suppose the user enters the command "moveto table. grasp apple". This command sequence is converted into:
1. [trajectory5.traj], moveto
2. [picture3.jpg], table
3. [trajectory2.traj], grasp
4. [picture1.jpg], apple
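The conversion above can be sketched as a simple table lookup per word. This is an assumption about how such a parser might look, not a description of a specific system; `VERB_TO_TRAJECTORY`, `NOUN_TO_IMAGE`, and `parse_command` are hypothetical names.

```python
# Hypothetical action dataset: verb -> motion capture trajectory.
VERB_TO_TRAJECTORY = {
    "open": "trajectory1.traj",
    "grasp": "trajectory2.traj",
    "standup": "trajectory3.traj",
    "sitdown": "trajectory4.traj",
    "moveto": "trajectory5.traj",
}

# Hypothetical perception dataset: noun -> reference image.
NOUN_TO_IMAGE = {
    "apple": "picture1.jpg",
    "banana": "picture2.jpg",
    "table": "picture3.jpg",
    "spoon": "picture4.jpg",
}

def parse_command(text):
    """Convert a command sequence into a list of (resource, word) steps."""
    steps = []
    for word in text.lower().replace(".", " ").split():
        if word in VERB_TO_TRAJECTORY:
            steps.append((VERB_TO_TRAJECTORY[word], word))
        elif word in NOUN_TO_IMAGE:
            steps.append((NOUN_TO_IMAGE[word], word))
        else:
            # The word is not grounded in either table.
            steps.append((None, word))
    return steps

steps = parse_command("moveto table. grasp apple")
```

Words that appear in neither table are kept with `None` as their resource, so a later stage can decide whether to ignore them or to ask the user for clarification.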

In the next step of the parsing pipeline, the given JPEG images and .traj data are converted into search patterns and motion pipelines. This makes it possible to convert a sentence into robot actions.

There are multiple techniques available for programming a chatbot in detail. It is possible to use ordinary programming languages or more advanced deep neural networks. What these methods have in common is that they always require a dataset in the background. Somebody has to create a table with pictures of objects and annotate the pictures. A dataset with mocap data is needed as well. Such a dataset allows a chatbot to convert a short sentence into something meaningful. Meaning is equal to a translation task from a word into a picture, and from a word into a trajectory.

So we can say that a chatbot is the frontend of an AI, while the dataset is the backend.

October 11, 2024

Introduction to grounded language

Translated into German, the term means roughly "gelenkte Sprache" or "strukturierte Sprache", that is, guided or structured language. The idea is to describe reality with the help of a questionnaire in order to make it more machine-readable. Here is an example:

Suppose a traffic count is to be carried out. In the simplest case, one starts without much preparation and keeps a tally sheet. The better alternative is to design a form before counting, in which variables are defined that describe what exactly is being counted. The following variables are suitable for recording car traffic:
Direction of travel: from the left / from the right
Car color: black / blue / green / red / other
Vehicle type: car / truck / bus / other
Time of day

Such a counting form is far more detailed than a simple tally sheet because it yields many valuable details about road traffic. The statistical variables above are identical to grounded language. Natural language is used in such a way that a vector space is created within which things are then counted. At the end of the traffic count, one can say how many red cars came from the left or how many trucks used the road in total.
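A minimal Python sketch of such a counting form, assuming each passing vehicle is recorded as one row with the four variables. The observations and the field names (`direction`, `color`, `type`, `time`) are invented for the illustration.

```python
from collections import Counter

# Hypothetical observations, one filled-in form row per vehicle.
observations = [
    {"direction": "left", "color": "red", "type": "car", "time": "08:01"},
    {"direction": "right", "color": "black", "type": "truck", "time": "08:03"},
    {"direction": "left", "color": "red", "type": "car", "time": "08:04"},
    {"direction": "left", "color": "blue", "type": "truck", "time": "08:07"},
    {"direction": "right", "color": "red", "type": "bus", "time": "08:09"},
]

# Each combination of category values is one cell of the vector space;
# the tally counts how many vehicles fell into each cell.
tally = Counter((o["direction"], o["color"], o["type"]) for o in observations)

red_cars_from_left = tally[("left", "red", "car")]
trucks_total = sum(n for (_, _, vtype), n in tally.items() if vtype == "truck")
```

With these sample rows, `red_cars_from_left` is 2 and `trucks_total` is 2. A plain tally sheet would only yield the total of 5 vehicles.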

In statistics and sociology, such survey forms are well known and have been in use for decades. They serve as a tool to structure complex questions and capture them numerically. What is new, however, is that the same method can be used to prepare the problem of robotics and artificial intelligence in a machine-readable way.

Basically, a robot that is supposed to move through a maze only needs to be given a form that contains grounded language. With this form, the environment can be transferred into a symbolic vector space, that is, stored in a machine-readable, mathematical form. A category in the form such as "color=blue" can be either true or false. The tally sheet can contain a mark or not. This lets the robot keep a log and make decisions. It compresses the action space. Instead of querying sensors at the hardware level, the robot has a conceptual understanding of the environment. The result is a high-level sensor that uses the data of the questionnaire, including the language used in it.
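A rough sketch of such a form for a maze robot, written in Python. Everything here is an assumption made for illustration: the raw reading with `rgb` and `distance_cm`, the 30 cm threshold, the `wall_ahead` category, and the two actions are not taken from a real robot.

```python
def fill_form(reading):
    """Translate a raw sensor reading into true/false form entries."""
    r, g, b = reading["rgb"]
    return {
        "color=blue": b > r and b > g,              # blue is the dominant channel
        "wall_ahead": reading["distance_cm"] < 30,  # assumed threshold
    }

def decide(form):
    """Choose an action from the form alone, without touching raw sensor data."""
    return "turn" if form["wall_ahead"] else "forward"

# Hypothetical raw readings from a color sensor and a distance sensor.
readings = [
    {"rgb": (10, 20, 200), "distance_cm": 120},
    {"rgb": (200, 30, 20), "distance_cm": 15},
]

# The log is the robot's protocol: one filled-in form plus decision per step.
log = []
for reading in readings:
    form = fill_form(reading)
    log.append((form, decide(form)))
```

The decision function only sees the named categories, which is the sense in which the form acts as a high-level sensor: the many possible raw readings are compressed into a handful of true/false entries.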