October 22, 2024

Pipeline for developing chatbot-based robotics

Since the introduction of large language models in 2023, the term "artificial intelligence" has gained a very clear definition. AI simply means that the user interacts with a chatbot in natural language. The chatbot is able to answer questions, draw pictures, and even control a robot.

From the user's perspective, the software is surprisingly easy to explain. The user enters a sentence like "open left hand" and submits the text to the chatbot. The chatbot parses the sentence and executes the command, which means that the robot will indeed open its hand. More complex actions, like moving to a table and washing all the dishes, are the result of more advanced text prompts which contain a list of sub-actions.

The unsolved issue is how to program such advanced chatbots in software. Before a user can talk to a robot this way, somebody has to program the chatbot first, which can be realized in C++, Java, and so on. The basic element of any chatbot isn't a certain programming library like nltk, and it isn't a certain operating system like Windows or Unix; the needed building block is a dataset. Or, to be more specific, a dataset which maps language to perception and language to action.

Such a mapping is realized with multiple columns because the dataset is always a table. In the simplest example, the table consists of images in the first column and nouns in the second column:

[picture1.jpg], apple
[picture2.jpg], banana
[picture3.jpg], table
[picture4.jpg], spoon

The chatbot software uses such a dataset to understand a text prompt from the user. For example, suppose the user types "take apple" into the textbox. The word "apple" is converted into [picture1.jpg], and this reference picture allows the robot to find the apple with its camera sensor.
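
To make this concrete, here is a minimal sketch in Python. The dictionary is a hypothetical in-memory version of the image table above; the function name and the simple whitespace tokenizer are illustrative assumptions, not a fixed API:

# Hypothetical in-memory version of the image table above.
NOUN_TABLE = {
    "apple": "picture1.jpg",
    "banana": "picture2.jpg",
    "table": "picture3.jpg",
    "spoon": "picture4.jpg",
}

def resolve_noun(word):
    """Map a noun from the prompt to its reference image, if known."""
    return NOUN_TABLE.get(word)

# The prompt "take apple" contains the known noun "apple".
for token in "take apple".split():
    image = resolve_noun(token)
    if image is not None:
        print(token, "->", image)  # prints: apple -> picture1.jpg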

More complex interactions, like generating entire motion sequences, are handled the same way. A verb like "grasp" is converted into a motion-capture trajectory with the following table:

[trajectory1.traj], open
[trajectory2.traj], grasp
[trajectory3.traj], standup
[trajectory4.traj], sitdown
[trajectory5.traj], moveto
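
Verbs are resolved through the same kind of table. As a rough sketch, assume for illustration that a .traj file simply holds a list of joint-angle waypoints which the robot replays over time; real trajectory formats will differ:

# Hypothetical in-memory version of the verb-to-trajectory table.
VERB_TABLE = {
    "open": "trajectory1.traj",
    "grasp": "trajectory2.traj",
    "standup": "trajectory3.traj",
    "sitdown": "trajectory4.traj",
    "moveto": "trajectory5.traj",
}

# Assumed toy content of trajectory2.traj: joint-angle waypoints
# (shoulder, elbow, gripper in degrees) replayed in order.
grasp_waypoints = [
    [0.0, 45.0, 90.0],  # approach: gripper wide open
    [0.0, 50.0, 60.0],  # gripper starts closing
    [0.0, 50.0, 20.0],  # gripper closed around the object
]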

Let me give a longer example to make the point clear. Suppose the user enters the command "moveto table. grasp apple". This command sequence is converted into the list below; a small parser sketch follows it:
1. [trajectory5.traj], moveto
2. [picture3.jpg], table
3. [trajectory2.traj], grasp
4. [picture1.jpg], apple
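
Under the same assumptions as in the sketches above (both tables kept as plain Python dictionaries, naive tokenization on whitespace and periods), a minimal parser that produces exactly this sequence could look as follows:

NOUN_TABLE = {"apple": "picture1.jpg", "banana": "picture2.jpg",
              "table": "picture3.jpg", "spoon": "picture4.jpg"}
VERB_TABLE = {"open": "trajectory1.traj", "grasp": "trajectory2.traj",
              "standup": "trajectory3.traj", "sitdown": "trajectory4.traj",
              "moveto": "trajectory5.traj"}

def parse(prompt):
    """Convert a prompt into a list of (file, word) pairs."""
    sequence = []
    for word in prompt.replace(".", " ").split():
        if word in VERB_TABLE:
            sequence.append((VERB_TABLE[word], word))
        elif word in NOUN_TABLE:
            sequence.append((NOUN_TABLE[word], word))
        # unknown words are ignored in this toy sketch
    return sequence

print(parse("moveto table. grasp apple"))
# [('trajectory5.traj', 'moveto'), ('picture3.jpg', 'table'),
#  ('trajectory2.traj', 'grasp'), ('picture1.jpg', 'apple')]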

In the next step of the parsing pipeline, the given jpeg images and .traj data are converted into search patterns and motion pipelines. This makes it possible to convert a sentence into robot actions.
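
What happens in this step depends entirely on the robot hardware. As a rough sketch, assuming two hypothetical low-level routines, find_object() for the visual search and play_trajectory() for the motion playback, a dispatcher could walk through the parsed sequence like this:

def find_object(image_file):
    # Hypothetical: match the reference image against the camera
    # feed (e.g. template matching) to locate the object.
    print("searching camera image for", image_file)

def play_trajectory(traj_file):
    # Hypothetical: load the stored waypoints and send them
    # to the motor controllers.
    print("replaying motion from", traj_file)

def execute(sequence):
    """Dispatch each parsed (file, word) pair to vision or motion."""
    for filename, word in sequence:
        if filename.endswith(".traj"):
            play_trajectory(filename)
        else:
            find_object(filename)

# The parsed sequence from the example above:
execute([("trajectory5.traj", "moveto"), ("picture3.jpg", "table"),
         ("trajectory2.traj", "grasp"), ("picture1.jpg", "apple")])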

There are multiple techniques available for programming such a chatbot in detail. It's possible to use ordinary programming languages or more advanced deep neural networks. What these methods have in common is that they always require a dataset in the background. Somebody has to create a table of pictures showing objects and annotate the pictures. Also, a dataset with mocap data is needed. Such a dataset allows a chatbot to convert a short sentence into something meaningful. Meaning is equal to a translation task from a word into a picture, and from a word into a trajectory.

So we can say that a chatbot is the frontend of an AI, while the dataset is the backend.