Maniac Mansion is a well-known point&click adventure. With the help of a walkthrough tutorial it is possible to win the game. The standard tutorial consists of key points and full sentences written in English, which can be read by humans but cannot be executed by a computer. With a converter from the high-level to the low-level layer it is possible to transform the walkthrough into machine-readable commands. The process is demonstrated with the following JSON code addressing the kitchen scene:
{
"card_id": "MM_KITCHEN_01",
"scene_title": "The Mansion Kitchen - ScummVM Navigation",
"content": {
"textual_description": {
"objective": "Enter the kitchen to retrieve the Small Key from the counter while staying alert for Nurse Edna.",
"key_points": [
"The kitchen is located through the first door on the right in the main hallway.",
"Crucial Item: The Small Key is sitting on the counter near the sink.",
"Hazard: Opening the refrigerator triggers a cutscene/event that can lead to capture.",
"Exit Strategy: Use the door to the far right to enter the Dining Room if the hallway is blocked."
]
},
"low_level_representation": {
"engine_context": "ScummVM - 320x200 Resolution (Original Scale)",
"mouse_interactions": [
{
"step": 1,
"verb_action": "PICK UP",
"verb_coordinates": { "x": 40, "y": 175 },
"target_object": "Small Key",
"target_coordinates": { "x": 165, "y": 115 },
"result": "Key added to character inventory."
},
{
"step": 2,
"verb_action": "WALK TO",
"verb_coordinates": { "x": 10, "y": 165 },
"target_location": "Dining Room Door",
"target_coordinates": { "x": 305, "y": 110 },
"result": "Character transitions to the next room."
}
],
"safety_note": "Avoid clicking 'OPEN' (x: 10, y: 175) on the Refrigerator (x: 240, y: 90) unless you have a specific distraction planned."
}
}
}
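Before the low-level card can be sent to the engine, its 320x200 coordinates have to be mapped to the actual window resolution. The following is a minimal sketch of that step; the function names, the example window size of 960x600 and the resulting 3x scale factor are illustrative assumptions, not part of the card format. In practice the returned points would be forwarded to a mouse-automation tool such as pyautogui.

```python
# Sketch: scale the 320x200 SCUMM coordinates from the JSON card to the
# actual window resolution. The 960x600 window size is an assumption.

def scale_point(x, y, native=(320, 200), window=(960, 600)):
    """Map a point from the native SCUMM resolution to window pixels."""
    sx = window[0] / native[0]
    sy = window[1] / native[1]
    return (round(x * sx), round(y * sy))

def ground_step(step):
    """Turn one mouse_interactions entry into two scaled click points:
    first the verb in the UI tray, then the target object."""
    verb = step["verb_coordinates"]
    target = step["target_coordinates"]
    return [scale_point(verb["x"], verb["y"]),
            scale_point(target["x"], target["y"])]

step1 = {
    "verb_action": "PICK UP",
    "verb_coordinates": {"x": 40, "y": 175},
    "target_coordinates": {"x": 165, "y": 115},
}
clicks = ground_step(step1)  # e.g. [(120, 525), (495, 345)] at 3x scale
```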
Both layers (low level and high level) describe the same scene, which is to enter the kitchen and fetch the key. The difference is that the layers have a different abstraction level. The high-level layer is preferred by humans and mirrors how humans think and how they use language. In contrast, the low-level layer is preferred by machines, which are programmed with a logic-oriented mathematical notation.
The converter has the task of translating between these layers, which is known as the symbol grounding problem. Solving the grounding problem means improving human-to-machine interaction.
Robotics and Artificial Intelligence
March 28, 2026
Human to robot interaction with a DIKW pyramid
Solving the first scene in Maniac Mansion with a DIKW pyramid
Symbol grounding basically means converting an abstract description into a detailed description. A concrete example with three layers for the first scene of the point&click adventure Maniac Mansion is shown next. The textual description can be understood by a human easily but cannot be submitted directly to a computer. In contrast, the low-level "pyautogui_commands" are hard to read for a human but can be processed by a computer program with ease.
In other words, symbol grounding means that an algorithm converts the high-level description into low-level commands. With such a grounding algorithm it is possible to script the game by providing a textual description, and the artificial intelligence converts this description into mouse movements which are submitted to the ScummVM engine.
{
"notecard_id": 1,
"scene_title": "The Front Yard",
"content": {
"textual_description": {
"objective": "Gain entry to the Edison Mansion.",
"key_points": [
"Start with Dave outside the main gate.",
"Walk toward the front door of the mansion.",
"The door is locked; the key is hidden nearby.",
"Look under the doormat to find the silver key.",
"Use the key to unlock the door and enter."
]
},
"low_level_representation": {
"resolution_reference": "800x600",
"actions": [
{
"step": 1,
"action": "Select Verb: WALK TO",
"pixel_coords": [120, 480],
"note": "Clicking the 'Walk to' verb in the UI tray."
},
{
"step": 2,
"action": "Target: Front Door",
"pixel_coords": [400, 300],
"note": "Moving the character to the mansion entrance."
},
{
"step": 3,
"action": "Select Verb: PULL",
"pixel_coords": [250, 480],
"note": "Preparing to move the mat."
},
{
"step": 4,
"action": "Target: Doormat",
"pixel_coords": [400, 420],
"note": "Revealing the hidden key."
},
{
"step": 5,
"action": "Select Verb: PICK UP",
"pixel_coords": [50, 520],
"note": "Collecting the key."
}
]
},
"pyautogui_commands": [
"import pyautogui",
"pyautogui.PAUSE = 0.5",
"# Walk to door",
"pyautogui.click(120, 480)",
"pyautogui.click(400, 300)",
"# Pull mat",
"pyautogui.click(250, 480)",
"pyautogui.click(400, 420)",
"# Pick up key",
"pyautogui.click(50, 520)",
"pyautogui.click(405, 425)"
]
}
}
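The "actions" list above can be replayed mechanically. A minimal sketch, assuming the game window starts at the screen origin and the 800x600 coordinates are usable directly; the function stays pure (it only collects the click points), so in practice each returned pair would be forwarded to pyautogui.click(x, y) as in the "pyautogui_commands" list:

```python
# Sketch: flatten the card's low-level "actions" list into an ordered click
# script. A real driver would pass each (x, y) pair to pyautogui.click.

def actions_to_clicks(actions):
    clicks = []
    for act in sorted(actions, key=lambda a: a["step"]):
        x, y = act["pixel_coords"]
        clicks.append((x, y))
    return clicks

card_actions = [
    {"step": 1, "action": "Select Verb: WALK TO", "pixel_coords": [120, 480]},
    {"step": 2, "action": "Target: Front Door", "pixel_coords": [400, 300]},
]
script = actions_to_clicks(card_actions)  # [(120, 480), (400, 300)]
```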
March 27, 2026
Descending the DIKW pyramid, with Zak McKracken as example
For a large language model to play through a video game automatically, it needs several levels of the DIKW pyramid. On layer 3 (knowledge) a game scene is described in bullet points at a very high level of abstraction. This is then translated into layer 2, which is much more precise but less easy to read for a human, and finally transformed into layer 1, which represents the low-level data layer. Layer 1 can then be sent to the game engine, i.e. to ScummVM, which runs the point&click adventure.
Here are all three layers of the DIKW pyramid in a clear JSON notation.
{
"game": "Zak McKracken and the Alien Mindbenders",
"card_id": 1,
"title": "Morgenroutine in San Francisco",
"representation_1_natural_language": {
"format": "Karteikarte (Menschlich)",
"content": [
"Wache in Zaks Schlafzimmer auf.",
"Nimm das Aquarium-Netz unter dem Bett.",
"Gehe ins Wohnzimmer und nimm die Fernbedienung vom Fernseher.",
"Gehe in die Küche.",
"Nimm das stumpfe Brotmesser aus der Spüle.",
"Öffne den Kühlschrank und nimm das Ei."
]
},
"representation_2_intermediate_logic": {
"format": "Text-to-Action Reasoning (Zwischenschritt)",
"note": "Hier werden implizite Aktionen und Raumwechsel für die KI logisch explizit gemacht.",
"logic_chain": [
{"state": "Room: Bedroom", "goal": "Inventory: Fishnet", "sub_action": "PickUp(Fishnet, under_bed)"},
{"state": "Room: Bedroom", "goal": "Change Room", "sub_action": "WalkTo(Door_West)"},
{"state": "Room: Living Room", "goal": "Inventory: Remote", "sub_action": "PickUp(Remote_Control, on_TV)"},
{"state": "Room: Living Room", "goal": "Change Room", "sub_action": "WalkTo(Door_North)"},
{"state": "Room: Kitchen", "goal": "Inventory: Knife", "sub_action": "PickUp(Bread_Knife, in_Sink)"},
{"state": "Room: Kitchen", "goal": "Access Fridge", "sub_action": "Open(Refrigerator)"},
{"state": "Room: Kitchen", "goal": "Inventory: Egg", "sub_action": "PickUp(Egg, inside_Fridge)"}
]
},
"representation_3_low_level_scumm": {
"format": "SCUMM Engine Executable (Low Level)",
"note": "Direkte Opcode-artige Anweisungen, die Objekten IDs und Verben zuordnen (fiktive IDs).",
"commands": [
{"op": "CUTSCENE_START"},
{"op": "PICK_UP", "obj_id": 142, "comment": "Fishnet"},
{"op": "WALK_TO_OBJECT", "obj_id": 201, "comment": "Door to Living Room"},
{"op": "PICK_UP", "obj_id": 155, "comment": "Remote Control"},
{"op": "WALK_TO_OBJECT", "obj_id": 202, "comment": "Door to Kitchen"},
{"op": "PICK_UP", "obj_id": 160, "comment": "Bread Knife"},
{"op": "OPEN", "obj_id": 175, "comment": "Refrigerator"},
{"op": "PICK_UP", "obj_id": 176, "comment": "Egg"},
{"op": "CUTSCENE_END"}
]
}
}
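The translation from representation 2 down to representation 3 can be sketched as a lookup table plus a tiny parser. This is only an illustration of the idea: the object IDs below mirror the fictive IDs from the JSON above, and the parsing of "Verb(Object, place)" strings is an assumption about the sub_action format.

```python
# Sketch: ground the layer-2 sub_actions into layer-1/SCUMM-style opcodes
# using a lookup table (fictive object IDs taken from the JSON above).
import re

OBJECT_IDS = {"Fishnet": 142, "Door_West": 201, "Remote_Control": 155,
              "Door_North": 202, "Bread_Knife": 160, "Refrigerator": 175,
              "Egg": 176}
VERB_OPS = {"PickUp": "PICK_UP", "WalkTo": "WALK_TO_OBJECT", "Open": "OPEN"}

def ground(sub_action):
    """Translate e.g. 'PickUp(Fishnet, under_bed)' into an opcode dict."""
    m = re.match(r"(\w+)\((\w+)", sub_action)
    verb, obj = m.group(1), m.group(2)
    return {"op": VERB_OPS[verb], "obj_id": OBJECT_IDS[obj]}

opcode = ground("PickUp(Fishnet, under_bed)")
```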
The interesting aspect of this approach is the absence of artificial intelligence in the classical sense. There is no neural network or reinforcement learning algorithm playing through the game; instead, the AI is implemented so that it translates between the layers of the DIKW pyramid. If a different action should be executed in the video game, one writes different key points into the topmost layer, which are then transformed in real time into the lower layers and executed. The principle is comparable to Unix pipes, i.e. a filter cascade that translates input into output.
LLMs as in-game AI
Large language models enjoy great popularity, but so far their potential is only partially exploited. LLMs are used as search engines, for programming, or to answer questions. In contrast, they are hardly ever used to play through computer games automatically. There are newer developments such as DeepMind SIMA 2, which is quite capable of acting within virtual worlds, but the token consumption is high and there is little documentation about it.
Today's LLMs are quite capable of acting in virtual worlds. As an example, here is a bullet-point walkthrough for the Maniac Mansion point&click adventure, generated from scratch by an LLM:
Card 1: Start
Team choice: Recommended: Syd or Razor (music talent) + Bernard (technology).
Entrance: Go to the house. The door key lies under the doormat.
Entrance hall: Use the key with the front door. Inside: beware of Edna or Weird Ed! If they come, quickly flee into a side room.
Kitchen: Go right through the door. Take the meat from the refrigerator and the soda can (for later).
Card 2: The dungeon & the escape plan
Trigger: If a character gets caught, he ends up in the dungeon.
Breakout: Push the loose brick in the wall (left of the door) to open the door briefly.
Tactics: Use one character as "bait" to distract the Edisons while the other two explore the house.
Important: Save often! One wrong step leads to game over.
Card 3: The supplies on the ground floor
Dining room: Take the fruit from the bowl.
Pantry: Take the glass cutter (behind the glass cabinet) and the juice.
Living room: Use the cassette with the cassette recorder (important for later, to shatter the glass).
Library: Turn on the lamp to light the room. Take the loose-leaf catalog.
Card 4: Weird Ed's room (first floor)
Caution: If Ed sees you, you get thrown out.
Distraction: Wait until Ed leaves his room (to fetch the package) or use the doorbell at the front door to lure him downstairs.
Item: Take the hamster (only relevant as Razor/Syd for evil endings) and the piggy bank. Break open the piggy bank to get money.
Key: Take the key from the nightstand.
Card 5: The music room & the glass pane
Preparation: Go to the music room (first floor).
Action: Use the glass cutter with the display case to steal the cassette.
Piano: If you are Razor or Syd, you can play here, but the goal is the record in the room next door.
Tip: You have to put the record on the recorder in the living room and record it, in order to later shatter the window in the lab.
Card 6: The swimming pool (outdoor area)
Danger: The water is radioactive! Do not go in.
Action: Drain the pool by using the valve in the basement (below the kitchen).
Find: The glowing key lies at the bottom of the empty pool.
Important: Refill the pool immediately, otherwise the reactor overheats (a game-over timer starts!).
Card 7: The green tentacle
Location: Second floor, on the way to the attic.
Problem: The green tentacle is hungry and won't let you pass.
Solution: Give it the fruit plate from the dining room and the juice (or the soda).
Result: It becomes your friend and lets you pass. You now have access to the upper rooms and the radio.
Card 8: The package & the telephone
Package: When the doorbell rings, run to the front door with one character and grab the package before Ed gets it.
Contents: Stamps.
Telephone: Use the telephone in the study. Dial the number of the "butcher" (found in the loose-leaf catalog) to distract Edna.
Next step: While Edna is on the phone, sneak into her room to find the key to the laboratory.
This guide prescribes, in natural language, a procedure for playing the game successfully. The only problem with this guide is that it is not executable computer code but addressed to human readers. In the DIKW pyramid the walkthrough is thus located on layer 3 (knowledge). For an AI to play through Maniac Mansion automatically, the guide has to be translated to a lower DIKW level, i.e. to level 2 and level 1 (data).
This is realized with a text-to-action model. It receives a note card as input and produces the mouse movements as output.
Here are the simulated mouse movements for note card #1 within the SCUMM engine at a resolution of 320x200 pixels. The JSON file contains the same instructions as the textual walkthrough, with the only difference that it is located not on DIKW layer 3 but on the lowest layer 1. As a consequence, there are numerical coordinates that define exactly where the mouse cursor is moved.
{
"card_id": 1,
"title": "Start",
"steps": [
{
"action_order": 1,
"description": "Walk to the front door area",
"command": "WALK_TO",
"target_coords": {"x": 160, "y": 140},
"wait_ms": 2000
},
{
"action_order": 2,
"description": "Pick up the door mat",
"verb_click": {"x": 40, "y": 170, "label": "PICK_UP"},
"object_click": {"x": 155, "y": 155, "label": "DOOR_MAT"},
"wait_ms": 1500
},
{
"action_order": 3,
"description": "Pick up the key under the mat",
"verb_click": {"x": 40, "y": 170, "label": "PICK_UP"},
"object_click": {"x": 155, "y": 155, "label": "KEY"},
"wait_ms": 1000
},
{
"action_order": 4,
"description": "Use key with front door",
"verb_click": {"x": 80, "y": 180, "label": "USE"},
"inventory_click": {"x": 300, "y": 170, "label": "KEY"},
"object_click": {"x": 160, "y": 100, "label": "FRONT_DOOR"},
"wait_ms": 3000
},
{
"action_order": 5,
"description": "Enter the house",
"command": "WALK_TO",
"target_coords": {"x": 160, "y": 90},
"wait_ms": 2000
},
{
"action_order": 6,
"description": "Go to the kitchen (right door)",
"command": "WALK_TO",
"target_coords": {"x": 280, "y": 120},
"wait_ms": 2500
},
{
"action_order": 7,
"description": "Open refrigerator",
"verb_click": {"x": 40, "y": 180, "label": "OPEN"},
"object_click": {"x": 100, "y": 100, "label": "REFRIGERATOR"},
"wait_ms": 1000
},
{
"action_order": 8,
"description": "Pick up the meat",
"verb_click": {"x": 40, "y": 170, "label": "PICK_UP"},
"object_click": {"x": 105, "y": 110, "label": "MEAT"},
"wait_ms": 1000
}
]
}
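The steps in this card come in two shapes: a WALK_TO command with target coordinates, or a verb click followed by an optional inventory click and an object click. A small interpreter can normalize both shapes into a timed click sequence. This is a sketch only; the (x, y, wait_ms) tuple format is an assumption, and a real driver would forward each pair to the mouse and then sleep for wait_ms.

```python
# Sketch: interpret one step of note card #1 into a timed click sequence.
# Each tuple is (x, y, wait_ms); the wait is attached to the last click.

def step_to_clicks(step):
    wait = step.get("wait_ms", 0)
    if "command" in step:                      # e.g. WALK_TO
        c = step["target_coords"]
        return [(c["x"], c["y"], wait)]
    clicks = [(step["verb_click"]["x"], step["verb_click"]["y"], 0)]
    for key in ("inventory_click", "object_click"):
        if key in step:
            c = step[key]
            clicks.append((c["x"], c["y"], 0))
    clicks[-1] = (clicks[-1][0], clicks[-1][1], wait)
    return clicks

walk = {"command": "WALK_TO", "target_coords": {"x": 160, "y": 140},
        "wait_ms": 2000}
pick = {"verb_click": {"x": 40, "y": 170}, "object_click": {"x": 155, "y": 155},
        "wait_ms": 1500}
walk_clicks = step_to_clicks(walk)  # [(160, 140, 2000)]
pick_clicks = step_to_clicks(pick)  # [(40, 170, 0), (155, 155, 1500)]
```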
March 25, 2026
DIKW database for a warehouse robot
The following DIKW pyramid was simplified to only the two bottom layers. It is stored in a JSON database with two different tables. The data layer stores the numerical sensor data like lidar distance and battery voltage of the robot, while the information layer stores semantic tags. The task for the robot is to translate between both layers back and forth, which is called symbol grounding.
{
"dikw_model": {
"data_layer": {
"lidar_distance_cm": 12.5,
"ultrasonic_proximity": 0.15,
"camera_rgb_average": [120, 120, 120],
"encoder_ticks": 4502,
"battery_voltage": 11.2
},
"information_layer": {
"spatial_context": ["obstacle", "near_field"],
"navigation_tag": "left_quadrant_blocked",
"surface_type": "concrete",
"status": "low_battery_warning",
"motion_state": "decelerating"
}
}
}
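The upward direction of this grounding task (data layer to information layer) can be sketched with simple threshold rules. The concrete thresholds below, 30 cm for the near field and 12 V for the battery warning, are illustrative assumptions, not specifications of a real robot.

```python
# Sketch: translate the numerical data layer upward into semantic tags
# on the information layer. Thresholds are illustrative assumptions.

def ground_upward(data):
    info = {"spatial_context": [], "status": "ok"}
    if data["lidar_distance_cm"] < 30:
        info["spatial_context"] += ["obstacle", "near_field"]
    if data["battery_voltage"] < 12.0:
        info["status"] = "low_battery_warning"
    return info

sensors = {"lidar_distance_cm": 12.5, "battery_voltage": 11.2}
tags = ground_upward(sensors)
# tags == {'spatial_context': ['obstacle', 'near_field'],
#          'status': 'low_battery_warning'}
```

The downward direction would invert these rules, e.g. a planner emits the tag "decelerating" and the motor controller resolves it into concrete encoder targets.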
Line breaking algorithm in typst
To investigate whether the current quality of Typst typesetting fulfills the needs of an academic paper, let us benchmark the algorithm with a complex example, which is three-column typesetting. In a three-column layout each column is very narrow, which makes it harder for the algorithm to avoid white space between the words.
Of course, some white space remains, but in general the output has an average quality. LaTeX would be able to reduce the white space with microtypography tricks not available in Typst. It is a subjective opinion which of the systems is preferred.
March 24, 2026
Language parsing with a DIKW pyramid
A single sentence can be submitted to the DIKW database, and the database resolves the request so that it becomes machine readable. Going upwards and downwards in the DIKW pyramid is equivalent to symbol grounding.
A DIKW pyramid consists of layers which store different sorts of information. The lowest layer is accessible to a computer program and consists of locations in a map, trajectories, sprites, tile maps and numerical color information. A possible entry might be [100,30] for a position in a map or (100,120,90) for an RGB color.
On the next layer, "information", a different sort of content is stored, namely words. A word is a string which can be understood by a human but doesn't make sense to a computer. For a human the word "wood" makes sense, but for a computer the same string is only an array of characters without any meaning. It is the task of the DIKW pyramid to link the word "wood" with a location in the map. The link allows the computer to resolve the meaning.
March 23, 2026
Language enabled artificial intelligence
AI research in the past was dominated by an algorithm-centric bias. The goal was mostly to invent an advanced computer program which simulates intelligence. The idea was inspired by chess engines, which search the game state for the next action. Robotics projects were engineered with the same objective. Notable algorithms are RRT for kinodynamic planning or genetic algorithms for artificial life.
In the 1990s and until the 2000s this paradigm was accepted as the state-of-the-art attempt to realize artificial intelligence. Unfortunately, none of the described techniques was successful. The robots are not working, and the artificial life simulations didn't evolve into a life form.
Something was missing until the 2000s for enabling artificial intelligence, and the missing element is natural language. At first glance this explanation doesn't make sense, because there are lots of examples of text adventures and language-understanding AI projects in the past, so the principle isn't new and can't be the explanation of how to realize AI. Typical well-known examples from the past are SHRDLU, the Maniac Mansion game which was based on a simple two-word English parser, and a speech-enabled robot from 1989 (SAM by Michael Brown).
The breakthrough after 2010 was to focus again on language-guided robotics and implement these projects with more effort. Instead of programming a video game like Maniac Mansion, the goal was to program a text interface for robot control. Instead of realizing the parser in simple C code, the parser is realized with a neural network. So we can say that AI after the year 2010 has put natural language into the center of attention and created new algorithms and software around the problem. This attempt was very successful. It is possible to control robots with language and, very importantly, it is even possible to scale up the approach so that the robot understands more words and solves more demanding tasks.
From a bird's-eye perspective, the situation until 2010 was to implement closed systems. AI was imagined as an autonomous system which operates with algorithms and has no need to talk to its environment. In contrast, AI after the year 2010 works as an open system. The robot receives commands from the human operator and sends signals back to the operator. In other words, modern robotics is always remote controlled with a text-based interface. It is not possible to implement artificial intelligence any other way; text-based interaction is the core element of any AI system.
It is only a question of detail how to program such an open system. One attempt might be to utilize neural networks and learn the language from a dataset. Another attempt is to program the parser as a classical computer program without neural networks, while a third technique is to invent a domain-specific language used for human-to-robot interaction. All these approaches have in common that natural language is the core building block. Natural language is used as an abstraction mechanism to compress the complex reality into a list of words. A typical entry-level robot knows around 200 words to describe the environment, including possible actions. These words are used by the robot to interact with a human operator. So the AI problem is mostly a communication problem, similar to transmitting messages over a wire.
The paradigm shift from former computation to modern communication is the breakthrough that enables artificial intelligence. Natural language used for human communication is also a powerful tool for human-to-robot communication. The English vocabulary is seen as a hammer for solving problems. Any problem in robotics gets reformulated into a language problem. There is no limit visible; problems like biped walking, navigation in a maze and pick&place can all be reformulated as language games.
Such utilization of natural language wasn't available before the year 2010. The only things known were isolated projects which explored whether language might be useful for robotics. There was no understanding that natural language is the core element in AI and needs to be implemented in any possible robot or AI problem.

