December 07, 2025

AI pose estimation

Compared to large language models, pose estimation on a computer does not sound very interesting. The algorithm detects that the person on the screen stands on one leg or lifts an arm. Such an algorithm can be realized with a small neural network or even with a hand-coded software routine. On the other hand, there are good arguments why AI pose estimation is an underestimated technology that deserves a detailed description.
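To make the "hand-coded software routine" concrete, here is a minimal sketch. It assumes the keypoints have already been extracted as 2D image coordinates (y grows downward). The keypoint names, the function describe_pose and the threshold lift_ratio are made up for this illustration and are not taken from any particular library.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates, y grows downward


def describe_pose(kp: Dict[str, Point], lift_ratio: float = 0.15) -> List[str]:
    """Return short text labels for a pose given as named 2D keypoints."""
    labels: List[str] = []

    # Rough body height (shoulder to lowest ankle) so the foot threshold
    # does not depend on how large the person appears in the image.
    lowest_ankle_y = max(kp["left_ankle"][1], kp["right_ankle"][1])
    highest_shoulder_y = min(kp["left_shoulder"][1], kp["right_shoulder"][1])
    body_height = lowest_ankle_y - highest_shoulder_y

    # An arm counts as lifted when the wrist is above the shoulder.
    for side in ("left", "right"):
        if kp[f"{side}_wrist"][1] < kp[f"{side}_shoulder"][1]:
            labels.append(f"{side} arm up")

    # One foot clearly higher than the other means the person stands on
    # the other leg. The foot with the larger y value is on the ground.
    diff = kp["left_ankle"][1] - kp["right_ankle"][1]
    if abs(diff) > lift_ratio * body_height:
        standing = "left" if diff > 0 else "right"
        labels.append(f"stands on {standing} leg")

    return labels


example = {
    "left_shoulder": (40.0, 50.0), "right_shoulder": (60.0, 50.0),
    "left_wrist": (35.0, 20.0), "right_wrist": (65.0, 90.0),
    "left_ankle": (45.0, 200.0), "right_ankle": (55.0, 150.0),
}
print(describe_pose(example))
```

For the example pose the routine reports a lifted left arm and a stance on the left leg. A real system would need more rules and some tolerance for noisy keypoints, but the principle of turning coordinates into words stays the same.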

It's correct that AI pose estimation by itself looks a bit boring, but pose estimation is a required step towards humanoid robotics. The same mapping that converts a scene into text can also be run in reverse to convert text into a scene. This is called an AI animation system: the user enters a description like "show the hand with 5 fingers" and the character on the screen does so. Such a text-to-animation system is only one step away from humanoid robotics. A humanoid robot can perform the same task in physical reality. The human operator may say "stand on the left leg" and the 2 meter tall robot does so.

In all these cases the principle is translation between a pose and textual prompts. The ability to map a pose to text allows a machine to work with its meaning. Text can be stored in a small amount of RAM and can be processed by other AI algorithms on the system. A longer robot sequence is generated by providing a longer text stream. Robot control then usually means specifying in English sentences what the robot should do next. These text commands are converted into poses and then into animation.
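The pipeline from text commands to poses to animation can be sketched in a few lines. This is only an illustration under simple assumptions: a pose is a dictionary of joint angles in degrees, the command table COMMANDS and its angle values are invented for the example, every command starts from the neutral pose, and the animation is plain linear interpolation between key poses.

```python
from typing import Dict, List

Pose = Dict[str, float]  # joint name -> angle in degrees

NEUTRAL: Pose = {
    "left_hip": 0.0, "right_hip": 0.0,
    "left_shoulder": 0.0, "right_shoulder": 0.0,
}

# Hypothetical command table: each sentence overrides some joints of NEUTRAL.
COMMANDS: Dict[str, Pose] = {
    "stand": {},
    "stand on left leg": {"right_hip": 60.0},
    "stand on right leg": {"left_hip": 60.0},
    "raise left arm": {"left_shoulder": 170.0},
    "raise right arm": {"right_shoulder": 170.0},
}


def text_to_pose(command: str) -> Pose:
    """Look up one command and return the full target pose."""
    key = command.strip().lower()
    if key not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return {**NEUTRAL, **COMMANDS[key]}


def interpolate(start: Pose, end: Pose, steps: int) -> List[Pose]:
    """Return `steps` frames moving linearly from start to end (end included)."""
    frames = []
    for i in range(1, steps + 1):
        t = i / steps
        frames.append({j: start[j] + t * (end[j] - start[j]) for j in start})
    return frames


def animate(script: str, steps_per_command: int = 4) -> List[Pose]:
    """Turn a ';'-separated text stream into a list of animation frames."""
    current = dict(NEUTRAL)
    frames: List[Pose] = []
    for command in script.split(";"):
        if not command.strip():
            continue
        target = text_to_pose(command)
        frames.extend(interpolate(current, target, steps_per_command))
        current = target
    return frames


frames = animate("raise left arm; stand on left leg", steps_per_command=2)
print(len(frames), frames[-1])
```

A longer text stream simply produces more frames. On a robot the frames would be sent to the joint controllers instead of being drawn on the screen.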

The single most important technique is the use of natural language as an abstraction mechanism. Every possible pose is described with words like "leg, knee, hand, left, right, up, down" and so on. These English words reduce complexity, which is the major problem in Artificial Intelligence. Instead of planning low-level trajectories or searching the state space of 3D poses, the more elaborate alternative is to focus on the textual description and to program a text-to-image converter as a second layer.
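How much the vocabulary shrinks the problem can be shown with a small calculation. The sketch below is an assumption-laden illustration: it builds all phrases from three tiny word lists and compares their number with a raw state space in which every joint angle is discretized into a fixed number of bins. The numbers of joints and bins are arbitrary example values.

```python
from itertools import product
from typing import List

SIDES = ("left", "right")
PARTS = ("leg", "knee", "hand")
DIRECTIONS = ("up", "down")


def all_phrases() -> List[str]:
    """Every 'side part direction' phrase the small vocabulary can express."""
    return [" ".join(words) for words in product(SIDES, PARTS, DIRECTIONS)]


def raw_state_count(joints: int, bins_per_joint: int) -> int:
    """Number of poses when each joint angle is split into discrete bins."""
    return bins_per_joint ** joints


phrases = all_phrases()
print(len(phrases), phrases[:3])
print(raw_state_count(joints=6, bins_per_joint=36))
```

Seven words yield 12 phrases, while 6 joints with 36 angle bins each already give 2176782336 raw poses. A search over phrases is therefore far cheaper than a search over raw poses, at the price of a coarser description.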

Technically, pose estimation software including a text-to-animation system can be realized on outdated hardware from the early 1980s. Even an 8-bit home computer like the C64 is more than capable of doing so. The challenge is not the programming task but recognizing that pose estimation leads to robotics.
