July 16, 2019

Q-learning simulates a physics engine


There is a huge number of tutorials about Q-learning. One of the best, http://mnemstudio.org/path-finding-q-learning-tutorial.htm, explains Q-learning as a graph search technique. I'd like to describe the idea from an abstract point of view. Every game needs a game engine, and the game engine answers what-if questions. In the case of Q-learning, the game engine, also known as the forward model, is stored in the reward matrix. The reward matrix is static, which means it never changes and encodes the game rules.
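As an illustration (my own toy graph, not the one from the linked tutorial), a reward matrix for a small five-node world could look like this in Python, where -1 marks a missing edge, 0 an ordinary edge, and 100 an edge into the goal node:

import numpy as np

# Hypothetical 5-node graph, node 4 is the goal.
# R[s][a] = -1 means there is no edge from node s to node a.
R = np.array([
    [ -1,   0,  -1,   0,  -1],   # node 0 links to 1 and 3
    [  0,  -1,   0,  -1,  -1],   # node 1 links to 0 and 2
    [ -1,   0,  -1,   0, 100],   # node 2 links to 1, 3 and to the goal
    [  0,  -1,   0,  -1,  -1],   # node 3 links to 0 and 2
    [ -1,  -1,   0,  -1, 100],   # goal node 4
])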
What makes Q-learning unique is that the game engine not only provides the follow-up state but also a quality value that says whether an action makes sense or not. This q-value is stored in the Q-matrix and is improved step by step by the learning algorithm.
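The improvement step is the usual Q-learning update. A minimal sketch, assuming the reward matrix R from above and a discount factor GAMMA that I picked arbitrarily:

GAMMA = 0.8   # discount factor, chosen arbitrarily for this sketch

def update_q(Q, R, state, action, next_state):
    # new quality of (state, action) = immediate reward from the R matrix
    # plus the discounted best quality reachable from the follow-up state
    Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()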
Let us give a short example. Suppose a robot has to traverse a graph. It starts at the root node and wants to walk to the goal node. The possible links in the graph are stored in the forward model of the game. This forward model is given by the R-matrix, but it can also be implemented in source code instead of a matrix. The forward model defines which actions are possible in each situation, that is, how the game can be played. Now we can introduce the learning aspect of Q-learning, which is given by the Q-table. Creating a Q-table means annotating the graph with a cost function. The robot has a choice between different follow-up nodes, and the q-value simplifies its decision. The computer science term for this is a weighted graph.
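A training loop that fills such a Q-table could look like the following sketch. It continues the code from above; the GOAL and EPISODES values are placeholders for the toy graph, and the exploration is purely random:

import random

GOAL = 4          # goal node of the hypothetical graph above
EPISODES = 500

Q = np.zeros_like(R, dtype=float)

for _ in range(EPISODES):
    state = random.randrange(len(R))
    while state != GOAL:
        # the forward model: only edges with R >= 0 are legal actions
        actions = [a for a in range(len(R)) if R[state, a] >= 0]
        action = random.choice(actions)
        next_state = action   # in this graph world the chosen action is the follow-up node
        update_q(Q, R, state, action, next_state)
        state = next_state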
Basically speaking, Q-learning allows learning a weighted graph. At the beginning only the links between the nodes are known, but the costs are unknown. After the training is over, the graph is described in detail. The robot can follow the path with the lowest cost, and this path will lead it to the goal. The standard algorithm for finding the shortest path in a graph is Dijkstra's algorithm, a path planner for graphs whose edge costs are known. The interesting point is that Dijkstra's algorithm, A*, and RRT all have comparable efficiency: they find the shortest path in a short amount of time.
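For comparison, here is a minimal Dijkstra sketch over an explicit weighted graph; the neighbors dictionary and the edge costs are my own assumptions, not part of the original example:

import heapq

def dijkstra(neighbors, start, goal):
    # neighbors: dict mapping node -> list of (next_node, edge_cost)
    # returns the cheapest cost from start to goal, or None if unreachable
    queue = [(0, start)]
    visited = set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if node in visited:
            continue
        visited.add(node)
        for nxt, edge_cost in neighbors.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + edge_cost, nxt))
    return None

graph = {0: [(1, 1), (3, 1)], 1: [(2, 1)], 3: [(2, 1)], 2: [(4, 1)]}
print(dijkstra(graph, 0, 4))   # 3 hops in the toy graph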
What I want to explain is that Q-learning and A* are essentially the same technique: both search a graph for a goal node, and their performance is comparable. All the Q-learning toy problems (for example, a robot in a maze) can be solved with an A*-like algorithm in a similar amount of time. The term learning within the Q-learning paradigm means determining the fastest way of traversing the given graph. So in reality, Q-learning is a graph search algorithm.
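Once the Q-table is filled, playing the game is nothing more than a greedy traversal of the annotated graph, which is exactly what a path planner would output. A sketch, reusing the Q matrix and GOAL value from the training loop above:

def extract_path(Q, start, goal):
    # follow the highest q-value from node to node until the goal is reached
    path = [start]
    state = start
    while state != goal and len(path) < len(Q):
        state = int(Q[state].argmax())
        path.append(state)
    return path

print(extract_path(Q, 0, GOAL))   # with the toy graph above this typically prints [0, 1, 2, 4]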
Bottleneck of Q-learning
The bottleneck of Q-learning is the same as in all graph search algorithms: it only works if the number of nodes is small. A problem that consists of 100 nodes can be handled easily with A* or with Q-learning; the algorithm explores a bit of the state space and finds the optimal path in the graph. If the number of nodes is much greater, then A*, RRT, and especially Q-learning will fail. The Q-matrix can no longer be created, and the algorithm needs too many CPU resources.
If Q-learning is only a graph search algorithm, why does it work so well? Because graph search is a powerful problem-solving technique. If a problem has been converted into a graph, and the graph can be searched reasonably fast, then the software is able to figure out the correct actions. The question is not why graph search is able to play a game on its own; the more important question is under which conditions the strategy fails. The remaining bottlenecks are converting the problem into a game tree in the first place and handling larger trees. The normal graph traversal algorithms don't answer these problems, and that's the reason why Q-learning, A*, and RRT are useless for most complex tasks.