February 11, 2020

Extending learning from demonstration into reward learning from demonstration

The vanilla learning from demonstration idea is about recording the human's demonstration and replaying the trajectory on the robot. For example, the human operator executes a trajectory (100,10),(150,30),(150,80) and these waypoints are used to control the robot's arm.
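To make the idea concrete, here is a minimal sketch in Python. The function move_arm_to() is a hypothetical placeholder for the robot's motion interface, not a real API.

```python
# Minimal sketch of vanilla learning from demonstration: record the
# demonstrated waypoints and replay them one by one.
recorded_waypoints = [(100, 10), (150, 30), (150, 80)]

def move_arm_to(x, y):
    # Placeholder for the real robot controller (hypothetical).
    print(f"moving arm to ({x}, {y})")

def replay(waypoints):
    # Replay is purely static: the robot visits exactly the recorded
    # points, with no adaptation to a changed scene.
    for x, y in waypoints:
        move_arm_to(x, y)

replay(recorded_waypoints)
```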

The disadvantage is that the connection between demonstration and replay is very static. One option to avoid this drawback is to use the demonstration as an indirect pathway. In the literature the concept is called reward learning, and the idea is to create a heatmap. The heatmap allows the planner to find many different trajectories which all bring the robot into the goal state.
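The following sketch illustrates the indirect use of the demonstration: instead of replaying one fixed trajectory, candidate trajectories are scored against a reward map, and any of them that reaches the goal is acceptable. The reward values and coordinates are invented for illustration.

```python
# Hypothetical reward map derived from a demonstration: cells along the
# demonstrated path carry positive reward, the goal carries the highest.
reward_map = {(100, 10): 0.1, (150, 30): 0.5, (150, 80): 1.0}

def trajectory_score(trajectory):
    # Sum the reward collected along the trajectory; cells not in the
    # reward map contribute nothing.
    return sum(reward_map.get(point, 0.0) for point in trajectory)

candidates = [
    [(100, 10), (150, 30), (150, 80)],  # the demonstrated path
    [(100, 10), (120, 50), (150, 80)],  # an alternative that reaches the same goal
]

best = max(candidates, key=trajectory_score)
```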

A heatmap, also known as a costmap, is a visual representation of a learned cost function. The idea is that colors from green to red are shown as an overlay on top of the normal map. Which pixel gets which color is determined by the demonstration of the human operator. Basically speaking, the human demonstration creates a path in the map, and that path is extended into a colored heatmap. A trajectory planner like RRT is then used to find a path in this map.
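A minimal sketch of how a demonstrated path might be extended into such a costmap: cells close to the demonstration get a low cost (green), cells far away a high cost (red). The grid size and the distance-based cost function are assumptions for illustration, not the method of a specific paper.

```python
import math

# Turn a demonstrated path into a costmap on a small grid.
WIDTH, HEIGHT = 20, 10
demonstration = [(2, 2), (8, 3), (15, 8)]  # hypothetical demonstrated waypoints

def distance_to_path(cell, path):
    # Euclidean distance from a grid cell to the nearest demonstrated waypoint.
    return min(math.dist(cell, waypoint) for waypoint in path)

costmap = {
    (x, y): distance_to_path((x, y), demonstration)
    for x in range(WIDTH)
    for y in range(HEIGHT)
}

# A planner such as RRT or A* would now search this costmap for a path
# from start to goal that keeps the accumulated cost low; the demonstration
# only biases the search instead of dictating one fixed trajectory.
```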

Clicker training with dogs

In animal training there is a powerful technique available called clicker training. To the newcomer the technique is hard to understand. The human trainer uses a noise-making device and feeds the dog some cookies. After a while the dog is able to do lots of tricks. But how does it work from a technical perspective?

Teaching skills can be done in two forms: direct and indirect. Suppose the idea is to explain how to move from start to goal. This can be done by giving direct commands. At first, the dog has to walk 10 meters ahead, and then it has to go left for 5 meters. The problem with this method is that the explanation can't be adapted to new situations. For example, if the pathway is blocked, it makes no sense to walk 10 meters ahead. So the question is how to give a tutorial which is more flexible.

Clicker training works with a cost map. What the human trainer does with the noise-making device is produce a cost map for the dog. He sets reward points on the map. A reward is a situation in which the dog gets a cookie. The dog labels that point on the map with a positive reward. In replay mode, the dog approaches all the +1 rewards on its reward map, and this makes the human trainer happy.
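Here is a rough sketch of that reward map idea in Python. Each click marks the dog's current position with +1, and in replay the dog greedily heads for the nearest remembered reward. The positions and the greedy policy are illustrative assumptions, not a model of real animal cognition.

```python
# Places where a click/cookie happened, labeled +1 on the dog's map.
reward_points = [(3, 4), (7, 1), (5, 9)]

def nearest_reward(position, rewards):
    # Simple policy: head for the closest remembered reward (Manhattan distance).
    return min(rewards, key=lambda r: abs(r[0] - position[0]) + abs(r[1] - position[1]))

position = (0, 0)
remaining = list(reward_points)
visited_order = []
while remaining:
    target = nearest_reward(position, remaining)
    visited_order.append(target)
    remaining.remove(target)
    position = target
```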

In the case of spatial maps, it's not very complicated to imagine such a map. In abstract situations the map looks more complicated. For example, if the goal is not to reach a point in space but to walk in a circle, that is an abstract behavior. If the dog is smart, it can create a cost map for such abstract tasks as well.
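As a sketch of what such an abstract reward map could look like, the reward below does not mark a single point but a whole behavior: being on a circle of a given radius around a center. The center and radius are arbitrary illustration values.

```python
import math

CENTER = (5.0, 5.0)   # assumed center of the desired circle
RADIUS = 3.0          # assumed target radius

def circle_reward(x, y):
    # Highest reward exactly on the circle, falling off with the
    # deviation from the target radius.
    deviation = abs(math.dist((x, y), CENTER) - RADIUS)
    return 1.0 / (1.0 + deviation)
```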