The OpenAI Gym library is perhaps the most important reinforcement learning project available. It provides out-of-the-box environments for simulating control problems and points toward solving them with algorithms such as Q-learning and neural networks. The one shortcoming is that there is no documentation on how to actually control the example domains, such as the inverted pendulum problem.
Let us start with the basics, because many newcomers do not know how to run the simulation at all. After installing the openai gym library on a Linux or Windows operating system, the programmer can use it from Python. A simple example is given in the following source code.
import gym
import time, random

class Plaincartpole:
    def __init__(self):
        self.env = gym.make('CartPole-v0')
        observation = self.env.reset()
        for framestep in range(100):
            self.env.render()
            # choose a random action: 0 = push left, 1 = push right
            action = random.randint(0, 1)
            observation, reward, done, info = self.env.step(action)
            print("observation", observation, "reward", reward)
            time.sleep(0.5)
        self.env.close()

if __name__ == '__main__':
    p = Plaincartpole()
Apart from the “gym” library itself, two extra Python libraries are imported: one for generating random actions and one for slowing down the simulation. After executing the Python script, the user should see an inverted pendulum on the screen which is doing something.
On the terminal, the status is shown, which consists of the measured features and a reward value. This first Python script is nothing new or special; most OpenAI Gym tutorials work with this example. In the for loop of the script, the frame counter is increased and each time step is rendered to the graphical screen.
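For readability it helps to know what the four numbers in the observation mean: in CartPole they are the cart position, the cart velocity, the pole angle in radians and the pole angular velocity. A small variation of the print statement, shown here only as a sketch, labels the features individually:
# unpack the four CartPole features for a more readable printout
cart_pos, cart_vel, pole_angle, pole_vel = observation
print("cart position", cart_pos, "cart velocity", cart_vel,
      "pole angle", pole_angle, "pole angular velocity", pole_vel)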
After this trivial example is running, the more complicated question is how to control the pendulum. From a technical perspective, the user sends actions (0 = push left, 1 = push right) to the cart. Each action affects the system and the pendulum swings in a certain direction. It is important to know that the reward drops from 1.0 to 0.0 and the done flag becomes True once the pole tilts more than about 12 degrees from vertical or the cart leaves the track. That means the episode has ended and the control problem was not solved.
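A minimal sketch of how the done flag can be used: the loop from above runs with random actions, and whenever the episode terminates the environment is reset so a new attempt starts.
import gym
import random

# sketch: run random actions and restart whenever the episode ends
env = gym.make('CartPole-v0')
observation = env.reset()
for framestep in range(200):
    action = random.randint(0, 1)
    observation, reward, done, info = env.step(action)
    if done:
        # pole tilted too far or cart left the track: start a new episode
        observation = env.reset()
env.close()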
To generate a sequence of actions which can stabilize the pendulum, the first thing to know is that the reward provided by OpenAI Gym is the bottleneck. The built-in reward function does not provide useful feedback: it is simply a check whether the pole angle is still inside the allowed range of roughly ±12 degrees. A second problem with the reward function is that it can only become 0 or 1, never a value in between. This problem can be fixed easily with a self-created reward function.
import gym
import time, random

class Plaincartpole:
    def __init__(self):
        self.env = gym.make('CartPole-v0')
        observation = self.env.reset()
        for framestep in range(100):
            self.env.render()
            action = random.randint(0, 1)
            observation, reward, done, info = self.env.step(action)
            # handcrafted reward function: observation[2] is the pole angle in radians
            reward = 1 - abs(observation[2])
            if reward < 0: reward = 0
            print("observation", observation, "reward", reward)
            time.sleep(0.5)
        self.env.close()

if __name__ == '__main__':
    p = Plaincartpole()
The new reward function also measures the angle of the pole, but it provides more fine-grained information. If the pendulum is in the upright position the reward is 1.0, if it is rotated a bit the reward is, say, 0.8, and so on. The idea is that the original reward value delivered by the OpenAI Gym environment is overwritten by the output of a self-created function.
This handcrafted reward function can be modified according to the needs of the programmer. The example shows only a very basic version. It can be improved, for example, by also checking whether the cart has left the visible playfield.
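As a sketch of such an improvement, the fragment below keeps the angle-based reward from the previous script but forces it to zero as soon as the cart position leaves the track, whose limit in CartPole-v0 is about ±2.4 units:
# angle-based reward, zeroed when the cart leaves the track (limit about +/-2.4)
reward = 1 - abs(observation[2])
if reward < 0: reward = 0
if abs(observation[0]) > 2.4: reward = 0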
The idea is that random actions are sent to the system and then a reward is determined. Before the optimal control action can be determined, it has to be defined what the goal is. The goal is formalized in the reward function.
Let me give an example. Suppose the goal is to bring the cart into the middle of the playfield. The reward function would be:
# observation[0] is the cart position, 0.0 means the cart sits in the middle
reward = 1 - abs(observation[0])
if reward < 0: reward = 0
That means the cart position stored in the observation variable is converted into a numerical value. This value is 1 if the cart is exactly in the middle, it drops to 0.5 once the cart has moved half a unit away from the center, and it is clipped to 0 if the cart drifts further out or leaves the allowed range. In-between values are also produced, so it is a continuous reward function.
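To avoid copy-pasting these lines into every script, the position reward can be wrapped in a small helper function; the name position_reward is only a suggestion and not part of the gym API:
def position_reward(observation):
    # 1.0 when the cart is centered, clipped to 0.0 further out
    reward = 1 - abs(observation[0])
    if reward < 0:
        reward = 0
    return reward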
Or let me give a more advanced example. If the pole should stay upright and the cart should stay in the middle, the combined reward function is:
# reward for keeping the pole upright (observation[2] = pole angle)
rewarda = 1 - abs(observation[2])
if rewarda < 0: rewarda = 0
# reward for keeping the cart near the middle (observation[0] = cart position)
rewardb = 1 - abs(observation[0])
if rewardb < 0: rewardb = 0
# average of both sub-rewards
reward = (rewarda + rewardb) / 2
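The equal weighting of the two sub-rewards is only one possible choice. If balancing the pole is considered more important than centering the cart, a weighted average can be used instead; the weights 0.7 and 0.3 below are arbitrary example values:
# sketch: weighted combination, pole angle counts more than cart position
reward = 0.7 * rewarda + 0.3 * rewardb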
Example
Suppose a reward function was created which measures the position of the cart and ignores the angle of the pole. If the cart is outside of the playfield, the reward becomes zero. The idea is that the cart moves left or right, and while it is doing so, the reward is shown on the screen. It is some sort of score, like in a video game.
That means all the other features which are stored in the observation variable are no longer interesting; only the reward value is monitored. Winning the game means maximizing the reward, and different reward functions will result in different games. A controller which maximizes the cart-position reward will produce actions that keep the cart in the middle so it never leaves the playfield. So the reward function is some sort of constraint which defines what the problem is about.
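A sketch of this score idea, assuming the cart-position reward from above: the handcrafted reward is summed over the episode and printed like a game score.
import gym
import random

# sketch: accumulate the handcrafted cart-position reward as a game score
env = gym.make('CartPole-v0')
observation = env.reset()
score = 0.0
for framestep in range(200):
    action = random.randint(0, 1)
    observation, reward, done, info = env.step(action)
    reward = max(0.0, 1 - abs(observation[0]))   # handcrafted cart-position reward
    score += reward
    print("score", score)
    if done:
        break
env.close()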
The interesting point is that after changing the reward function the game engine remains the same; the pendulum falls with the same speed as before. The only new thing is that the reward score is determined differently.
Suppose there is a universal policy available which maximizes the reward function. The actions generated by this policy will depend on the reward function; that means, after adjusting the reward function, a new behavior is shown on the screen. Or, to explain it the other way around: the forward model of the gym environment, i.e. the simulation, remains the same, and the policy which converts the reward signal into actions is also the same. The only variable is the reward function, which is handcrafted by a human programmer.
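Such a universal policy is outside the scope of this text, but a very simple hand-written stand-in illustrates the idea for the angle reward: pushing the cart toward the side the pole is leaning to tends to keep the pole upright and therefore keeps the angle reward high. This is only a toy baseline under that assumption, not a learned policy.
import gym

# toy baseline policy: push the cart toward the side the pole is leaning to
env = gym.make('CartPole-v0')
observation = env.reset()
total = 0.0
for framestep in range(200):
    pole_angle = observation[2]
    action = 1 if pole_angle > 0 else 0   # 0 = push left, 1 = push right
    observation, reward, done, info = env.step(action)
    total += max(0.0, 1 - abs(pole_angle))   # handcrafted angle reward from above
    if done:
        break
env.close()
print("accumulated angle reward", total)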