In our last tutorial, we covered prediction using OgmaNeo2. We will now cover how to perform reinforcement learning using OgmaNeo2.

### Actions as Predictions

In OgmaNeo2, actions are treated as a type of prediction and are used in a similar fashion. One can view the sequence of actions taken as just another stream to predict. Of course, we cannot simply predict our own action sequence without modification (the predictions would start out random), so OgmaNeo2 provides a simple mechanism for delivering a reward to the hierarchy such that the predictions are “bent” towards those that receive higher rewards.

As shown in the previous tutorial, we can loop predictions back in as input. Actions do this all the time: we step the hierarchy, retrieve the action-prediction, take the action (possibly modified by the user, e.g. with additional exploration), and then pass that action back in at the next timestep. Since each action-prediction is a (t + 1) prediction of the last action taken at (t – 1), we are essentially just retrieving the action at time (t).
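In pseudocode, this feedback loop looks roughly as follows (using the step and prediction-retrieval calls that appear later in this tutorial; the exploration step is optional):

```
action = initial (e.g. random) action

loop:
    h.step(cs, [ observation, action ], learnEnabled, reward)  # feed the last action back in
    action = prediction of the action input layer              # retrieve the action for time t
    optionally modify action (exploration)
    observation, reward = environment.step(action)             # act
```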

### Cart-Pole

Here we will show you how to train an OgmaNeo2 agent to balance the pole in the OpenAI Gym Cart-Pole example. As in the last tutorial, we will again be using the Python bindings.

After installing both OgmaNeo2/PyOgmaNeo2 and Gym, we begin by importing the things we need:

```
import pyogmaneo
import gym
import numpy as np

```

We use NumPy in this example to simplify the manual pre-encoding of the observations. Since we don’t know the bounds of the observations beforehand, we use a squashing function (the sigmoid) to scale each observation into the (0, 1) range for CSDR conversion.

```
# Squashing function
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

```
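As a quick sanity check (standalone, not part of the tutorial’s code), the squashing maps any real-valued observation into (0, 1), monotonically and centered at 0.5:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))  # all values land in (0, 1), with sigmoid(0) = 0.5
```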

We then create the environment and grab the appropriate action and observation dimensions from it. We know the observation is only 4 values, so we could either pack it into a (2 x 2 x colSize) CSDR or a (1 x 4 x colSize) CSDR. We will use the latter for simplicity. The action is just 1 integer, so we just make a single-column CSDR for it where colSize = numActions.

```
# Create the environment
env = gym.make('CartPole-v1')

# Get observation size
numObs = env.observation_space.shape[0] # 4 values for Cart-Pole
numActions = env.action_space.n # N actions (1 discrete value)

# Squashing scale multiplier for observation
obsSquashScale = 1.0

# Define binning resolution
obsColumnSize = 32

```

Creating the hierarchy:

```
# Create the compute system
cs = pyogmaneo.ComputeSystem()

# Define layer descriptors: Parameters of each layer upon creation
lds = []

for i in range(2): # Layers with exponential memory. Not much memory is needed for Cart-Pole, so we only use 2 layers
    ld = pyogmaneo.LayerDesc()

    # Set the hidden (encoder) layer size: width x height x columnSize
    ld.hiddenSize = pyogmaneo.Int3(4, 4, 16)

    ld.pRadius = 4 # Predictor radius onto sparse coder hidden layer (and feed back)
    ld.aRadius = 4 # Actor radius onto sparse coder hidden layer (and feed back)

    ld.ticksPerUpdate = 2 # How many ticks before a layer updates (compared to previous layer) - clock speed for exponential memory
    ld.temporalHorizon = 4 # Memory horizon of the layer. Must be greater or equal to ticksPerUpdate

    lds.append(ld)

# Create the hierarchy: Provided with input layer sizes (a single column in this case), and input types (a single predicted layer)
h = pyogmaneo.Hierarchy(cs, [ pyogmaneo.Int3(1, numObs, obsColumnSize), pyogmaneo.Int3(1, 1, numActions) ], [ pyogmaneo.inputTypeNone, pyogmaneo.inputTypeAction ], lds)

```

Different from the prediction tutorial, we now have two CSDRs as input – the observation and the action, of inputTypeNone and inputTypeAction respectively. We don’t need predictions for the observation, so we leave that out. We also set the action radius (aRadius), which is a receptive field radius specific to action layers. The rest should be familiar from the previous tutorial.

We create our episode loop:

```
reward = 0.0

for episode in range(1000):
    obs = env.reset()

    # Timesteps
    for t in range(500):

```

The environment stops after 500 timesteps (the highest possible score, i.e. the longest balance time). We will perform 1000 episodes of training. At the beginning of each episode we reset the environment and retrieve the starting observation. We also keep track of the last reward received (initialized to 0 before the loop).

Next, we need to bin the observation into a CSDR. As mentioned in the previous tutorial, there are more elegant ways of doing this, but binning will be good enough for now. We will use the sigmoid function we defined earlier for squashing, and some simple numpy functions to get a list of integers (a CSDR).

```
        # Bin the 4 observations. Since we don't know the limits of the observation, we just squash it
        binnedObs = (sigmoid(obs * obsSquashScale) * (obsColumnSize - 1) + 0.5).astype(np.int32).ravel().tolist()

```
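To make the binning concrete, here is a small standalone check (with made-up observation values) of what that line produces for obsColumnSize = 32:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

obsColumnSize = 32
obsSquashScale = 1.0

obs = np.array([0.0, 1.0, -1.0, 2.0])  # example observation values

# Same expression as in the tutorial: squash, scale to column range, round, flatten
binnedObs = (sigmoid(obs * obsSquashScale) * (obsColumnSize - 1) + 0.5).astype(np.int32).ravel().tolist()

print(binnedObs)  # → [16, 23, 8, 27], one column index in [0, 31] per value
```

Note that 0.0 squashes to 0.5 and thus lands in the middle of the column (index 16), while large positive or negative values saturate towards the ends of the range.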

It’s time to step the hierarchy. Here we feed in 2 CSDRs, as we defined in our initialization: The observation, and the last action produced. We also enable learning and provide the reward signal.

```
        h.step(cs, [ binnedObs, h.getPredictionCs(1) ], True, reward)

```

We can now retrieve the action and step through the environment:

```
        # Retrieve the action; the hierarchy already automatically applied exploration
        action = h.getPredictionCs(1)[0] # First and only column

        obs, reward, done, info = env.step(action)

```

Cart-Pole is played “better” the longer the cart balances the pole, so we will just print the episode number and survival time for now. If you want to watch the pole balance graphically, you can call env.render() at the beginning of the timestep loop, after a certain number of episodes.

We will also redefine the reward from the default, which is a constant 1. The agent cannot learn from a constant stream of reward = 1, so we define the reward to be 0 until the episode ends, at which point the agent is punished (reward = -1). This encourages it to maximize the time between punishments, and thus maximize survival time.

```
        # Re-define reward so that it is 0 normally and then -1 if done
        if done:
            reward = -1.0

            print("Episode {} finished after {} timesteps".format(episode + 1, t + 1))

            break
        else:
            reward = 0.0

```

And that’s it! When run, you should see the survival time increase, until it caps out at the maximum of 500.

The full example code is available here.

### Mimicry/Imitation Learning Initialization for RL

While not applicable to the cart-pole example, it is often advantageous to start off a reinforcement learning agent with some prior knowledge. We recently demonstrated this with our “learning to walk faster” quadruped robot demo.

In order to initialize an agent with prior knowledge, we can use an additional flag in the hierarchy’s step function. The last parameter (unused in the cart-pole example; it defaults to false) is the “mimic” flag, which when true makes the action input layers temporarily function like prediction input layers.

```
h.step(cs, ..., learnEnabled, reward, mimic)

```

This is a sort of “passive” mode for the reinforcement learning, where the hierarchy observes the actions that you manually provide to it and associates the reward with them. Once pre-training is done, simply set mimic=False again, and then perform reinforcement learning as normal.
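Using the cart-pole hierarchy from above as an illustration, a pre-training phase might look like the following sketch (the demonstrationAction variable is hypothetical; it stands for whatever expert action stream you supply):

```
# Pre-training: show the hierarchy expert actions (mimic=True)
h.step(cs, [ binnedObs, demonstrationAction ], True, reward, True)

# Afterwards: normal reinforcement learning (mimic defaults to False)
h.step(cs, [ binnedObs, h.getPredictionCs(1) ], True, reward)
```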

This feature is great for giving prior knowledge to an agent, such as a kinematic model for a quadruped robot (as we did).

### Some Tools

If you want a quick-and-dirty automatic initialization of an OgmaNeo2 SPH on a Gym task, you may try the EnvRunner included in PyOgmaNeo2. It takes most Gym tasks and automatically creates a hierarchy and appropriate pre-encoders. See the CartPole_EnvRunner.py example for how to use it.

Finally, if you ever want to see what’s going on with the CSDRs and encoders in your OgmaNeo2 hierarchy, consider trying NeoVis. It provides a simple way to visualize the contents of an SPH, and generally requires only 2 extra lines of code in your application.

Until next time!