It’s been a long time coming, but we finally have proper reinforcement learning support for OgmaNeo2!
Along with this release, we have some new demos to share as well as some explanation of how the new reinforcement learning system works.
Before we get started with explanation and demos, here is the repository: https://github.com/ogmacorp/OgmaNeo2 (use the RL branch).
The system performs very well while maintaining the same practical benefits as the prediction-only version: fully online/incremental learning without forgetting, and very fast training and inference. While the demos we are going to share in this post do not fully explore all the benefits of the algorithm, they simply aim to show “cool stuff” you can do with OgmaNeo2.
So now for a rundown of the reinforcement learning system. If you need a refresher on how the prediction-only version of OgmaNeo2 works, see this slideshow presentation.
In order to add reinforcement learning capabilities to Sparse Predictive Hierarchies (SPH, the algorithm implemented by the OgmaNeo2 library), we have tried many, many different methods. Many performed “alright”, some not well at all. The theory always seemed sound, but sometimes that does not translate to practice. Here is a summary of the three main categories of algorithms we have tried:
Goal-based: The goal-based methods we tried are not really reinforcement learning, but rather a system for chasing certain target CSDRs (Columnar Sparse Distributed Representations, the state representation). The idea is to have each layer use some form of either:
- A transition matrix across CSDRs coupled with a pathfinding algorithm that produces joint state/action pairs that lead to a target goal state specified by the layer above’s prediction (or the user).
- An associative memory that correlates transitions with certain target goal states which are then recalled by layers above.
While these methods worked well on tasks like mazes, they have a fundamental drawback: They can only chase a “point” target state, rather than a distribution of states.
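The pathfinding idea in the first bullet can be sketched in a few lines. The following is a minimal illustration, not OgmaNeo2's actual implementation: it assumes the learned transition matrix has been reduced to a lookup table over discrete CSDR indices, and uses plain breadth-first search to recover an action sequence that reaches a goal state.

```python
from collections import deque

def plan_to_goal(transitions, start, goal):
    """Breadth-first search over a learned transition table.

    transitions: dict mapping state -> {action: next_state}, a stand-in
    for a per-layer transition matrix over CSDR indices.
    Returns the action sequence reaching `goal`, or None if unreachable.
    """
    frontier = deque([start])
    came_from = {start: None}  # state -> (prev_state, action_taken)
    while frontier:
        state = frontier.popleft()
        if state == goal:
            # Walk back to the start, collecting actions in reverse.
            actions = []
            while came_from[state] is not None:
                state, action = came_from[state]
                actions.append(action)
            return actions[::-1]
        for action, nxt in transitions.get(state, {}).items():
            if nxt not in came_from:
                came_from[nxt] = (state, action)
                frontier.append(nxt)
    return None

# Tiny toy chain: states 0..3, actions move right ("R") or left ("L").
T = {0: {"R": 1}, 1: {"R": 2, "L": 0}, 2: {"R": 3, "L": 1}}
print(plan_to_goal(T, 0, 3))  # ['R', 'R', 'R']
```

In the real system, the goal state would come from the layer above's prediction (or from the user at the top of the hierarchy), and the point-target limitation described above applies regardless of the search algorithm used.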
Routing-based: This one is based on an old idea I had described in a blog post several years ago, where one creates a sparse representation of some sort and then “routes” a regular deep learning-style neural network through it.
This method therefore does away with the decoder of the original SPH and keeps the encoder hierarchy (along with the exponential memory system). The encoders still learn unsupervised from the incoming data stream. Alongside the encoders is a deep linear neural network that mirrors the shape of the encoder hierarchy. During both activation and learning, this “routed” deep network operates only on the “on” units selected by the mirrored encoder hierarchy. The encoders therefore “route” the deep network.
The deep network is deliberately linear. It could be made nonlinear, but that is unnecessary: the encoders already partition the state space in a nonlinear fashion, and the routing carries that nonlinearity onto the deep linear network. Because it is linear, we can set it up so that it does not suffer from vanishing gradients (the “neurons” of the network average their inputs, and all weights initialize close to 1). Further, linearity lets us use the following reinforcement learning setup:
- The deep linear network produces value estimates at the “top” of the hierarchy.
- Actions are determined by simply backpropagating an error of 1 to the inputs (the bottom of the hierarchy), a section of which is dedicated to “action inputs”. Since the network is linear, backpropagating a positive error always selects the best action (a nonlinear network would require several “solving iterations” to achieve something similar).
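As a rough sketch of this setup (the sizes, names, and averaging scheme below are illustrative assumptions, not OgmaNeo2's actual implementation): a one-hidden-layer linear value network is masked by the routing, weights start near 1, and action selection is a single backward pass of an error of 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy routed linear value network: 8 inputs (the last 4 are "action
# inputs"), one hidden layer of 6 units, scalar value output.
# Weights start close to 1 and hidden units average their inputs,
# so gradients neither vanish nor explode.
n_in, n_hid, n_act = 8, 6, 4
W1 = 1.0 + 0.01 * rng.standard_normal((n_hid, n_in))
w2 = 1.0 + 0.01 * rng.standard_normal(n_hid)

# "Routing": the mirrored encoder hierarchy picks which hidden units
# are on for the current sparse state.
active_hidden = np.array([1, 0, 1, 1, 0, 1], dtype=float)

def value(x):
    h = active_hidden * (W1 @ x) / n_in  # averaging keeps activations O(1)
    return w2 @ h

def select_action(x):
    # Backpropagate an error of 1 from the value output to the inputs.
    # Because the network is linear, dv/dx is exact in one backward pass.
    grad_h = w2 * active_hidden / n_in
    grad_x = W1.T @ grad_h
    # Pick the action input whose increase raises the value estimate most.
    return int(np.argmax(grad_x[-n_act:]))

x = rng.standard_normal(n_in)
print(select_action(x))
```

The key point the sketch shows is that the gradient with respect to the inputs is computed in one pass and is exact, so a single argmax over the action-input gradients suffices; a nonlinear network would need iterative solving here.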
This method is the most complicated of the three described here. It is also not biologically plausible, as it uses backpropagation.
Swarm-based: The swarm-based approach uses a local reinforcement learning method, where the decoder of the SPH is replaced with a swarm of reinforcement learning agents that all seek to locally maximize the same reward.
Each swarm agent is the equivalent of a decoder output column in the original SPH. Currently we use a type of actor-critic algorithm to train the agents, as we found this works best in practice. The agents use Boltzmann exploration.
This system can be thought of as a self-organizing swarm of reinforcement learning agents bound by a kind of neural “scaffolding” provided by the encoder hierarchy.
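To make the idea concrete, here is a minimal sketch of a single swarm member: a tiny tabular actor-critic over one output column, with Boltzmann (softmax) exploration and a TD update driven by the shared global reward. The class name, hyperparameters, and update rule are illustrative assumptions, not OgmaNeo2's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

class ColumnAgent:
    """One swarm member: an actor-critic over one output column."""

    def __init__(self, n_actions, lr=0.1, gamma=0.95, temperature=1.0):
        self.logits = np.zeros(n_actions)  # actor preferences (one per cell)
        self.value = 0.0                   # critic's value estimate
        self.lr, self.gamma, self.temp = lr, gamma, temperature
        self.last_action = None

    def act(self):
        # Boltzmann exploration: sample a cell from a softmax over logits.
        p = np.exp(self.logits / self.temp)
        p /= p.sum()
        self.last_action = int(rng.choice(len(p), p=p))
        return self.last_action

    def learn(self, reward, next_value):
        # TD error from the globally shared reward; every agent in the
        # swarm receives the same reward signal.
        td = reward + self.gamma * next_value - self.value
        self.value += self.lr * td
        self.logits[self.last_action] += self.lr * td

# Toy check: action 2 always pays off, the others never do.
agent = ColumnAgent(n_actions=4)
for _ in range(500):
    a = agent.act()
    agent.learn(reward=1.0 if a == 2 else 0.0, next_value=agent.value)
print(int(np.argmax(agent.logits)))  # action 2 should dominate
```

In the full system there is one such agent per output column, all receiving the same reward, with the encoder hierarchy's sparse activity gating which agents are active at each step.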
Of the three methods described, the swarm-based approach currently performs best overall, and it is the one the demos use.
Lunar Lander
Let’s start with something simple. Lunar Lander is an OpenAI Gym environment where the agent must learn to land a spaceship.
Using our system, we achieve an average per-episode score of around 100 by roughly episode 1000. Training longer (~3000 episodes) eventually yields an average score of around 200.
Minitaur
Here, we train a simulated version of the Ghost Robotics Minitaur robot included in the PyBullet library. The goal is to move as fast as possible. While its gait doesn’t look very natural, it goes fast enough to consistently fall off the edge of the map.
Real-World Mini-Sumo Robots
Now for the big finale! We trained some real mini-sumo robots to play against each other. The robots are tracked using a camera, and the coordinates are encoded and fed to two OgmaNeo2 agents. Each agent receives a negative reward when it goes out of the arena, and a positive reward if its opponent goes out (weighted by the distance to the opponent when it loses). Both also receive a small shaping reward for moving closer to each other, to speed up learning.
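The reward structure described above can be sketched as a small function. The constants, weighting scheme, and normalization below are assumptions for illustration; the demo's actual values are not given in this post.

```python
import math

def sumo_reward(my_pos, opp_pos, arena_radius, i_am_out, opp_is_out,
                out_penalty=1.0, win_reward=1.0, approach_bonus=0.01):
    """Illustrative reward shaping for one sumo agent."""
    dist = math.dist(my_pos, opp_pos)
    closeness = max(0.0, 1.0 - dist / (2.0 * arena_radius))  # 1 = touching
    r = 0.0
    if i_am_out:
        r -= out_penalty
    if opp_is_out:
        # Winning pays more if we were close to the opponent as it fell out.
        r += win_reward * closeness
    # Small shaping term encouraging the robots to close the distance.
    r += approach_bonus * closeness
    return r

# Opponent pushed out while we were right next to it: large positive reward.
print(sumo_reward((0.0, 0.0), (0.1, 0.0), 1.0, False, True))
```

Shaping terms like the approach bonus are kept small relative to the terminal rewards so they accelerate early learning without changing which policies are ultimately optimal.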
The robots also automatically reset themselves when a game (episode) terminates: they find their way back to their starting points, maneuvering around each other if their paths are blocked. This makes the game run largely without human interference (although sometimes a robot gets pushed out of camera range).
That wraps it up for this update; we have more demos in the works. Here’s a sneak peek!