In a previous post, I described an alternative to reinforcement learning (RL) called Unsupervised Behavioral Learning (UBL). In short, instead of maximizing a reward signal, a UBL agent acts to match its current state to some goal state (which is spatio-temporal).
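To make the idea concrete, here is a toy Python sketch of the UBL objective. It is purely illustrative: all names are made up (this is not the AOgmaNeo API), and it substitutes brute-force search for the learned predictions a real SPH hierarchy would use.

```python
from collections import deque
from typing import Tuple

CSDR = Tuple[int, ...]  # one active cell index per column

def mismatch(a: CSDR, b: CSDR) -> int:
    # Column-wise disagreement between two CSDRs.
    return sum(x != y for x, y in zip(a, b))

def plan_to_goal(start: CSDR, goal: CSDR, actions, predict):
    # Toy stand-in for the UBL objective: find an action sequence that
    # drives the state until it matches the goal CSDR (zero mismatch).
    # A real SPH hierarchy uses learned predictions, not brute-force search.
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if mismatch(state, goal) == 0:
            return path
        for a in actions:
            nxt = predict(state, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [a]))
    return None

if __name__ == "__main__":
    # Toy 1-D world: a single-column CSDR holding the robot's position.
    def predict(state: CSDR, action: int) -> CSDR:
        return (max(0, min(9, state[0] + action)),)

    print(plan_to_goal((2,), (7,), actions=(-1, 1), predict=predict))
    # -> [1, 1, 1, 1, 1]
```

The point of the toy is only the objective: there is no reward anywhere, just a distance to a goal state that the agent tries to drive to zero.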
We have decided to return to the idea with a new real-world demonstration. This latest iteration of UBL is built on AOgmaNeo, the most up-to-date implementation of Sparse Predictive Hierarchies (SPH) as of this writing. Along with the all-around performance improvements AOgmaNeo brings, we also updated the UBL algorithm itself. We are still working out the best version, but we already have some interesting results to share.
In the video below, we trained the latest version of our “World’s Smallest Self-Driving Car” (v4) to act as a “rat” in a simple cardboard T-maze. The walls have colored markings, which, aside from helping the robot identify its location, also mark the goal locations. After driving the robot around the maze by hand semi-randomly, we can place it at a recognizable landmark (in front of a colored marking) and press a button to save the top-level state of the hierarchy as the goal state (a CSDR). If you don’t know what a CSDR is, we recently made a guide for the regular edition (master branch) of AOgmaNeo. The rat robot can then be moved to some other location in the maze, and it will try to return to the goal state. As usual with SPH, all processing happens on-board the Pi Zero, including online training. The only sensor the rat uses is a small fish-eye camera.
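For those curious what this record-then-seek workflow looks like in code, here is a simplified sketch. Every name in it (StubHierarchy, act_toward, and so on) is a placeholder for illustration, not the actual AOgmaNeo interface.

```python
import random

class StubHierarchy:
    """Placeholder for the on-board SPH hierarchy; this is NOT the real
    AOgmaNeo interface, just enough structure to show the workflow."""

    def __init__(self, columns=4, cells=16):
        self.columns, self.cells = columns, cells
        self.top = tuple([0] * columns)

    def step(self, camera_frame, action, learn=True):
        # A real hierarchy would encode the fish-eye frame plus the action
        # and train online; here we just randomize the top-level CSDR.
        self.top = tuple(random.randrange(self.cells) for _ in range(self.columns))

    def top_level_csdr(self):
        return self.top

    def act_toward(self, camera_frame, goal):
        # A real hierarchy would generate the action that moves its
        # top-level state toward the goal CSDR; we pick at random.
        return random.choice(("left", "right", "forward", "stop"))

def run(h, frames, operator_actions, button_at):
    goal = None
    for t, frame in enumerate(frames):
        if goal is None:
            h.step(frame, operator_actions[t])  # manual driving, learning online
            if t == button_at:                  # button press: snapshot the goal
                goal = h.top_level_csdr()
        else:
            h.step(frame, h.act_toward(frame, goal))  # goal-seeking mode

if __name__ == "__main__":
    run(StubHierarchy(), frames=[None] * 20,
        operator_actions=["forward"] * 20, button_at=9)
```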
Since goal states come from the top-most CSDR in the SPH hierarchy, they are spatio-temporal. This means they can capture not just static states but entire behaviors as well. In this case, that feature isn’t really needed, since we just want the robot to sit still in front of the colored markings. We do, however, have to make sure the robot has been sitting still for a bit when we capture the goal CSDR (just a few seconds is enough).
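A small guard like the following captures that idea (reusing the StubHierarchy placeholder from above; the step count and control rate are illustrative, not our actual settings):

```python
STILL_STEPS = 30  # e.g. ~3 seconds at 10 Hz; illustrative values only

def maybe_save_goal(hierarchy, recent_actions):
    # Only snapshot the goal CSDR once the robot has been stationary long
    # enough that the (spatio-temporal) top-level state encodes "sitting
    # still at this landmark" rather than the motion leading up to it.
    recent = recent_actions[-STILL_STEPS:]
    if len(recent) == STILL_STEPS and all(a == "stop" for a in recent):
        return hierarchy.top_level_csdr()
    return None  # not still long enough; keep waiting
```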
We are also working on an adapter for using UBL as a regular RL agent. We will likely base it on Bandit Swarm Networks (BSN), as these are good at finding rewarding static configurations such as goal states. Hopefully that will be working properly by the next blog post!
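To sketch how the bandit side might look, here is a simplified toy in which each CSDR column is an epsilon-greedy bandit and the swarm jointly searches for a rewarding goal configuration, which UBL would then be responsible for actually reaching. This is an illustration only, not our actual BSN implementation:

```python
import random

class BanditColumn:
    """One epsilon-greedy bandit per CSDR column; its arms are cell indices."""

    def __init__(self, cells, eps=0.1, lr=0.1):
        self.values = [0.0] * cells
        self.eps, self.lr = eps, lr
        self.choice = 0

    def greedy(self):
        return max(range(len(self.values)), key=self.values.__getitem__)

    def select(self):
        self.choice = (random.randrange(len(self.values))
                       if random.random() < self.eps else self.greedy())
        return self.choice

    def update(self, reward):
        # Move the chosen arm's value estimate toward the observed reward.
        self.values[self.choice] += self.lr * (reward - self.values[self.choice])

class BanditSwarm:
    """A swarm proposing goal CSDRs; the shared reward trains every column."""

    def __init__(self, columns, cells):
        self.columns = [BanditColumn(cells) for _ in range(columns)]

    def propose_goal(self):
        return tuple(c.select() for c in self.columns)

    def update(self, reward):
        for c in self.columns:
            c.update(reward)

if __name__ == "__main__":
    random.seed(0)
    target = (3, 1, 4)  # hidden "rewarding" configuration to discover
    swarm = BanditSwarm(columns=3, cells=8)
    for _ in range(2000):
        reward = -sum(g != t for g, t in zip(swarm.propose_goal(), target))
        swarm.update(reward)
    print(tuple(c.greedy() for c in swarm.columns))  # typically (3, 1, 4)
```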
Until next time!