Acting without Rewards

Hello,

While we continue to improve our reinforcement learning (RL) on two new demos, here is some information on what else we have tried, aside from RL, for performing tasks with agency. For now, regular old RL performs better than what I am about to describe, but perhaps this technique will become useful at some point.

Some of this post assumes some basic familiarity with OgmaNeo. See this presentation for a quick primer!

This new technique came from the observation that animals are not “pure” reinforcement learners. Reinforcement learning seems to be the “final decision maker”, but most of what the brain does, even for agency, seems to be unsupervised. We seem to be “information sponges” that learn from everything, even if not useful for maximizing a reward (at the moment). How can we efficiently learn from everything, and then use that information to achieve some (now unknown) task later on?

A solution we have been working on with some promising (but not state of the art) results is something we call “unsupervised behavioral learning”.

The basic idea behind unsupervised behavioral learning (UBL) is that any interaction an agent has with the environment can be “clustered” into something that it can perform repeatably. Given some set of state-action transitions, we can remember what led to what, and do that same thing again if the “task vector” those transitions were associated with is presented again.

This may sound similar to hindsight experience replay (HER). However, it is not: there is no replay, and we don’t use Q-learning or SARSA at all. Further, unlike HER, we do not optimize for reward internally. Goals do not need to be generated during training, and there is no restriction on the training regime (it doesn’t need to be episodic). The job of selecting which behavior to perform for the desired task is left to other systems.

So how does UBL work? Let’s take a very high-level look.

UBL is based on Sparse Predictive Hierarchies (SPH, as implemented in OgmaNeo). However, it does not necessarily have to be implemented in this framework; SPH just has some very convenient features for it.

A UBL system is composed of layers. A high-level diagram of such a layer is shown above. The operation of a layer works as follows:

Take in the state vector.
Form some representation (compressed) of the state vector.
If learning, take the next representation (found by waiting one timestep) and associate the current (t) representation vector, the next (t + 1) representation vector, and whichever action occurred at (t).
If inferring, recall the action associated with the representation vector (t) and the task vector (which takes the place of the (t + 1) representation vector used during learning).

The key here is that when learning, we just associate a transition (representation vector (t) to representation vector (t + 1)) with whatever action it took. This action is typically just the result of the inference from the previous timestep, with some added exploration.
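To make the learn/infer steps concrete, here is a minimal tabular sketch. It is not the OgmaNeo implementation: the class name `TabularUBLLayer`, the use of plain integers as representations, and the count-based association are all assumptions made for illustration. It only shows the core idea of storing (representation (t), representation (t + 1)) → action during learning, and substituting a task vector for the (t + 1) representation at inference.

```python
from collections import Counter, defaultdict


class TabularUBLLayer:
    """Toy associator (hypothetical): counts which action accompanied
    each observed (rep_t, rep_t1) transition."""

    def __init__(self):
        # Maps (rep_t, rep_t1) -> Counter of actions seen for that transition.
        self.counts = defaultdict(Counter)

    def learn(self, rep_t, rep_t1, action):
        # Associate the transition with whatever action was taken at t.
        self.counts[(rep_t, rep_t1)][action] += 1

    def infer(self, rep_t, task):
        # The task vector stands in for the (t + 1) representation.
        seen = self.counts.get((rep_t, task))
        if not seen:
            return None  # this transition was never observed
        return seen.most_common(1)[0][0]


# Illustrative data: a 1-D corridor where the representation is the
# position and the action is a step of -1 or +1.
layer = TabularUBLLayer()
for rep_t, action, rep_t1 in [(0, +1, 1), (1, +1, 2), (2, -1, 1), (1, -1, 0)]:
    layer.learn(rep_t, rep_t1, action)

print(layer.infer(0, 1))  # action that previously led from 0 to 1
print(layer.infer(0, 2))  # None: no single-step transition from 0 to 2
```

Note that a single layer like this can only recall one-step transitions. Reaching a target several steps away is what the hierarchy is for.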

Now what is interesting is that, since we learn solely from state transitions (“off policy”), we can take the agent through any trajectory, record the resulting representation (t), and then do “more of that” by using it as the task vector.

You may be wondering: How does this work when something requires multiple steps to achieve? Well, the solution we found is to use the exponential memory from OgmaNeo2. This allows each layer to address slower and slower timescales. Each layer then passes its action(s) as the “task vector” to the layer below. So, one must only read and specify the task vector at the top of the hierarchy, which transitions very slowly and covers many timesteps (long behaviors).
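As a rough illustration of how layers can address slower and slower timescales, the sketch below assumes each layer updates once for every `ticks_per_update` updates of the layer beneath it, so layer i ticks every `ticks_per_update ** i` timesteps. The function name, the parameter name, and the default of 2 are assumptions for this example, not a description of the actual OgmaNeo2 scheduling code.

```python
def layers_updating(t, num_layers, ticks_per_update=2):
    """Return the indices of layers that update at timestep t, assuming
    layer i is clocked once every ticks_per_update ** i timesteps."""
    return [i for i in range(num_layers) if t % (ticks_per_update ** i) == 0]


def steps_covered(layer_index, ticks_per_update=2):
    """Number of base timesteps spanned by one update of the given layer."""
    return ticks_per_update ** layer_index


# Show which layers are active over the first 8 timesteps of a 3-layer stack.
for t in range(8):
    print(t, layers_updating(t, num_layers=3))
```

Under this assumption, a task vector set at the top of a 3-layer stack stays in effect for 4 base timesteps, while the bottom layer receives a fresh task vector from the layer above it every 2 steps.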

In OgmaNeo2, we map the concept of the state, representation, action, and task vectors to CSDRs. The associator is essentially the decoder portion of the OgmaNeo2 layer, and the representation vector learning is handled by the encoder. So really, all we need to change from a regular SPH is to make it learn off-policy by not learning off of feedback (task) but only state transitions. Of course, several additional tricks were necessary to get this working at all (these will be described in the future once further refined), but the resulting system can perform some interesting tasks.

How is the task vector found? Well, since it mirrors the representation vector, we can read out the state of the highest layer and learn which state leads to the most reward, for instance (if one wants to adapt this to perform RL-style tasks). For some environments, though, we can avoid using a reward altogether, when the task is known.
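One simple way to learn which top-layer state leads to the most reward is to keep a running average of reward per observed state and pick the best one as the task vector. The sketch below is an assumption about how this could be done, not the method we used: `TaskSelector` is a hypothetical helper, and top-layer states are stood in for by hashable tuples of column indices.

```python
from collections import defaultdict


class TaskSelector:
    """Hypothetical helper: tracks mean reward per top-layer state and
    returns the state with the highest mean as the task vector."""

    def __init__(self):
        self.total = defaultdict(float)  # summed reward per state
        self.visits = defaultdict(int)   # number of observations per state

    def observe(self, top_state, reward):
        self.total[top_state] += reward
        self.visits[top_state] += 1

    def best_task(self):
        if not self.visits:
            return None  # nothing observed yet
        return max(self.visits, key=lambda s: self.total[s] / self.visits[s])


selector = TaskSelector()
selector.observe((0, 3, 1), 0.0)
selector.observe((2, 2, 0), 1.0)
selector.observe((2, 2, 0), 0.0)
selector.observe((1, 0, 3), 0.25)

# (2, 2, 0) has the highest mean reward (0.5), so it becomes the task vector.
print(selector.best_task())
```

The reward only influences which task vector is chosen. The layers themselves still learn from state transitions alone.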

The maze was our first test. We first train the agent by taking random actions in a maze. It therefore learns completely off-policy. After the training is complete, we teleport the agent to some target location, have it sit there for a bit (since the representation vector at the top of the hierarchy covers several steps of behavior), and then record that representation. We then can teleport the agent back to where it was previously, and the agent will find its way to the target location. Since it is doing this based on visual information alone, it may get stuck in locations that look similar to the target one.

Simple maze task. Green – walls, blue – visual landmarks, white – player, red box – field of view.

However, it is interesting that this works at all – there is no “discounting”, pathfinding search, or other well-known mechanism to tell it which paths are shorter. Rather, the exponential memory and choice of task vector handles this – and as a result, it is able to pathfind reasonably well. It is a bit noisy, but this seems to be specific to the way we implemented it.

We feel this is a class of algorithms worthy of further investigation. While not amazing now, the idea clearly does work. There are still many tricks that can be employed to improve the result.

That’s all for now!
