Hello!
We have been working hard on OgmaNeo, and in the latest version (v1.3) we introduced something we call “Exponential Memory”. We believe this is an important step forward that applies not only to what we already have in OgmaNeo, but also to Deep Learning in general.
In simplest terms, Exponential Memory (EM) is a form of spatio-temporal ladder network in which each additional layer multiplies the memory horizon of the algorithm while incurring an exponentially lower performance cost. EM views the world at multiple timescales and forms predictions at each level, which are used to inform the faster predictions of the lower layers.
This may sound like DeepMind’s WaveNet, but it is not – WaveNet is feed-forward (not a ladder), requires all previous inputs to be recomputed every step, and does not run higher layers at slower clock speeds (leading to well-known performance issues).
EM is an alternative to standard recurrent neural networks. It only works on ladder-like architectures.
So how do we implement the above in terms of a ladder network? We already have a hierarchy in space when using a standard ladder network, but we do not have a hierarchy in time. To create a hierarchy in time, we can slow down higher layers with some multiplier. Each layer “clocks” at some multiple of the previous layer's period, typically 2. With a multiplier of 2 and equally sized layers, each layer therefore requires half the processing time of the previous one. Each layer also takes in multiple timesteps of input now – for the common case of 2 steps per clock, a layer takes in 2 or more sequential inputs before it updates. The result is that the higher a layer sits in the hierarchy, the slower it clocks (and therefore the less updating/processing it requires), and the more time it covers. We call this “striding”.
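To make the clocking concrete, here is a minimal Python sketch of striding. It is not the OgmaNeo implementation: the function name `simulate_striding` and the simple counters are assumptions made for illustration, and the layers do no actual learning. It only counts how often each layer would update if every layer above the first waits for `stride` outputs from the layer below before it clocks.

```python
def simulate_striding(num_layers, num_steps, stride=2):
    """Count updates per layer when each layer above the first waits for
    `stride` outputs from the layer below before it clocks.

    Hypothetical helper for illustration only; no learning happens here.
    """
    pending = [0] * num_layers  # outputs from below waiting at each layer
    updates = [0] * num_layers  # how many times each layer has clocked

    for _ in range(num_steps):
        layer = 0
        while layer < num_layers:
            if layer > 0:
                # Receive one output from the layer below.
                pending[layer] += 1
                if pending[layer] < stride:
                    break  # not enough inputs yet, so this layer stays idle
                pending[layer] = 0
            updates[layer] += 1  # this layer clocks on this step
            layer += 1
    return updates


if __name__ == "__main__":
    counts = simulate_striding(num_layers=4, num_steps=16)
    for i, c in enumerate(counts):
        print(f"layer {i}: {c} updates")
```

With a stride of 2, each layer in this sketch clocks half as often as the one below it, which is where the halving of processing time per layer comes from.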
The previous paragraph describes the “up” pass of a ladder network – but what about the “down” pass (prediction)? To form predictions, each layer needs to predict while taking feedback from the layer above into account. Fortunately, this can be done elegantly in EM without any sort of backpropagation. Each layer not only takes in several timesteps of input, but also produces several timesteps of predictions. So for feedback, we simply select the prediction from the higher layer that corresponds to the current clock of the lower layer. We call this process “destriding”.
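Destriding can be sketched in a few lines of Python as well. This is an illustration under assumptions, not OgmaNeo code: the name `destride` is hypothetical, the higher layer is assumed to emit exactly `stride` predictions per clock, and the slot is assumed to be chosen by the lower layer's tick modulo the stride. The real alignment between layers may differ.

```python
def destride(higher_predictions, lower_tick, stride=2):
    """Select the higher layer's prediction that lines up with the lower
    layer's current tick.

    Assumption: the higher layer produced one prediction per lower-layer
    tick in its window, ordered from earliest to latest.
    """
    if len(higher_predictions) != stride:
        raise ValueError("expected one prediction per lower-layer tick")
    return higher_predictions[lower_tick % stride]


if __name__ == "__main__":
    # Placeholder strings stand in for real prediction vectors.
    window = ["prediction for first tick", "prediction for second tick"]
    for tick in range(4):
        print(tick, "->", destride(window, tick))
```

The point of the sketch is that feedback is a simple lookup: no gradients flow between layers, and the lower layer just reads the slot that matches its own clock.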
EM is a bit difficult to visualize, but we had a go at a diagram to aid in understanding it:
So what are the implications of EM?
For starters, it allows one to maintain a history of inputs to predict from that is exponentially large with respect to the number of layers. This means that we can remember information over very long stretches of time. If my calculations are correct, with a stride of 2 and one input per second, 64 layers would give a memory horizon of 2^64 steps, roughly 42 times the number of seconds in the entire history of our universe! On top of that, the total cost can never exceed 2 times the cost of the first layer, given that we are striding by 2 and each layer is the same size: the per-layer costs form the series 1 + 1/2 + 1/4 + ..., which always stays below 2.
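Both numbers are easy to check. The short Python sketch below assumes one input per second, an age of the universe of about 13.8 billion years, and equally sized layers with a stride of 2. The helper names `memory_horizon` and `relative_cost` are made up for this example. Exact fractions are used for the cost because a floating-point sum would round to exactly 2.0.

```python
from fractions import Fraction

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year
UNIVERSE_AGE_YEARS = 13.8e9            # approximate value, an assumption


def memory_horizon(num_layers, stride=2):
    """Number of base timesteps covered when every layer strides."""
    return stride ** num_layers


def relative_cost(num_layers, stride=2):
    """Total cost in units of the first layer's cost, assuming equally
    sized layers where layer i clocks once every stride**i steps."""
    return sum(Fraction(1, stride ** i) for i in range(num_layers))


universe_seconds = UNIVERSE_AGE_YEARS * SECONDS_PER_YEAR
ratio = memory_horizon(64) / universe_seconds

if __name__ == "__main__":
    print(f"horizon / age of universe in seconds: {ratio:.1f}")
    print(f"total cost relative to layer 0: {float(relative_cost(64)):.6f}")
    print("cost strictly below 2:", relative_cost(64) < 2)
```

The cost bound holds for any number of layers, since the geometric series with ratio 1/2 sums to 2 minus the last term.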
It would of course make more sense to only stride every few layers, to get more layers per timescale. Still, EM should provide large savings in processing time (hopefully, depending on the task).
EM is currently implemented in our OgmaNeo framework. We are still making improvements, but we can already recall information for very large time gaps, and run the thing on a Raspberry Pi 3!
We are also looking at implementing a version of EM using more standard Deep Learning tools such as convolutional autoencoders with backpropagation.
Until next time!
Fascinating reads about the Feynman Machine and EMs. You have a paper on EMs as well?