A respectful critique of the Hierarchical Temporal Memory (HTM) concept

I've been away from this too long, distracted by other things in my life. I've missed it. Lately, I've been finding myself getting excited again to the point of getting distracted from those other things and back in this world.

The most interesting development in the world of artificial intelligence of late, to my thinking, is the recent release of Numenta's Hierarchical Temporal Memory algorithm, largely the brainchild of Dileep George and inspired by Jeff Hawkins, author of On Intelligence. Having been so disappointed by artificial neural networks, expert systems, and various other "traditional" approaches to AI, I found the ideas presented by Hawkins refreshing and exciting, so I joined Numenta's mailing list and eagerly awaited the arrival of its promised products.

Now that the NuPIC platform and related tools have been released, Numenta has also authored various white papers on how it actually works. In refreshing contrast to the mind-numbing gibberish of some proprietary systems' (e.g., PILE's) white papers and math-heavy tomes on Bayesian networks and neural networks, these documents present a clearly understandable description of what HTMs actually do and how they do it. The one I found most penetrating was coauthored by Dileep George and titled The HTM Learning Algorithms. So far, this is the best document I have read on the subject, though admittedly, it helps to be familiar with the HTM concept at a high level.

I am about halfway through reading this 44-page PDF. I had to stop, in part because my brain couldn't focus on it any more, distracted as I am by my own work and, frankly, inspired by what I've found in this document. I finally "get it": how an HTM learns, which I've been missing for the whole time I've been aware of HTMs. But to my surprise, I've already formed some troubling questions in the process that I want to document before I forget. I want to pose them here to help further the discussion of the value of HTMs and perhaps promote their improvement.

Section 4 describes how an HTM node is exposed to a continuously changing stream of data and learns to recognize "causes". In this example, however, there are very tight constraints. The application used is called "Pictures" and involves learning to recognize pure black and white line drawings of simple symbols like letters and coffee cups. This section focuses on learning in the first layer, in which each HTM node can see a 4x4 grid of B&W pixels. The sample drawings used are all composed of very simple elements like vertical or horizontal lines, "L" joints, "T" joints, "Z" folds, and line ends. In order to make sure the HTM properly learns to recognize these constructs in many situations, this HTM is exposed to examples of each in many positions in its 4x4 visual field. This is done by showing it (and all the other HTMs in this level) "movies" of the archetype drawings moving in various directions and at different scales (zoom factors).

Now, I know it's important to reduce a general problem to a narrower problem in order to help test, quantify, and explain a concept. So I'm willing to suspend a little skepticism. But as I read on about the nuts and bolts, this came back to bug me again. In order to learn to recognize that many variations of a pattern all represent the same pattern, HTMs rely critically on a temporal component for learning. Let's say in moment T1, the node is exposed to a picture of an "L" joint and in moment T2, it's the same L joint, but shifted to the right one pixel. The fact that these two distinct patterns were seen in adjacent time steps suggests they have the same "cause" and so get lumped in together. Later, when the HTM sees either of these two versions of the "L" joint, they will report it as the same thing, which is super cool.

But here's one problem. Before an HTM can even begin noticing that the two "L" joint patterns appear one after the other, it's necessary for the HTM to undergo a "long" learning process just to recognize the distinct patterns, which here are called "quantization points". In the learning process, the HTM is exposed to a long series of these "movies" of all the sample images moving around relative to the HTMs. In that process, every unique pixel pattern a level 1 HTM is exposed to is recorded before it moves on to learning which ones are related to one another. Every single pattern! Now, with a 4x4 black and white grid, there may be as many as 2^(4x4), or 65,536, unique patterns. Since the source data fed into this program is limited to these very clean, rectilinear patterns, the actual number of unique quantization points recorded in this first phase is only 150. If there were curves, different angles, and "dirt" in the source images, the number would clearly be much higher. Honestly, this leaves a bad taste in my mouth; requiring a system to gather every distinct example from its rich source data before it can even begin classifying things strikes me as neither a good prerequisite nor a resource-responsible one.
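To make this first phase concrete, here is a minimal sketch of what "recording every unique pattern" amounts to. The function and variable names are my own invention, not Numenta's code, and the toy "movie" is just a vertical bar sweeping across a 4x4 field:

```python
# Hypothetical sketch of the first learning phase: memorize every
# distinct 4x4 input pattern ("quantization point") seen in the movies.

def collect_quantization_points(frames):
    """frames: iterable of 4x4 binary patterns, each a tuple of 16 bits."""
    points = []   # ordered list of unique patterns
    index = {}    # pattern -> quantization point id
    for frame in frames:
        if frame not in index:
            index[frame] = len(points)
            points.append(frame)
    return points, index

def bar_frame(col):
    """A 4x4 frame containing a vertical bar at the given column."""
    return tuple(1 if c == col else 0 for r in range(4) for c in range(4))

# Toy movie: the bar sweeps right one pixel per time step.
movie = [bar_frame(c) for c in range(4)]
points, index = collect_quantization_points(movie)
print(len(points))  # 4 distinct quantization points from this toy movie
```

Note that nothing here is learned in any interesting sense; it is exhaustive memorization, which is exactly what makes me doubt it scales to dirtier data.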

Now, one of the points of an HTM in this Pictures application is that it can learn to recognize that all "L" joints are the same thing without any prior knowledge of that. The key ingredient in the HTM recipe is this temporal coincidence. So once all 150 distinct mini-patterns, or quantization points, have been identified by watching the source images moving around in various directions against the field of view, the next step is to construct a 150 x 150 matrix initialized with all zeros. The rows and columns both represent each of the quantization points, but one represents seeing one in T1 and the other represents seeing it in T2. So let's say quantization point Q1 represents an "L" joint and Q2 represents another "L" joint shifted one pixel to the right of Q1. As the movie progresses from T1 to T2, we find the cell in the matrix where row Q1 and column Q2 meet and we add 1 to it. After a lot of this process, we end up with a matrix that has very high numbers in a few cells, representing lots of coincidences of quantization points in time (like our two "L" joints), while a large portion of the matrix remains zero. The reason for doing this is that there must be some way to say that Q1 and Q2 are related; that's the point of an HTM, and coincidence in time seems a good way.
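The matrix-building step above can be sketched in a few lines. Again, this is my own simplified reading of the paper, not Numenta's implementation; the sequence below stands in for the stream of quantization point ids observed frame by frame:

```python
# Sketch of building the time-adjacency matrix: count transitions
# between quantization points seen in consecutive time steps.

def build_time_adjacency(qp_sequence, n):
    """qp_sequence: quantization point ids seen at T1, T2, ...
    n: total number of quantization points."""
    matrix = [[0] * n for _ in range(n)]
    for prev, curr in zip(qp_sequence, qp_sequence[1:]):
        matrix[prev][curr] += 1   # row = point at T1, column = point at T2
    return matrix

# Toy sequence: point 0 is often followed by point 1 (think of an "L"
# joint and its one-pixel-right shift), so cell [0][1] accumulates.
seq = [0, 1, 0, 1, 2, 0, 1]
m = build_time_adjacency(seq, 3)
print(m[0][1])  # 3 transitions from point 0 to point 1
```

The high-count cells are the "these two patterns share a cause" signal that the grouping step exploits next.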

An HTM has a finite number of outputs, each of which represents a "cause". The developer gets to decide the number. The more there are, in theory, the more nuanced the known causes can be. The next step of the learning process, then, is to decide what those causes are. Let's say, for example, there can be at most 10 "causes" that can be output. The 150 quantization points each get assigned to one of these 10 causes in a process that's a bit hard to understand. It's probably best to read section 4.2.2, "Forming temporal groups by partitioning the time-adjacency matrix", for a precise explanation. But in summary, the algorithm starts at the quantization point that has the highest number of temporal connections (as represented in the 150 x 150 matrix) to others and follows the really strong connections to other quantization points, lumping them together into one group. In theory, the connections branching out become sufficiently weak that the algorithm stops following them. Then it moves on to the next remaining quantization point with the highest value in the matrix and continues (ignoring all quantization points that have already been grouped). This continues until either all quantization points with connections above a certain threshold are exhausted or we run out of groups (our maximum of 10 causes). The authors point out that this is not the only way to do grouping, but it's a pretty ingenious way to quickly allocate causes.
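Here is a rough sketch of that greedy grouping idea. This is my simplified reading of section 4.2.2, not the paper's exact algorithm; in particular, the symmetrization of the matrix and the flood-fill over a threshold are assumptions on my part:

```python
# Greedy temporal grouping sketch: repeatedly seed a group at the
# most strongly connected ungrouped point, then absorb neighbors
# whose connection strength exceeds a threshold.

def group_quantization_points(matrix, max_groups, threshold):
    n = len(matrix)
    # Symmetrize: treat Q1->Q2 and Q2->Q1 transitions as one strength.
    strength = [[matrix[i][j] + matrix[j][i] for j in range(n)] for i in range(n)]
    ungrouped = set(range(n))
    groups = []
    while ungrouped and len(groups) < max_groups:
        # Seed: the ungrouped point with the highest total connection strength.
        seed = max(ungrouped, key=lambda i: sum(strength[i]))
        group, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            for j in list(ungrouped - group):
                if strength[i][j] >= threshold:   # strong link: same "cause"
                    group.add(j)
                    frontier.append(j)
        ungrouped -= group
        groups.append(sorted(group))
    return groups

# Toy matrix: points 0 and 1 strongly co-occur in time, as do 2 and 3.
m = [[0, 5, 0, 0],
     [5, 0, 1, 0],
     [0, 1, 0, 4],
     [0, 0, 4, 0]]
print(group_quantization_points(m, max_groups=10, threshold=3))
```

With the toy matrix, the strong 0-1 and 2-3 links form two groups, while the weak 1-2 link falls below the threshold and is ignored, which is the "connections get sufficiently weak that the algorithm stops following them" behavior described above.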

This learning algorithm is truly ingenious. I love it. And yet it bothers me, too. For one thing, this specific algorithm only cares about the coincidence of patterns from one discrete moment to the next. For another, its performance seems to rely very strongly on tight constraints on the data. As the data is allowed to become less constrained -- going from perfect right angle lines to allowing curves, allowing thicker lines, allowing dirty data, rotating in 3D, allowing grey scales or colors, and so on -- the number of quantization points and time to learn must grow exponentially. "Real" data would probably quickly deluge such a system as this with quantization points.

I'm especially bothered by the fact that each HTM requires an exhaustive learning period in which it discovers all its quantization points before it moves on to start learning how they are causally related. Then it requires another exhaustive learning period in which it discovers all the two-moment temporal relations among quantization points before it moves on to grouping the quantization points -- distinct input patterns -- into proximal causes, which are then the main output of an HTM.

Further, while I recognize the value of showing a picture of a cat in many different "orientations" using these movies as a proxy for seeing lots of actual cats, I'm bothered by the idea that the movies are required for this algorithm to learn about cats. I would think that an algorithm that learns to distinguish cats as a group should be able to see lots of single, still pictures of animals of all sorts, including a cat. Heck, if I had 10 pictures of different animals and 10 neurons (or HTMs), I should be able to repeatedly show each of my 10 pictures at random with different scales and orientations and have my neurons learn to align themselves to each of the 10 animals. Yet the HTMs aren't going to work this way unless I wiggle the pictures around. Why this curious requirement?

Now, in defense of HTMs, I would point out that Jeff does not see this first generation of them as the end goal, but just a first prototype that illustrates the concept. I think he would quickly agree that the learning algorithm will continue to evolve. Not only will it become more efficient and faster as generations of engineers learn to apply and enhance it, but it will also come to be more robust. In fairness, I don't see that the quantization process necessarily has to happen before finding temporal relations; the two could happen in real time. Nor does the prediction part need to wait until after learning. The little right-angle black and white line drawings are not a necessity, and neither are temporal patterns relying on discrete two-step time periods. None of my complaints here represents a "gotcha", I think.

I have more to read, and I may take an opportunity to try coding this to reproduce this experiment and explore it more. We'll see. I have my own experiment that I started, inspired by my read of On Intelligence, which I have to start fleshing out, though. In the meantime, I'm likely to continue to comment on HTMs as I learn more. I still think they represent the most significant new concept in artificial intelligence in several decades.
