Patch equivalence

As I've been dodging about among areas of machine vision, I've been searching for similarities among the techniques those areas could employ. I think I've started to see at least one important similarity. For lack of a better term, I'm calling it "patch equivalence", or "PE".

The concept begins with a deceptively simple assertion about human perception: that there are neurons (or tight groups of them) that do nothing but compare two separate "patches" of input to see if they are the same. A "patch", generally, is just a tight region of neural tissue that brings in information from one region of the total input. With one eye, for example, a patch might represent a very small region of the total image that that eye sees. For hearing, as another example, a patch might be a fraction of a second spent listening to sounds within a somewhat narrow band of frequencies. A "scene", here, is a contiguous stretch of information that is roughly continuous in space (e.g., the whole image seen by one eye in a moment) or time (e.g., a few seconds of music heard by an ear). The claim here is that for any given patch of input, there is a neuron or small group of them that is looking at that patch and at another patch of the same size and resolution somewhere else in the scene. Further, that neuron (or group) is always looking at the same pair of places at any given time. It doesn't scan other areas of the scene; it watches just the pair of places it knows. We'll call this neuron or small group of neurons a "patch comparator".

From an engineering perspective, the PE concept is both seductively simple and horribly frightening. If I were designing a hardware solution from scratch, I imagine it would be quite easy to implement and could execute very quickly. When I think about a software simulation of such a machine, though, it's clear to me that it would be terribly slow to run. Imagine every pixel in the scene having a large number of patch comparators associated with it. Each one would look at a small patch - maybe 5 x 5 pixels, for instance - around that pixel and at a same-size patch somewhere else in the scene. One comparator might look 20 pixels to the left, another might look 1 pixel above that, another 2 pixels above, and so on until there's sufficient coverage within a certain radius around the central patch being compared. There could literally be thousands of patch comparisons for a single pixel in a single snapshot. Such an algorithm would not perform quickly, to say the least.
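To put a rough number on that cost, here is a back-of-the-envelope sketch in Python. The 5 x 5 patch size comes from above; the image dimensions and search radius are illustrative assumptions of mine:

```python
# Count the comparisons a brute-force software simulation would make.
# Image size and search radius are assumed for illustration only.
width, height = 640, 480
radius = 20  # assumed search radius around each central patch

# Every nonzero offset within the radius gets its own comparator.
offsets = [(dx, dy)
           for dx in range(-radius, radius + 1)
           for dy in range(-radius, radius + 1)
           if 0 < dx * dx + dy * dy <= radius * radius]

comparators_per_pixel = len(offsets)
total_comparisons = width * height * comparators_per_pixel
```

Even at this modest scale the simulation would perform over a thousand patch comparisons per pixel and hundreds of millions per snapshot, which is why a naive software implementation would crawl.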

Let's say the output of each patch comparator is a value from 0 to 1, where 1 indicates that the two patches are, pixel for pixel, identical and 0 means they are definitively different. Any value in between indicates a varying degree of similarity.
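As a minimal sketch, one comparator's output could be modeled as one minus the mean absolute pixel difference, normalized to a 0-255 intensity range. The measure itself is my assumption; the text above only constrains the two endpoints:

```python
import numpy as np

def patch_similarity(patch_a, patch_b):
    """Return a similarity in [0, 1]: 1 for identical patches, 0 for
    maximally different ones (a stand-in for one patch comparator)."""
    a = np.asarray(patch_a, dtype=float)
    b = np.asarray(patch_b, dtype=float)
    # Mean absolute difference, normalized by the maximum possible
    # pixel difference (assuming intensities in 0..255).
    return 1.0 - np.abs(a - b).mean() / 255.0

identical = patch_similarity([[10, 20], [30, 40]], [[10, 20], [30, 40]])
opposite = patch_similarity([[0, 0], [0, 0]], [[255, 255], [255, 255]])
```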

One might well ask what the output of such a process is. What's the point? To be honest, I'm still not entirely sure yet. It's a bit like asking what a brain would do with edge detection output. To my knowledge, nobody really knows in much detail yet.

Still, I can easily see how patch equivalence could be used in many facets of input processing. Consider binocular vision, for example. You've got images coming from both eyes, and you generally want to match up the objects you see in each eye, in part to help you know how far away each one is. One patch comparator could be looking at one place in one eye and the same place in the other. Another comparator could be looking at the same place in the left eye as before, but at a different place in the right eye, and so on. Naturally, there would be all sorts of "false positive" matches. But if we survey a bunch of comparators that are looking at the same offset and most of them are seeing matches at that offset, we would take the consensus as indicating a likely genuine match. We'd throw out all the other spurious matches as noise, for lack of a regional consensus.
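Here is a toy, 1-D version of that consensus vote in Python. The scene values, patch size, and 0.99 match threshold are all assumptions of mine, chosen just to make the voting visible:

```python
import numpy as np

def patch_sim(a, b):
    """Similarity in [0, 1] between two equal-size patches (0..255 values)."""
    return 1.0 - np.abs(a - b).mean() / 255.0

def disparity_consensus(left, right, patch=3, max_shift=4, threshold=0.99):
    """Vote over candidate horizontal shifts: the shift at which the most
    comparators report a near-identical match wins (a toy consensus)."""
    votes = {}
    for shift in range(max_shift + 1):
        votes[shift] = sum(
            1
            for x in range(len(left) - patch - shift)
            if patch_sim(left[x:x + patch],
                         right[x + shift:x + shift + patch]) >= threshold
        )
    return max(votes, key=votes.get)

# A toy 1-D scene: the bright feature appears 3 pixels further right in
# the right eye's image (np.roll only wraps zeros here, so no artifacts).
left = np.array([0, 0, 50, 200, 50, 0, 0, 0, 0, 0, 0, 0], dtype=float)
right = np.roll(left, 3)
best_shift = disparity_consensus(left, right)
```

Individual zero-valued patches produce spurious matches at other shifts, but the shift of 3 collects by far the most votes, which is the regional consensus the text describes.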

Pattern detection is another example of where this technique can be used. Have you ever studied a printed table of numeric or textual data where one column contains mostly a single value (e.g., "100" or "Blue")? Perhaps it's a song playlist with a dozen songs from one album, followed by a dozen from another. You scan down the list and see that the name of the first album is the same for the upper dozen songs. You don't even have to read them, because your visual system tells you they all look the same. That's pretty amazing, when you think about it. In fact, I've found I can scan down lists of hundreds of identical things looking for an exception, and can do it surprisingly quickly. It's not special to me, of course; we all can. How is it that my eyes instantly pick up on the similarity and call out the one item that's different? It's a repeating pattern, just like a checkerboard or bathroom tiles. A patch equivalence algorithm would find excellent use here. Given an offset roughly equal to the distance between the centers of two neighboring lines of text, a region of comparators would quickly come to a consensus that there's equivalence at that offset. Because the match is at the same moment and from the same eye, the conclusion would be that it probably comes from a repeating pattern. As a side note, this doesn't sufficiently explain how we detect less regular patterns, like a table full of differently colored candies, but I suspect PE can play a role in explaining that, too.
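A toy string-level analogue of that line-offset comparison, with made-up playlist contents: each "comparator" checks one row against the row one line below it, and breaks in the run of matches bracket the odd item out.

```python
# Toy playlist column: mostly one album name, with a single exception.
rows = ["Album A"] * 4 + ["Album B"] + ["Album A"] * 3

# One comparator per line, each at a fixed offset of one line down.
mismatches = [i for i in range(len(rows) - 1) if rows[i] != rows[i + 1]]

# Two adjacent mismatches bracket the exceptional row between them.
odd_row = mismatches[0] + 1
```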

What about motion? PE can help here, too. Imagine a layer of PE comparators that compare the image seen by one eye now with the image seen a fraction of a second ago. A ball is moving through the scene, so it sits in one place in one image and perhaps a little to the right of that in the next. Again, a region of patch comparators that sees the ball in its before and after positions lights up in consensus and thus effectively reports the position and velocity of the moving object.
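That temporal layer can be sketched in miniature with 1-D frames and exact-match comparators (the frame contents and search range are made up): the shift with the most agreeing comparators is reported as the motion.

```python
def estimate_shift(frame_then, frame_now, max_shift):
    """Vote over candidate shifts: the shift at which the most
    'temporal patch comparators' see identical values wins."""
    best, best_votes = 0, -1
    for s in range(-max_shift, max_shift + 1):
        votes = sum(
            1
            for x in range(len(frame_then))
            if 0 <= x + s < len(frame_now) and frame_then[x] == frame_now[x + s]
        )
        if votes > best_votes:
            best, best_votes = s, votes
    return best

# A "ball" (value 9) moves two pixels to the right between frames.
then = [0, 0, 9, 9, 0, 0, 0, 0]
now = [0, 0, 0, 0, 9, 9, 0, 0]
motion = estimate_shift(then, now, 3)
```

The consensus lands on a shift of two pixels per frame interval, which, divided by the time between frames, is a velocity estimate.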

I've focused on vision, but I do believe the patch equivalence concept can apply to other senses. Consider the act of listening to a song. The tempo of most songs is detected easily and quickly, and that alone can be explained by PE comparators looking at linear patches of frequency responses at different time offsets. Or a comparator could be looking not at low-level frequency responses, but at recognized patterns that represent snippets of instruments at different frequencies. In fact, it may well be that we mainly come to recognize distinct sounds as distinct only because they are repeated. A comparator might be looking at one two-dimensional patch, actually made up of several frequency bands in a small snippet of time, and at the same kind of patch at a different point in time. If it sees the exact same response in both moments, that fact could result in saving the patch's pattern in short-term memory for later. More repetitions could continue to reinforce this pattern until it's saved for longer-term recollection.
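The tempo idea can be sketched with a toy loudness envelope (the signal and lag range are my inventions): each comparator asks whether the sound now matches the sound a fixed lag ago, and the lag with the broadest consensus is taken as the beat period.

```python
# Toy loudness envelope, 8 samples per beat group: a loud onset (9)
# followed by three quiet samples (1), repeated for 8 beats.
envelope = [9, 1, 1, 1] * 8  # beat period = 4 samples

def best_period(signal, min_lag, max_lag):
    """Pick the time offset at which the most comparators see the same
    value 'now' as 'lag' samples earlier (a crude tempo vote)."""
    def votes(lag):
        return sum(1 for t in range(len(signal) - lag)
                   if signal[t] == signal[t + lag])
    return max(range(min_lag, max_lag + 1), key=votes)

period = best_period(envelope, 2, 6)
```

This is essentially a discrete autocorrelation expressed as comparator votes; the lag of 4 samples wins because every comparator at that offset agrees.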

This same principle of selecting patch patterns that repeat in space or time provides a strong explanation of how patterns would come to be considered important enough to remember. This is a rather hard problem in AI right now, in large part because selecting important features seems to presuppose that you can find punctuations between features -- an a priori definition of "important" -- like pauses between words or empty spaces around objects in a scene. Using PE, such punctuations may not even be necessary, which potentially allows a more amorphous conception of what a boundary really is.

