Thursday, November 10, 2005

Neuron banks and learning

[Audio Version]

I've been thinking more about perceptual-level thinking and how to implement it in software. In doing so, I've started formulating a model of how cortical neural networks might work, at least in part. I'm sure it's not an entirely new idea, but I haven't run across it in quite this form, so far.

One of the key questions I ask myself is: how does human neural tissue learn? And, building on Jeff Hawkins' memory-prediction model, I came up with at least one plausible answer. First, however, let me say that I use the term "neuron" here loosely. The mechanisms I ascribe to individual neurons may turn out to be more a function of groups of them working in concert.

Let me start with the notion of a group of neurons in a "neural bank". A bank is simply a group of neurons that are all looking at the same inputs, as illustrated in the following figure:

Figure: Schematic view of neuron bank.

Perhaps it's a region of the input coming from the auditory nerves. Or perhaps it's looking at more refined input from several different senses. Or perhaps even a more abstract set of concepts at a still higher level. It may not be that there are large numbers of neurons that all look at the same chunk of inputs -- it may be more messy than that -- but this is a helpful idea, as we'll soon see. Further, while I'll speak of neural banks as though they all fall into a single "layer" in the sense that traditional artificial neural networks are arranged, it's more likely that this neural bank idea applies to an entire patch of 6-layered cortical tissue in one's brain. Still, I don't want to get mired in such details in this discussion.

Each neuron in a bank is hungry to contribute to the whole process. In a naive state, they might all simply fire, but such a cacophony would probably be counterproductive. In fact, our neural banks could be hard-wired to favor having a minimal number of neurons in a bank firing at any given time -- ideally, zero or one. So each neuron is eager to fire, but the bank, as a whole, doesn't want them to fire all at once.

These two forces act in tension to balance things out. How? Imagine that each neuron in a bank is such that when it fires, its signal tends to suppress the other neurons in the bank. Suppress how? In two ways: firing and learning. When a neuron is highly sure that it is perceiving a pattern it has learned, it fires very strongly. Other neurons that may be firing because they have weak matches would silence themselves in deference to these louder neurons, on the assumption that the louder neurons must have more reason to be sure of the patterns they perceive. Consider the following figure, modified from the one above to show this feedback:

Figure: Neuron bank with feedback from neighbors.

But what about learning? What does a neuron learn, and why would we want other neurons to suppress it? First, what a neuron learns is one or more patterns. For simplicity, let's say it's a simple, binary pattern. For each of a neuron's dendritic synapses looking at input from outside axons, we'll say the synapse either cares or doesn't care about its input and, if it does care, that it prefers either a firing or a not-firing value. The following figure illustrates this, schematically:

Figure: Detail of a synapse.

Following is a logical behavior table for a single synapse. For the inputs a synapse cares about, it is equivalent to a logical equivalence operation -- the negation of an exclusive or (XNOR):

Preferred Input   Actual Input   Matches
0                 0              Yes
0                 1              No
1                 0              No
1                 1              Yes
x (don't care)    0              Yes
x (don't care)    1              Yes

Let's describe the desired input pattern in terms of a string of zeros (not firing), ones (firing), and exes (don't care). For example, a neuron might prefer to see "x x 0 x 1 0 x 1 0 0 x 0 x x 1". When it sees this exact pattern, it fires strongly. But maybe all but one of the inputs it cares about fit. It still fires, but not as strongly. If another neuron is firing more strongly, this one shuts up.
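To make this concrete, here's a toy sketch in Python of the match-strength calculation. The function name and the scoring rule (fraction of cared-about inputs that agree) are just illustrative choices of mine, not anything settled:

```python
def match_strength(pattern, inputs):
    """Fraction of cared-about inputs that agree with the preference.

    pattern: string of '0' (prefer not firing), '1' (prefer firing),
             and 'x' (don't care).
    inputs:  string of '0'/'1' actual input values.
    """
    cared = [(p, a) for p, a in zip(pattern, inputs) if p != 'x']
    if not cared:
        return 0.0  # a neuron that cares about nothing has nothing to say
    return sum(1 for p, a in cared if p == a) / len(cared)

# The example pattern from the text, with spaces removed:
pattern = "xx0x10x100x0xx1"
print(match_strength(pattern, "110110110000111"))  # exact fit -> 1.0
print(match_strength(pattern, "110111110000111"))  # one miss -> 0.875
```

A neuron might fire with a strength proportional to this value, shutting up whenever a neighbor reports a higher one.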

That's what's learned, but not how it's learned. Let's consider that more directly. A neuron that fires on a regular basis is "happy" with what it knows. It's useful. It doesn't need to learn anything else, it seems. But what about a neuron that never gets a chance to fire because its pattern doesn't match much of anything? I argue that this "unhappy" neuron wants very much to be useful. It searches for novel patterns. What does this mean? There are many possible mechanisms, but let's consider just one. We'll assume all the neurons started out with random synaptic settings (0, 1, or x). Now let's say that there is a certain combination of inputs for which no neuron in the bank shouts out to say "I got this one". Some of these neurons see that some of the inputs do match. These are inclined to believe that this input is probably a pattern that can be learned, so they change some of their "wrong" settings to better match the current input. The stronger the partial match already is for a given unhappy neuron, the more changes that neuron is likely to make to conform to this new input.

Now let's say this particular combination of input values (0s and 1s) continues to appear. At least one neuron will continue to grow ever more biased toward matching that pattern until eventually it starts shouting out like the other "happy" neurons do.
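Here's an equally toy sketch of that novelty-driven learning rule. The firing threshold, the flip rule, and all the names are my own guesses at one plausible mechanism, nothing more:

```python
import random

FIRE_THRESHOLD = 0.9  # arbitrary: how sure a neuron must be to "shout"

def match_strength(pattern, inputs):
    cared = [(p, a) for p, a in zip(pattern, inputs) if p != 'x']
    if not cared:
        return 0.0
    return sum(1 for p, a in cared if p == a) / len(cared)

def learn_step(bank, inputs, rng=random):
    """bank: list of pattern strings; returns the adjusted bank."""
    strengths = [match_strength(p, inputs) for p in bank]
    if max(strengths) >= FIRE_THRESHOLD:
        return bank  # some neuron already shouts "I got this one"
    new_bank = []
    for pat, s in zip(bank, strengths):
        wrong = [i for i, (p, a) in enumerate(zip(pat, inputs))
                 if p != 'x' and p != a]
        # Stronger partial matches make bolder changes toward the input.
        n_flip = max(1, round(s * len(wrong))) if (wrong and s > 0) else 0
        flips = set(rng.sample(wrong, n_flip)) if n_flip else set()
        new_bank.append(''.join(a if i in flips else p
                                for i, (p, a) in enumerate(zip(pat, inputs))))
    return new_bank

bank = ["1x00", "0x1x"]
for _ in range(3):
    bank = learn_step(bank, "1010")
# After repeated exposure, at least one neuron matches "1010" perfectly.
```

Neurons with no partial match at all never adapt here; a fuller model would presumably give even them some small chance to drift.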

This does seem to satisfy a basic definition of learning. But it does leave many questions unanswered. One is: how does a neuron decide whether or not to care about an input? I don't know for certain, but here's one plausible mechanism. A neuron -- whether "happy" or "unhappy" with what it knows -- can allow its synaptic settings to change over time. Consider a happy one. It continues to see its favored pattern and fires whenever it does. Seeing no other neurons contending to be the best at matching its pattern, it is free to continue learning in a new way. In particular, it looks for patterns at the individual synapse level. If one synaptic input is almost always the same value whenever this neuron fires, it favors setting that synapse to "do care". If, conversely, that input changes with some regularity, the neuron will favor setting it to "don't care".

Interestingly, this leads to a new set of possible contentions and opportunities for new knowledge. One key problem in conceptualization is learning when to recognize that two concepts should be merged and when one concept should be subdivided into other narrower ones. When do you learn to recognize two different dogs are actually part of the same group of objects called "dogs"? And why do you decide that a chimpanzee, which looks like a person, is really a wholly new kind of thing that deserves its own concept?

Imagine that there is one neuron in a bank of them that has mastered the art of recognizing a basset hound dog. And let's say that's the only kind of dog this brain has ever seen before. It has seen many different bassets, but no other breed. This neuron's pattern recognition is greedy, seeing all the particular facets of bassets as essential to what dogs are all about. Then, one day, this brain sees a Doberman pinscher for the first time. To this neuron, it seems very like a basset, but there are enough features to be doubtful. Still, nobody else is firing strongly, so this one might as well, considering itself to have the best guess. This neuron is strongly invested in a specific kind of dog, though. It would be worthwhile to have another neuron devoted to recognizing this other kind of dog. What's more, it would be valuable to have yet another neuron that recognizes dogs more generally. How would that come about?

In theory, there are other neurons in this bank that are hungry to learn new patterns. One of them could see the lack of a strong response from any other neuron as an opportunity to learn either the more specific Dobie pattern or the more general dog pattern.

One potential problem is that the neurons that detect more specific features -- bassets versus all dogs, for example -- might tend to crowd out more general concepts like "dog". There must be some incentive to keep the general concept around. One explanation could be frequency. The dog neuron might not have as many matching features to consider as the basset neuron does, but if this brain sees lots of different dogs and only occasionally sees bassets, the dog neuron would get exercised more frequently, even if it doesn't shout the loudest when a basset is seen. So perhaps both frequency and strength of matching serve as signals to a neuron that it has learned its pattern well.

I have no doubt that there's much more to learning, and to the neocortex more generally. Still, this seems a plausible model for how learning could happen there.

Thursday, November 3, 2005

A standardized test of perceptual capability

I've been getting too lost in the idiosyncrasies of machine vision of late and missing my more important focus on intelligence, per se. I'm changing direction, now.

My recent experiences have shown me that one area where we really haven't done well is perceptual-level intelligence. We have great sensors and cool algorithms for generating interesting but primitive information about the world. Edge detection, for example, can be used to generate a series of lines in a visual scene. But so what? Lines are just about as disconnected from intelligence as the raw pixel colors are.

Where do primitive visual features become percepts? Naturally, we have plenty of systems designed to instantly translate visual (or other sensory) information into known percepts. Put little red dots around a room, for instance, and a visual system can easily cue in on them as being key markers for a controlled-environment system. This is the sort of thinking that is used in vision-based quality control systems, too.

But what we don't have yet is a way for a machine to learn to recognize new percepts and learn to characterize and predict their behavior. I've been spending many years thinking about this problem. While I can't say I have a complete answer yet, I do have some ideas. I want to try them out. Recently, while thinking about the problem, I formulated an interesting way to test a perceptual-level machine's ability to learn and make predictions. I think it can be readily reproduced on many other systems and extended for ever more capable systems.

The test involves a very simplified, visual world composed of a black rectangular "planet" and populated by a white "ball". The ball, a small circle whose size never changes, moves around this 2D world in a variety of ways that, for the most part, are formulaic. One way, for example, might be thought of as a ball in a box in space. Another can be thought of as a ball in a box standing upright on Earth, meaning it bounces around in parabolic paths as though in the presence of a gravitational field. Other variants might involve random adjustments to velocity, just to make prediction more difficult.
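As one example of a formulaic behavior, here's a minimal sketch of the "ball in a box on Earth" variant. The world dimensions, the gravity constant, and the perfectly elastic bounce are arbitrary choices of mine, not part of any spec:

```python
def step(state, width=200.0, height=150.0, gravity=0.5):
    """Advance one moment. state is (x, y, vx, vy); y grows downward."""
    x, y, vx, vy = state
    vy += gravity                # constant gravitational acceleration
    x, y = x + vx, y + vy
    if x < 0 or x > width:       # elastic bounce off the side walls
        vx = -vx
        x = min(max(x, 0.0), width)
    if y < 0 or y > height:      # elastic bounce off floor and ceiling
        vy = -vy
        y = min(max(y, 0.0), height)
    return (x, y, vx, vy)

state = (100.0, 75.0, 3.0, 0.0)
trajectory = [state]
for _ in range(500):
    state = step(state)
    trajectory.append(state)
# The ball traces parabolic arcs and never leaves the world.
```

A "random adjustments" variant would simply jitter vx and vy each step before moving.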

The test "organism" would be able to see the whole of this world. It would have a "pointer". Its goal would be to move this pointer to wherever it believes the ball will be in the next moment. It would be able to tell where the pointer currently points using a direct sense separate from its vision.

Predicting where the ball will be in the future is a very interesting test of an organism's ability to learn to understand the nature of a percept. Measuring the competency of a test organism would be very easy, too. For each moment, there is a prediction, in the form of the pointer pointing to where the organism believes the ball will be in the next moment. When that moment comes, the distance between the predicted and actual positions of the ball is calculated. For any given series of moments, the average distance would be the score of the organism in that context.

It would be easy for different researchers to compare their test organisms against others, but would require a little bit of care to put each test in a clear context. The context would be defined by a few variables. First is the ball behavior algorithm that is used. Each such behavior should be given a unique name and a formal description that can be easily implemented in code in just about any programming language. Second, the number of moments used to "warm up", which we'll call the "warm up period". That is, it should take a while for an organism to learn about the ball's behavior before it can be any good at making predictions. Third, the "test period"; i.e., the number of moments after the warm-up period is done in which test measurements are taken. The final score in this context, then, would be the average of all the distances measured between prediction and actual position.

There would be two standard figures that should be disclosed with any given test results. One is that the best possible score is 0, which means the predictions are always correct. The second is the best possible score for a "lazy" organism. In this case, a lazy organism is one that always guesses that the ball will be in the same place in the next moment that it is now. Naturally, a naive organism would do worse than this cheap approximation, but a competent organism should do better. The "lazy score" for a specific test run would be calculated as the average of all distances from each moment's ball position to its next moment's position. A weighted score for the organism could then be calculated as a ratio of actual score to lazy score. A value of zero would be the best possible. A value of one would indicate that the predictions are no better than the lazy score. A value greater than one would indicate that the predictions are actually worse than the lazy algorithm.
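The scoring scheme above is easy to pin down in code. Here's a sketch; the helper names are mine, and positions are simply (x, y) pairs:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def score(predictions, actual):
    """predictions[t] is the organism's guess for actual[t + 1]."""
    errs = [dist(p, a) for p, a in zip(predictions, actual[1:])]
    return sum(errs) / len(errs)

def lazy_score(actual):
    # The "lazy" organism always predicts the ball stays where it is,
    # so its predictions are just the actual positions themselves.
    return score(actual, actual)

def weighted_score(predictions, actual):
    """0 is perfect; 1 equals the lazy baseline; above 1 is worse."""
    return score(predictions, actual) / lazy_score(actual)

# A ball drifting right one unit per moment:
actual = [(float(t), 0.0) for t in range(6)]
perfect = [(float(t + 1), 0.0) for t in range(6)]
print(lazy_score(actual))               # -> 1.0
print(weighted_score(perfect, actual))  # -> 0.0
```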

Some might quip that I'm just proposing a "blocks world" type experiment and that an "organism" competent to play this game wouldn't have to be very smart. I disagree. Yes, a programmer could preprogram an organism with all the knowledge it needs to solve the problem and even get a perfect score. A proper disclosure of the algorithm used would let fellow researchers quickly disqualify such trickery. So would testing that single program against novel ball behaviors. What's more, I think a sincere attempt to develop organisms that can solve this sort of problem in a generalizable way will result in algorithms that can be generalized to more sophisticated problems like vision in natural settings.

Naturally, this test can also be extended in sophistication. Perhaps there could be a series of levels defined for the test. This might be Level I. Level II might involve multiple balls of different colors. And so on.

I probably will draft a formal specification for this test soon. I welcome input from others interested in the idea.

Saturday, October 29, 2005

Using your face and a webcam to control a computer

I don't normally do reviews of ordinary products. Still, I tried out an interesting one recently that makes practical use of a fairly straightforward machine vision technique that I thought worth describing.

The product is called EyeTwig and is billed as a "head mouse". That is, you put a camera near your computer monitor, aim it at your face, and run the program. Then, when you move your head left and right, up and down, the Windows cursor, typically controlled by your mouse, moves about the screen in a surprisingly intuitive and smooth fashion.

Most people would recognize the implication that this could be used by the disabled. I thought about it, though, and realized that this application is limited mainly to those without mobility below the neck. And many of those in that situation have limited mobility of their heads. Still, a niche market is still a market. I think the product's creator sees that the real potential lies in an upcoming version that will also be useful as a game controller.

In any event, the program impressed me enough to wonder how it works. The vendor was unwilling to tell me in detail, but I took a stab at hypothesizing how it works and ran some simple experiments. I think the technique is fascinating by itself, but it could also be used in kiosks, military systems, and various other interesting applications.

When I first saw how EyeTwig worked, I was impressed. I wondered what sorts of techniques it might use for recognizing a face and realizing that its orientation is changing. The more I studied how it behaved, though, the more I realized it uses a very simple set of techniques. I realized, for example, that it ultimately uses 2D techniques, not 3D ones. Although the instructions are to tilt your head, I found that simply shifting my head left and right, up and down worked just as well.

The process by which machines recognize faces is now a rather conventional one. My understanding is that most techniques start by searching for the eyes on a face. It is almost universal that human eyes will be found as two dark patches (eye sockets are usually shadowed) of similar size, roughly side by side, and separated by a fairly predictable multiple of their size. So programs find candidate patch pairs, assume they are eyes, and then look for the remaining facial features in relation to those patches.
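A crude sketch of such a pairing heuristic might look like the following. All the thresholds and names here are my own guesses at the general approach, certainly not EyeTwig's actual method:

```python
def pair_eyes(patches, max_size_ratio=1.5, max_tilt=0.3,
              min_sep=1.0, max_sep=4.0):
    """patches: (x, y, size) tuples for dark regions found in an image.

    Returns pairs that are similar in size, roughly level, and spaced
    a plausible multiple of their size apart -- eye candidates.
    """
    pairs = []
    for i, (x1, y1, s1) in enumerate(patches):
        for x2, y2, s2 in patches[i + 1:]:
            dx = abs(x2 - x1)
            size_ok = max(s1, s2) / min(s1, s2) <= max_size_ratio
            level_ok = abs(y2 - y1) <= max_tilt * max(dx, 1)
            sep = dx / ((s1 + s2) / 2)   # spacing relative to patch size
            if size_ok and level_ok and min_sep <= sep <= max_sep:
                pairs.append(((x1, y1, s1), (x2, y2, s2)))
    return pairs

# Two eye-sized dark patches side by side, plus an unrelated larger blob:
print(pair_eyes([(10, 20, 5), (25, 21, 5), (40, 60, 12)]))
```

Each surviving pair then anchors the search for the remaining facial features.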

Figure: Using a white-board to simulate a face.

EyeTwig appears to be no different. In addition to finding eyes, though, I discovered that it looks for what I'll loosely call a "chin feature". It could be a mustache, a mouth, or some other horizontal, dark feature directly under the eyes. I discovered this by experimenting with abstract drawings of the human face. My goal was to see how minimal a drawing could be and still suffice for EyeTwig to work. The figure at right shows one of the minimal designs that worked very well: a small white-board with two vertical lines for eyes and one horizontal line for a "chin". When I slid the board left and right, up and down, EyeTwig moved the cursor as expected.

One thing that made testing this program out much easier is the fact that the border of the program's viewer changes color between red and green to indicate whether it recognizes what it sees as a face.

In short, EyeTwig employs an ingenious, yet simple technique for recognizing that a face is prominently featured in the view of a simple web-cam. No special training of the software is required for that. For someone looking to deploy practical face recognition applications, this seems to provide an interesting illustration and technique.

Saturday, October 8, 2005

Stereo disparity edge maps

I've been experimenting further with stereo vision. Recently, I made a small breakthrough that I thought worth describing for the benefit of other researchers working toward the same end.

One key goal of mine with respect to stereo vision has been the same as for most involved in the subject: being able to tell how far away things in a scene are from the camera or, at least, relative to one another. If you wear glasses or contact lenses, you've probably seen that test of your depth perception in which you look through polarizing glasses at a sheet of black rings and attempt to tell which one looks like it is "floating" above the others. It's astonishing to me just how little disparity there has to be between images in one's left and right eyes in order for one to tell which ring is different from the others.

Other researchers have used a variety of techniques for getting a machine to have this sort of perception. I am currently using a combination of techniques. Let me describe them briefly.

First, when the program starts up, the eyes have to get focused on the same thing. Both eyes start out with a focus box -- a rectangular region smaller than the image each eye sees and analogous to the human fovea -- that is centered on the image. The first thing that happens once the eyes see the world is that the focus boxes are matched up using a basic patch equivalence technique. In this case, a "full alignment" involves moving the right eye's focus patch in a grid pattern over the whole field of view of the right eye in large increments (e.g., 10 pixels horizontally and vertically). The best-matching place then becomes the center of a second scan in single pixel increments in a tighter region to find precisely the best matching placement for the right field of view.

The full alignment operation is expensive in terms of time: about three seconds on my laptop. With every tenth snapshot taken by the eyes, I perform a "horizontal alignment", a trimmed-down version of the full alignment. This time, however, the test does not involve moving the right focus box up or down relative to its current position; only left and right. This, too, can be expensive: about 1 second for me. So finally, with each snapshot taken, I perform a "lite" horizontal alignment, which involves looking only a little to the left and to the right of the current position of the focus box. This takes less than a second on my laptop, which makes it cheap enough to perform with every snapshot. The result is that the eyes generally line their focus boxes up quickly on the objects in the scene as they are pointed at different viewpoints. If a jump is too dramatic for the lite horizontal alignment process to follow, the horizontal or full alignment passes eventually correct for it.
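For the curious, here's a bare-bones sketch of the coarse-then-fine full alignment, using sum-of-absolute-differences as a stand-in for whichever patch-equivalence measure one prefers. Images here are just lists of rows of grayscale values, and the names are mine:

```python
def sad(img, x, y, patch):
    """Sum of absolute differences between patch and img at (x, y)."""
    total = 0
    for j, row in enumerate(patch):
        for i, v in enumerate(row):
            total += abs(img[y + j][x + i] - v)
    return total

def best_position(img, patch, candidates):
    """Of the in-bounds candidate placements, return the best match."""
    h, w = len(patch), len(patch[0])
    valid = [(x, y) for x, y in candidates
             if 0 <= x <= len(img[0]) - w and 0 <= y <= len(img) - h]
    return min(valid, key=lambda xy: sad(img, xy[0], xy[1], patch))

def full_alignment(img, patch, step=10):
    """Coarse grid scan, then a single-pixel scan around the winner."""
    h, w = len(patch), len(patch[0])
    coarse = [(x, y) for y in range(0, len(img) - h + 1, step)
                     for x in range(0, len(img[0]) - w + 1, step)]
    cx, cy = best_position(img, patch, coarse)
    fine = [(cx + dx, cy + dy) for dy in range(-step, step + 1)
                               for dx in range(-step, step + 1)]
    return best_position(img, patch, fine)
```

The horizontal and "lite" variants would simply restrict the candidate list to placements on the current row.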

Once the focus boxes are lined up, the next step is clear. For each part of the scene that is in the left focus box, look for its mate in the right focus box. Then calculate how many pixels offset the left and right versions are from each other. Those with zero offsets are at a "neutral" distance, relative to the focus boxes. Those with the right versions' offsets being positive (a little to the right) are probably farther away. And those with the right hand features having negative offsets (a little to the left) are probably closer. This much is conventional wisdom. And the math is actually simple enough that one can even estimate absolute distances from the camera, given that some numeric factors about the cameras are known in advance.

The important question, then, is how to match features in the left focus box with the same features in the right. I chose to use a variant of the same patch equivalence technique I use for lining up the focus boxes. In this case, I break down the left focus box into a lot of little patches -- one for each pixel in the box. Each patch is about 9 pixels wide. What's interesting, though, is that I'm using 1-dimensional patches, which means each patch is only one pixel high. For each patch in this tight grid of (overlapping) patches in the left focus box, there is a matching patch in the right focus box, too. Initially, its center is exactly the same as for the left one, relative to the focus box. For each patch in the left side, then, we move its right-hand mate from left to right from about -4 to +4 pixels. Whichever place yields the lowest difference is considered the best match. That place, then, is considered to be where the right-hand pixel is for the one we're considering on the left, and hence we have our horizontal offset.
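The per-pixel search itself is tiny. Here's a sketch for a single one-dimensional, nine-pixel patch (the names are mine):

```python
def row_disparity(left_row, right_row, x, half=4, search=4):
    """Best horizontal offset for the 1-D patch centered at x.

    The patch is (2 * half + 1) pixels wide and one pixel high; its
    right-eye mate is slid from -search to +search pixels.
    """
    patch = left_row[x - half:x + half + 1]
    best_d, best_off = None, 0
    for off in range(-search, search + 1):
        lo = x - half + off
        if lo < 0 or lo + len(patch) > len(right_row):
            continue  # mate would fall outside the focus box
        d = sum(abs(p - q)
                for p, q in zip(patch, right_row[lo:lo + len(patch)]))
        if best_d is None or d < best_d:
            best_d, best_off = d, off
    return best_off
```

Run over every edge pixel of every row, this yields the horizontal offset map from which relative distances are read.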

For the large fields of homogeneous color in a typical image, it doesn't make sense to use patch equivalence testing. It makes more sense to focus instead on the strong features in the image. So to the above, I added a traditional Sobel edge detection algorithm. I use it to scan the right focus box, but I only use the vertical test. That means I find strong vertical edges and largely ignore strong horizontal edges. Why do this? Stereo disparity tests with two eyes side by side only work well with strong vertical features. So only pixels in the image that get high values from the Sobel test are considered using the above technique.
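The vertical-only test amounts to convolving with the Sobel x-kernel, which measures horizontal gradient and so responds to vertical edges, then keeping only the strong responses. A sketch, with an arbitrary threshold of my choosing:

```python
SOBEL_X = [(-1, 0, 1),
           (-2, 0, 2),
           (-1, 0, 1)]

def vertical_edge_strength(img, x, y):
    """Magnitude of the horizontal gradient at an interior pixel."""
    return abs(sum(SOBEL_X[j][i] * img[y - 1 + j][x - 1 + i]
                   for j in range(3) for i in range(3)))

def strong_vertical_edges(img, threshold=200):
    """Interior pixels worth running the disparity search on."""
    h, w = len(img), len(img[0])
    return [(x, y) for y in range(1, h - 1) for x in range(1, w - 1)
            if vertical_edge_strength(img, x, y) >= threshold]
```

A purely horizontal step edge produces a zero response here, which is exactly the point: only vertical features are kept.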

This whole operation takes a little under a second on my laptop -- not bad.

Following are some preliminary image sets that show test results. Here's how to interpret them. The first two images in each set are the left and right fields of view, respectively. The third image is a "result" image. That is, it shows features within the focus box and indicates their relative distance to the camera. Strongly green features are closer to the camera, strongly red features are farther away, and black features are at relatively neutral distances with respect to the focus box pair. The largely white areas represent areas with few strong vertical features and are hence ignored in the tests.

In all, I'm impressed with the results. One can't say that the output images are unambiguous in what they say about perceived relative distance. Some far-away objects show tinges of green and some nearby objects show tinges of red, which of course doesn't make sense. Yet overall, there are strong trends that suggest this technique is actually working. With some good engineering, the quality of the results can be improved. Better cameras wouldn't hurt, either.

One thing I haven't addressed yet is the "white" areas. A system based on this might see the world as though it were made up of "wire frame" objects. If I want to have a vision system that's aware of things as being solid and having substance, it'll be necessary to determine how far away the areas among the sharp vertical edges are, too. I'm certain that a lot of that has to do with inferences our visual systems make based on the known edges and knowledge of how matter works. Obviously, I have a long way to go to achieve that.

Sunday, September 25, 2005

Some stereo vision illusions

While engaging in some stereo vision experiments, I found myself a little stuck. I stopped working for a while and started staring at a wall on the opposite side of the room, pondering how my own eyes deal with depth perception. I crossed my eyes to study certain facets of my visual system.

I got especially interested when I crossed my eyes so that the curtains on either side of the doorway were overlapped. I wasn't surprised to find my eyes were only too happy to lock the two together, given how similar they looked. I was, however, surprised to see how well my visual system fused various differences between the two images together into a single end product. It even became difficult to tell which component of the combined scene came from which eye without closing one eye.

I thought it worthwhile to create some visual illusions based on some of these observations. To view them, you'll need to cross your eyes so that your right eye looks at the left image and vice-versa.

Plain, no illusion.

This first figure, above, is just for practice. Both sides are identical and should form a 3D scene of a doorway with curtains on either side. The curtains should appear recessed slightly behind the wall. If they appear in front, your eyes are not properly crossed.

Missing Curtain.

This second figure presents an interesting dilemma for your eyes. (If you have trouble focusing, try using the upper or lower corners of the door frame to get your eyes locked into the scene.) You know there are two curtains and your vision expects them, but one is missing from just one side. You may find the "phantom" curtain floats left and right and even forward and backward as your eyes go searching for its "other half". Interestingly, you'll find that much of the time, it doesn't appear to be "half as green". Rather than appear like a dimmer version of the curtain on the right, it should typically appear to have exactly the same color. It's as though your eyes ignore the black background and accept the green curtain.

Missing dots.

This figure is quite fascinating to me. Two dots on the left are half trimmed away in one eye and another dot is totally missing in the other. As before, your visual system should accept the fact that the dots really are there and that you're just having trouble finding them with one eye each. Again, the "phantom" dots are just as purple as the ones that have perfect mates on both sides. Note how the phantom can be darker or lighter than its background without any impact on this effect. Also, you'll find the phantom dot on the right floats back and forth as your eyes try to find its mate, yet the same is not true for the half-dot phantoms. They seem to be solidly fixed horizontally by the rest of the dots. Interestingly, I find little chunks of these phantom dot-halves seem to come and go as my vision tries to decide if they really should be whole dots or half dots. It almost seems to compromise by concluding that they are "flatter" dots - ellipses that are as wide as the other dots but a little shorter, vertically.

Missing stripes.

This figure is fairly straightforward. One curtain has phantom stripes just like the other one's. The stripes appear to veer back and forth a little. They also do appear to be lighter, most of the time.

Mismatched stripes.

This one is a bit more subtle than the others. The number of stripes in one of the curtains is not the same in both images. This is because one of the curtain views has the stripes significantly farther apart. Your vision will probably fight over different interpretations. One is that the lines have varying spacing from outside to inside. A variant of this is that the curtain is actually a round column. Another is that the lines are somehow behind or in front of the curtain. What's most interesting to me is that my eyes never seem to give away the fact that there is a different number of lines for the left (5) and right (4) versions. My eyes are sure they find matches for each and every line.

Bloody curtain.

This figure presents a somewhat different illusion. One of the versions of the right curtain appears to have a blood-red stain dripping down from the top. Again, the colors don't really blend. Your vision should pick one color or the other. It most likely will pick the red "stain", though I find with some effort, I can make the red stain almost completely disappear. This only works for me when I stare right at the top of the right-hand curtain, where the red is. If I stare at the bottom of either curtain, the red stubbornly remains.

Miscolored curtain.

This final figure is much more difficult to reconcile, I find. Because the right curtain's red and green alternatives are so different, my eyes frequently try shifting to find better candidates. If I stare at the top or bottom of that curtain, it helps to lock them together, though, suggesting that the corners are stronger features than the vertical edges of the curtain, alone. The color of this curtain never seems to stabilize. Curiously, when I stare at the lower part of the right curtain, it seems more likely to settle on green, yet the rest of the curtain above vacillates between green and red even while this part is stable. When I stare at the center, the whole bar is likely to go back and forth between the colors, but it's almost equally likely that the bar will appear to have shades of both colors at the same time. And blinking is almost certain to instantly disrupt whatever color it starts to settle on.

I hope you find these stereo visual illusions thought provoking. They shed some light for me on the task ahead as I continue development of stereo vision software.

Wednesday, September 21, 2005

Topics in machine vision

Once again, I've forgotten to announce a sub-site I created back on August 28th that I call Topics in Machine Vision.

Unlike my earlier Introduction to Machine Vision, it does not set out to give a broad overview of the subject matter. Instead, it's geared toward the researcher with at least some familiarity with the subject. Also, whereas I intended the introduction to stand complete on its own, Topics is more organic, meaning that I'll continue to add content to it as time passes.

Knowing that this could get to be difficult to read and manage, I've broken down Topics into separate sections and pages. The first section I've fleshed out is on the Patch Equivalence concept I introduced in an earlier blog entry here. In fact, once I introduced this topic in detail, I went back and ran some experiments in application of the PE concept to stereo vision and published the results, including tons of example images that demonstrate both the strengths and weaknesses of my implementation at the time.

I intend to tackle plenty of other topics, including generic views, lighting effects, and application of the memory-prediction model, for instance.

Friday, August 26, 2005

Introduction to machine vision

[Audio Version]

Recently, I completely forgot to mention that I published a brief introduction to machine vision (click here) on August 14th. It's tailored to people who want to better understand the subject but haven't had much exposure beyond the popular media's thin portrayal of it. By contrast, much of what's written that gets into the nuts and bolts is difficult to read because it requires complex math skills or otherwise expects you to already have a fairly strong background in the subject.

I'm especially fond of demystifying subjects that look exceptionally complex. Machine vision often seems like a perfect example of pinheads making the world seem too complicated and their work more impressive than it really is. Sometimes it comes down to pure hucksterism, as cheap tricks and carefully groomed examples are employed in pursuit of funding or publicity. Then again, there's an awful lot of very good and creative work out there. It's fun to show that much of the subject can be approachable even to novice programmers and non-programmers.

I spent a few months putting the introduction together. I'm not entirely happy with the final result, as I imagined it would have a much broader scope. Ultimately, a lack of sufficient time to devote to it meant I had to leave out interesting applications of the basics like optical character recognition (OCR) and face recognition for fear that it would never be done.

I am, however, starting work on a less ambitious project to address more esoteric topics in machine vision. I should begin publishing drafts of early material within it very soon.

Sunday, August 14, 2005

Bob Mottram, crafty fellow

[Audio Version]

I sometimes use my rickety platform here to review new technologies and web sites, but I haven't done enough to give kudos to the unusual people in AI that dot the world and sometimes find their way online. Bob Mottram is one such person that deserves mention.

Bob Mottram with his creation, Rodney

Who is Bob Mottram? He's a 33-ish-year-old British programmer with a keen interest in the field of Artificial Intelligence. He seems to be fairly well read on a variety of current studies and technologies. What starts to make him stand out is his active participation in the efforts. Like me, he finds that many of the documents out there describing AI technologies sound tantalizingly detailed but are actually quite opaque when it comes to the details. Unlike most, however, he takes this simply as a challenge to surpass. He designs and codes and experiments until his results start to look like what is described in the literature.

The next thing that sets Mottram apart is his willingness to step outside the bounds of simply duplicating other people's work. He applies what he learns and hypothesizes about new ways of solving problems, going so far as to envision tackling the high goal of duplicating the inner workings of the brain in software.

Perhaps what really sets Bob Mottram apart, for me, is his willingness to take his work public. His web site is chock full not only of listings of projects he's worked on, but also of keen and easy-to-read insights on what he's learned along the way. He also has the venerable habit of peppering his material with links to related content as background and credit.

Mottram's web site has a fascinating smattering of content about various projects he's worked on. The one that first got my attention was his "Rodney" project. Named after Rodney Brooks, creator of the famous Genghis and Cog robots, Rodney is Mottram's low-budget answer to Cog.

Rodney the robot

Through a set of successive iterations, Mottram has built Rodney to be ever more sophisticated as a piece of hardware but, more importantly, has continued to experiment with a variety of different sensing and control techniques. His project web site documents many of these experiments. He also makes available much of his source code.

What got my attention in the first place was his page on Rodney's vision system. Do a Google search on "robot stereo vision" or a variety of related terms and you're likely to find Bob Mottram's page on his research. It's not necessarily that his work is really groundbreaking; it's just that he's one of the only people to really document his work. As I was doing background research for an upcoming introduction to machine vision, I found his site over and over again in relation to certain kinds of techniques he's implemented and documented.

Seeing the general utility of the vision system he was creating for Rodney, Mottram moved on to his Sentience project. The primary goal was to extract and make open-source a software component that can use input from two cameras to construct a 3D model of what the eyes see.

Other Mottram stuff

Mottram's web site includes plenty of other interesting and arcane experiments. Many are whimsical applications of his experiments with stereo vision and detecting motion and change in images, like a Space Invader type game where the player's image is transposed with the aliens or a program that detects people moving within a stationary webcam's field of view. Some delve deeper into new research, like his face detection and imitation work or his Robocore project.

Finally, Mottram has his very own blog. It's not specifically for AI, but does include various insights into the subject from time to time.

In all, I give Bob Mottram a good heap of credit for being a crafty fellow who is sincere in his belief in and pursuit of the goals of Artificial Intelligence. And he gets major kudos for sharing his work online for geeks like me. Do check out his web site.

Thursday, August 11, 2005

Stereo vision: measuring object distance using pixel offset

[Audio Version]

I've had some scraps of time here and there to put toward progress on my stereo vision experiments. My previous blog entry described how to calibrate a stereo camera pair to find the X and Y offsets at which a position in the right camera's image corresponds to the same position in the left camera's when both are looking at a far-off "infinity point". Once I had that, I knew it was only a small step to use the same basic algorithm to dynamically get the two cameras "looking" at the same thing even when the subject matter is close enough for the two cameras to actually register a difference. And since the vertical offset is already calculated, I was happy to see that the algorithm, constrained to this single horizontal "rail", runs faster.

The next logical step, then, was to see if I could work out a formula for how far away the object the cameras are looking at is. This is one of the core reasons for using a pair of cameras instead of one. I looked around the web for a useful explanation or diagram. I found lots of inadequate diagrams that show the blatantly obvious, but nothing complete enough to develop a solution from.

So I decided to develop my own solution using some basic trigonometry. It took a while and a few headaches, but I finally got it down. I was actually surprised at how well it worked. I thought I should publish the method I used in detail so other developers can get past this more quickly. The following diagram graphically illustrates the concept and the math, which I explain further below.

Using pixel offset to measure distance

I suppose I'm counting on you knowing what the inputs are. If you do, skip the next paragraph. Otherwise, this may help.

Combined image of a hand from left and right cameras

The pink star represents some object that the two cameras are looking at. Let's assume the cameras are perfectly aligned with each other. That is, when the object is sufficiently far away -- say, 30 feet or more for cameras that are 2.5 inches apart -- and you blend the images from both cameras, the result looks the same as if you just looked at the left or right camera image alone. But if you stick your hand in front of the camera pair at, say, 5 feet away and look at the combined image, you see two "hands" partly overlapping. Let's say you measured the X (horizontal) offset of one version of the hand from the other as being about 20 pixels. Now, you change the code to overlap the pictures so that the right-hand one is offset by 20 pixels. The two hands perfectly overlap, and it's the background scene that's doubled up. The diagram above suggests this in the middle section, where the pink star is in different places in the left and right camera "projections". These projections are really just the images that are output. Now that you grasp the idea that the object seen by the two cameras is the same, but simply offset to different positions in each image, we can move on. Assume for now that we already have code that can measure the pixel offset I describe above.

Once I got through the math, I made a proof of concept rig to calculate distance. I simply tweaked the "factor" constant by hand until I started getting distances to things in the room that jibed with what my tape measure said. Then I went on to work the math backward so that I could enter a measured distance and have it calculate the factor, instead. I packaged that up into a calibration tool.
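To make the idea concrete, here's a minimal Python sketch of the inverse relationship between pixel offset and distance, with the camera geometry folded into the single "factor" constant I mention above. The function names are mine, and this omits the trigonometric derivation in the diagram; it's just the shape of the calculation, not the code I actually wrote.

```python
def estimate_distance(offset_px, factor):
    """Distance varies inversely with the measured pixel offset (disparity).

    `factor` bundles up the camera geometry (baseline, focal length,
    pixels-per-degree); it's found empirically via calibration.
    """
    if offset_px <= 0:
        return float("inf")  # at or beyond the infinity point
    return factor / offset_px


def calibrate_factor(measured_distance, offset_px):
    """Work the math backward: solve for the factor from one known distance."""
    return measured_distance * offset_px


# e.g., a hand measured with a tape at 5 feet shows a 20-pixel offset:
factor = calibrate_factor(5.0, 20)       # -> 100.0
print(estimate_distance(20, factor))     # -> 5.0 feet
print(estimate_distance(40, factor))     # -> 2.5 feet (closer object, bigger offset)
```

Once the factor is saved, any later offset measurement converts straight to a distance estimate.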

I expected it would work fairly well, but I was truly surprised at how accurate it is, given the cheap cameras I have and the low resolution of the images they output. I found with objects I tested from two to ten feet away, the estimated distance was within two inches of what I measured using a tape measure. That's accurate enough, in my opinion, to build a crude model of a room for purposes of navigating through it, a common task in AI for stereo vision systems.

I haven't yet seen how good this mechanism is at distinguishing the distances of objects that are very close to one another. We can easily discriminate depth offsets of a millimeter on objects within two feet. These cameras are not that good, so I doubt they'll be as competent.

So now I have a mechanism that does pretty well at taking a rectangular portion of a scene, finding the best match it can for that portion in the other eye, and using the estimated offset to calculate the distance. The next step, then, is to repeat this over the entire image and with ever smaller rectangular regions. I can already see some important challenges, like what to do when the rectangle just contains a solid color or a repeating pattern, but these seem modest complications to an otherwise fairly simple technique. Cool.

[See also Automatic Alignment of Stereo Cameras.]

Sunday, August 7, 2005

Automatic alignment of stereo cameras

[Audio Version]

I'm currently working on developing a low-level stereo vision component tentatively called "Binoculus". It builds on the DualCameras component, which provides basic access to two attached cameras. To it, Binoculus already adds calibration and will hopefully add some basic ability to segment parts of the scene by perceived depth.

For now, I've only worked on getting the images from the cameras to be calibrated so they both "point" in the same direction. The basic question here is: once the cameras point roughly in the same direction, how many horizontal and vertical pixels off is the left one from the right? I had previously pursued answering this using a somewhat complicated printed graphic and a somewhat annoying process, because I was expecting I would have to deal with spherical warping, differing camera sizes, differing colors, and so on. I've come to the conclusion that this probably won't be necessary, and that all that probably will be is getting the cameras to agree on where an "infinity point" is.

This is almost identical to the question posed by a typical camera with auto-focus, except that I have to deal with vertical alignment in addition to the typical horizontal alignment. I thought it worthwhile to describe the technique here because I have had such good success with it and it doesn't require any special tools or machine intelligence.

We begin with a premise that if you take the images from the left and right cameras and subtract them, pixel for pixel, the closer the two images are to pointing at the same thing, the lower will be the sum of all pixel differences. To see what I mean, consider the following figure, which shows four versions of the same pair of images with their pixel values subtracted out:

Subtracting two images at different overlap offsets

From left to right, each shows the difference between the two images as they get closer to best alignment. See how they get progressively darker? As we survey each combined pixel, we're adding up the combined difference of red, green, and blue values. The ideal match would have a difference value of zero. The worst case would have a difference value of Width * Height * 3 * 255.
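The difference measure itself is simple to state in code. Here's a Python sketch (my real implementation is against raw camera bitmaps; the list-of-rows-of-(r, g, b)-tuples representation here is purely for illustration):

```python
def image_difference(left, right):
    """Sum of absolute per-channel differences between two equal-size images.

    Images are lists of rows; each pixel is an (r, g, b) tuple of 0-255 values.
    Identical images score 0; the worst case is width * height * 3 * 255.
    """
    total = 0
    for row_l, row_r in zip(left, right):
        for (r1, g1, b1), (r2, g2, b2) in zip(row_l, row_r):
            total += abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
    return total


img = [[(10, 20, 30), (0, 0, 0)]]
print(image_difference(img, img))                      # -> 0 (perfect alignment)
print(image_difference(img, [[(0, 0, 0), (0, 0, 0)]])) # -> 60
```

The lower this sum, the darker the subtracted image looks, which is exactly the progressive darkening in the figure above.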

Now let's start with the assumption that we have the cameras perfectly aligned, vertically, so we only have to adjust the horizontal alignment. We start by aiming our camera pair at some distant scenery. My algorithm then takes a rectangular subsection of the eyes' images - about 2/5 of the total width and height - from the very center of the left eye. For the right eye, it takes another sample rectangle of the same exact size and moves it from the far left to the far right in small increments (e.g., 4 pixels). The following figure shows the difference values calculated for different horizontal offsets:

Notice how there's a very clear downward spike in one part of the graph? At the very tip of that is the lowest difference value and hence the horizontal offset for the right-hand sample box. That offset is, more generally, the horizontal offset for the two cameras and can be used as the standard against which to estimate distances to objects from now on.

As a side note, you may notice that there is a somewhat higher sample density near the point where the best match is. That's a simple optimization I added in to speed up processing. With each iteration, we take the best offset position calculated previously and have a gradually higher density of tests around that point, on the assumption that it will still be near there with the next iteration. Near the previous guessed position, we're moving our sampling rectangle over one pixel at a time, whereas we're moving it about 10 pixels at the periphery.
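The variable sample density can be sketched like so. This is my own illustrative Python, not the production code; the step sizes (1 pixel near the previous best, roughly 10 at the periphery) are the ones mentioned above.

```python
def candidate_offsets(width, prev_best, near=8, fine=1, coarse=10):
    """Offsets to test: every pixel within `near` of the previous best
    guess, every `coarse` pixels elsewhere."""
    offsets = set(range(0, width, coarse))
    offsets.update(range(max(0, prev_best - near),
                         min(width, prev_best + near + 1), fine))
    return sorted(offsets)


def best_horizontal_offset(diff_at, width, prev_best):
    """diff_at(offset) returns the image-difference score at that offset;
    the minimum marks the downward spike in the graph."""
    return min(candidate_offsets(width, prev_best), key=diff_at)


# A toy difference function whose true minimum is at offset 37:
print(best_horizontal_offset(lambda o: abs(o - 37), 100, prev_best=35))  # -> 37
```

Because the best offset rarely jumps far between iterations, most of the coarse samples exist only as a safety net for when the scene or aim changes abruptly.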

What about the vertical alignment? Technically speaking, we should probably do the same thing I've just described over a 2D web covering the entire right-hand image, moving the rectangle throughout it. That would involve a large amount of calculation. I used a cheat, however. I start with the assumption that the vertical alignment starts out pretty close to what it should be because the operator is careful about alignment. So with each calibration iteration, my algorithm starts by finding the optimal horizontal position. It then runs the same test vertically, moving the sample rectangle from top to bottom along the line prescribed by the best-fitting horizontal offset. If the outcome says the best position is below the current vertical offset value, we add one to it to push it one pixel downward. Conversely, if the best position seems to be above, we subtract one from the current offset value and so push it upward. The result is a gradual sliding up or down, whereas the horizontal offset calculated is instantly implemented. You can see the effects of this in the animation to the right. Notice how you don't see significant horizontal adjustments with each iteration, but you do see vertical ones?
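The one-pixel-per-iteration nudge is trivial, but worth stating precisely. A sketch, assuming `best_found` is the vertical position the scan just voted for:

```python
def update_vertical_offset(current, best_found):
    """Slide the vertical offset one pixel per iteration toward the
    best-matching vertical position, rather than jumping straight there."""
    if best_found > current:
        return current + 1
    if best_found < current:
        return current - 1
    return current


print(update_vertical_offset(5, 9))  # -> 6 (creeping downward)
print(update_vertical_offset(5, 2))  # -> 4 (creeping upward)
print(update_vertical_offset(5, 5))  # -> 5 (settled)
```

The damping is the whole point: a single bad vertical vote moves the offset by only one pixel, so a momentary bad horizontal fix can't fling the vertical alignment far off course.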

Why do I gradually adjust the vertical offset? When I tried letting the vertical and horizontal alignments "fly free" from moment to moment, I was getting bad results. The vertical alignment might be way off because the horizontal was way off. Then the horizontal alignment, which is along the bad vertical offset, would perform badly and the cycle of bad results would continue. This is simply because I'm using a sort of vertical cross pattern to my scanning, instead of scanning in a wider grid pattern. This tweak, however, is quite satisfactory, and seems to work well in most of my tests so far.

I wish I could tell you that this works perfectly every time, but there is one bad behavior worth noting. Watch the animation above carefully. Notice how, as the vertical adjustments occur, there is a subtle horizontal correction? Once the vertical offset is basically set, the horizontal offset switches back and forth one pixel about three times, too, before it settles down. I noticed this sort of vacillation in both the vertical and horizontal in many of my test runs. I didn't spend much time investigating the cause, but I believe it has to do with oscillations between the largely independent vertical and horizontal offset calculations. When one changes, it can cause the other to change, which in turn can cause the first to change back, ad infinitum. The solution generally appears to be to bump the camera assembly a little so it sees something that may agree with the algorithm a little better. I also found that using a sharply contrasting image, like the big, black dot I printed out, seems to work a little better than softer, more naturalistic objects like the picture frame you see above the dot.

It's also worth noting that it's possible that the vertical alignment could be so far off and the nature of the scene be such that the horizontal scanning might actually pick the wrong place to align with. In that case, the vertical offset adjustments could potentially head off in the opposite direction from what you expect. I saw this in a few odd cases, especially with dull or repeating patterned backdrops.

Finally, I did notice that there were some rare close-up scenes I tried to calibrate with in which the horizontal offset estimate was very good, but the vertical offset would move in the opposite direction from that desired. I never discovered the cause, but a minor adjustment of the cameras' direction would fix it.

When I started making this algorithm, it was to experiment with ways to segment out different objects based on distance from the camera. It quickly turned into a simple infinity-point calibration technique. What I like most about it is how basically autonomous it is. Just aim the cameras at some distant scenery, start the process, and let it go until it's satisfied that there's a consistent pair of offset values. When it's done, you can save the offset values in the registry or some other persistent storage and continue using it with subsequent sessions.

DualCameras component

[Audio Version]

I have been getting more involved in stereo, or "binocular", vision research. So far, most of my actual development efforts have been on finding a pair of cameras that will work together on my computer, an annoying challenge, to be sure. Recently, I found a good pair, so I was able to move on to the next logical step: creating an API for dealing with two cameras.

Using C#, I created a Windows control component that taps into the Windows Video Capture API and provides a very simple interface. Consumer code needs only start capturing, tell it to grab frames from time to time when it's ready, and eventually (optionally) to stop capturing. There's no question of synchronizing or worrying about a flood of events. I dubbed the component DualCameras and have made it freely available for download, including all source code and full documentation.

DualCameras component

I've already been using the component for a while now and have made some minor enhancements, but I'm happy to say it has just worked this whole time; no real bugs to speak of. It's especially nice to know how all the wacky window creation and messaging that goes on under the surface is quietly encapsulated and that the developer need not understand any of it to use the components. Just ask for a pair of images and it will wait until it has them both. Simple. I certainly can't say that of all the programs I've made.

The home page I made for the component also has advice about how to select a pair of cameras. I went through a bunch of different kinds before I found one that worked, so I thought I'd share my experience to help save others some headaches.

Saturday, July 30, 2005

Patch equivalence

[Audio Version]

As I've been dodging about among areas of machine vision, I've been searching for similarities among the possible techniques they could employ. I think I've started to see at least one important similarity. For lack of a better term, I'm calling it "patch equivalence", or "PE".

The concept begins with a deceptively simple assertion about human perception: that there are neurons (or tight groups of them) that do nothing but compare two separate "patches" of input to see if they are the same. A "patch", generally, is just a tight region of neural tissue that brings input information from a region of the total input. With one eye, for example, a patch might represent a very small region of the total image that that eye sees. For hearing, a patch might be a fraction of a second of time spent listening to sounds within a somewhat narrow band of frequencies, as another example. A "scene", here, is a contiguous string of information that is roughly continuous in space (e.g., the whole image seen by one eye in a moment) or time (e.g., a few seconds of music heard by an ear). The claim here is that for any given patch of input, there is a neuron or small group of them that is looking at that patch and at another patch of the same size and resolution, but somewhere else in the scene. Further, that neuron (group) is always looking in the same pair of places at any given time. It doesn't scan other areas of the scene; just the pair of places it knows. We'll call this neuron or small group of neurons a "patch comparator".

From an engineering perspective, the PE concept is both seductively simple and horribly frightening. If I were designing a hardware solution from scratch, I imagine it would be quite easy to implement, and could execute very quickly. When I think about a software simulation of such a machine, though, it's clear to me that it would be terribly slow to run. Imagine every pixel in the scene having a large number of patch comparators associated with it. Each one would look at a small patch - maybe 5 x 5 pixels, for instance - around that pixel and at the same size patch somewhere else in the scene. One comparator might look 20 pixels to the left, another might look 1 pixel above that, another 2 pixels above, and so on until there's a sufficient amount of coverage within a certain radius around the central patch being compared. There could literally be thousands of patch comparisons done for just one single pixel in a single snapshot. Such an algorithm would not perform very quickly, to say the least.

Let's say the output of each patch comparator is a value from 0 to 1, where 1 indicates that the two patches are, pixel for pixel, identical and 0 means they are definitively different. Any value between indicates varying degrees of similarity.
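One simple way to get that 0-to-1 output is to normalize the per-pixel difference against the maximum possible difference. This Python sketch is just one plausible realization of a patch comparator, not a claim about how neurons actually compute it:

```python
def patch_similarity(patch_a, patch_b, max_diff_per_pixel=3 * 255):
    """Compare two equal-size patches of (r, g, b) pixels.

    Returns 1.0 for pixel-for-pixel identical patches, 0.0 for maximally
    different ones, with intermediate values for partial similarity.
    """
    total, count = 0, 0
    for row_a, row_b in zip(patch_a, patch_b):
        for (r1, g1, b1), (r2, g2, b2) in zip(row_a, row_b):
            total += abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
            count += 1
    return 1.0 - total / (count * max_diff_per_pixel)


red = [[(255, 0, 0)]]
print(patch_similarity(red, red))              # -> 1.0
print(patch_similarity(red, [[(0, 255, 255)]])) # -> 0.0 (exact opposite)
```

A full comparator would be this function wired permanently to one fixed pair of patch locations, fired on every new snapshot of input.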

One might well ask what the output of such a process is. What's the point? To be honest, I'm still not entirely sure, yet. It's a bit like asking what a brain would do with edge detection output. To my knowledge, nobody really knows in much detail, yet.

Still, I can easily see how patch equivalence could be used in many facets of input processing. Consider binocular vision, for example. You've got images coming from both eyes and you generally want to match up the objects you see in each eye, in part to help you know how far each is. One patch comparator could be looking at one place in one eye and the same place in the other. Another comparator could then be looking at the same place in the left eye as before, but in a different place in the right eye, for instance. Naturally, there would be all sorts of "false positive" matches. But if we survey a bunch of comparators that are looking at the same offset and most of them are seeing matches with that offset, we would take the consensus as indicating a likelihood that we have a genuine match. We'd throw out all the other spurious matches as noise, for lack of a regional consensus.
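The consensus step can be sketched very simply: collect each comparator's best-matching offset and keep it only if enough of its regional neighbors agree. The majority threshold here is my own arbitrary choice for illustration.

```python
from collections import Counter

def consensus_offset(matches):
    """Given (position, best_offset) pairs from many comparators in one
    region, return the offset most of them agree on, or None if there is
    no regional consensus (in which case the matches are treated as noise)."""
    if not matches:
        return None
    counts = Counter(offset for _, offset in matches)
    offset, votes = counts.most_common(1)[0]
    return offset if votes >= len(matches) / 2 else None


# Three comparators vote for a 20-pixel disparity; one false positive votes 5:
print(consensus_offset([(0, 20), (1, 20), (2, 20), (3, 5)]))  # -> 20

# No agreement at all -- everything is discarded as spurious:
print(consensus_offset([(0, 1), (1, 2), (2, 3)]))             # -> None
```

The same voting scheme works for the temporal case discussed below: replace "left eye vs. right eye" with "now vs. a moment ago" and the consensus offset becomes a motion estimate rather than a depth estimate.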

Pattern detection is another example of where this technique can be used. Have you ever studied a printed table of numeric or textual data where one column contains mostly a single value (e.g., "100" or "Blue")? Perhaps it's a song playlist with a dozen songs from one album, followed by a dozen from another. You scan down the list and see the name of the first album is the same for the upper dozen songs. You don't even have to read them, because your visual system tells you they all look the same. That's pretty amazing, when you think about it. In fact, I've found I can scan down lists of hundreds of identical things looking for an exception, and can do it surprisingly quickly. It's not special to me, of course; we all can. How is it that my eyes instantly pick up the similarity and call out one item that's different? It's a repeating pattern, just like a checker board or bathroom tiles. A patch equivalence algorithm would find excellent use here. Given an offset roughly equal to the distance between the centers of two neighboring lines of text, a region of comparators would quickly come to a consensus that there's equivalence at that offset. Because it's at the same time and from the same eye, the conclusion would be that it's probably from a repeating pattern. As a side note, this doesn't sufficiently explain how we detect less regular patterns, like a table full of differently colored candies, but I suspect PE can play a role in explaining that, too.

What about motion? PE can help here, too. Imagine a layer of PE comparators that compare the image seen by one eye now with the same image seen a fraction of a second ago. A ball is moving through the scene, so the ball sits in one place in one image and perhaps a little to the right of that in the next image. Again, one region of patch comparators that sees the ball in its before and after positions lights up in consensus and thus effectively reports the position and velocity of the moving object.

I've focused on vision, but I do believe the patch equivalence concept can apply to other senses. Consider the act of listening to a song. The tempo is easily detected very quickly for most songs, and that alone can be explained by reference to PE comparators that are looking at linear patches of frequency responses at different time offsets. Or it could be looking not at low level frequency responses, but instead at recognized patterns that represent snippets of instruments at different frequencies. In fact, it may well be that we mainly come to recognize distinct sounds as distinct only because they are repeated. A comparator might be looking at one two dimensional patch that's actually made up of several frequency bands in a small snippet of time and looking at the same kind of patch at a different point in time. If it sees the same exact response in both moments, this fact could result in saving that patch's pattern in short-term memory for later. More repetitions could continue to reinforce this pattern until it's saved for longer term recollection.

This same principle of selecting patch patterns that repeat in space or time provides a strong explanation of how patterns would come to be considered important enough to remember. This is a rather hard problem in AI right now, in large part because selecting important features seems to presuppose the idea that you can find punctuations between features -- an a priori definition of "important" -- like pauses between words or empty spaces around objects in a scene. Using PE, this may not even be necessary, and it potentially provides a more amorphous conception of what a boundary really is.

Tuesday, July 12, 2005

Machine vision: motion-based segmentation

[Audio Version]

I've been experimenting, with limited success, with different ways of finding objects in images using what some vision researchers would call "preattentive" techniques, meaning not involving special knowledge of the nature of the objects to be seen. The work is frustrating in large part because of how confounding real-world images can be to simple analyses and because it's hard to nail down exactly what the goals for a preattentive-level vision system should be. In machine vision circles, this is generally called "segmentation", and usually refers more specifically to segmentation of regions of color, texture, or depth.

Jeff Hawkins (On Intelligence) would say that there's a general-purpose "cortical algorithm" that starts out naive and simply learns to predict how pixel patterns will change from moment to moment. Appealingly simple as that sounds, I find it nearly impossible to square with all I've been learning about the human visual system. From all the literature I've been trying to absorb, it's become quite clear that we still don't know much at all about the mechanisms of human vision. We have a wealth of tidbits of knowledge, but still no comprehensive theory that can be tested by emulation in computers. And it's equally clear nobody in the machine vision realm has found an alternative pathway to general purpose vision, either.

Segmentation seems a practical research goal for now. There has already been quite a bit of research into segmentation based on edges, on smoothly continuous color areas, on textures, and based on binocular disparity. I'm choosing to pursue something I can't seem to find literature on: segmentation of "layers" in animated, three dimensional scenes. Donald D. Hoffman (Visual Intelligence) makes the very strong point that our eyes favor "generic views". If we see two lines meeting at a point in a line drawing, we'll interpret the scene as representing two lines that meet at a point in 3D space, for example. The lines could be interpreted as having their endpoints coincidentally meeting, even though in the Z axis, they may be very far apart, but the concept of generic views says that that sort of coincidence would be so statistically unlikely that we can assume it just doesn't happen.

The principle of generic views seems to apply in animations as well. Picture yourself walking along a path through a park. Things around you are not moving much. Imagine you take a picture once for every step you take in which the center of the picture is always fixed on some distant point and you are keeping the camera level. Later, you study the sequence of pictures. For each pair of adjacent pictures in the sequence, you visually notice that very little seems to change. Yet when you inspect each pixel of the image, a great many of them do change color. You wonder why, but you quickly realize what's happening is that the color of one pixel in the before picture has more or less moved to another location in the after picture. As you study more images in the sequence, you notice a consistent pattern emerging. Near the center point in each image, the pixels don't move very much from frame to frame and the ones farther from the center tend to move in ever larger increments and almost always in a direction that radiates away from the center point.

You're tempted to conclude that you could create a simple algorithm to track the components of sequences captured in this way by simply "smearing" the previous image's pixels outward using a fairly simple mathematical equation based on each pixel's position with respect to the center, but something about the math doesn't seem to work out quite right. With more observation, you notice that trees and rocks alongside the path that are nearer to you than, say, the bushes behind them act a little differently. Their pixels move outward slightly faster than those of the bushes behind them. In fact, the closer an object is to you as you pass it, the faster its pixels seem to morph their way outward. The pixels in the far off hills and sky don't move much at all, for example.

At one point during the walk, you took a 90° left turn in the path and began fixating the camera on a new point. The turn took about 40 frames. In that time, the camera lost its fixed central point, but the intermediate frames behave in the same sort of way. This time, though, instead of smearing radially outward from a central point, the pixels appear to be shoved rapidly to the right of the field of view. It's almost as though there were a very large bitmap image, of which we could see only a small rectangle that was sliding over that larger image.

By now, I hope I've impressed on you the idea that in a video stream of typical events in life, much of what happens from frame to frame is a subtle shifting of regions of pixels. Although I've been struggling lately to figure out an effective algorithm to take advantage of this, I am fairly convinced this is likely one of the basic operations going on in our own visual systems. And even if it's not, it seems to be a very valuable technique to employ in pursuit of general-purpose machine vision. There seem to be at least two significant benefits to be gained from applying this principle: segmentation and suppression of uninteresting stimuli.

Figure: Dalmatian hidden in an ambiguous scene.

Consider segmentation. You've probably seen a variant of the "hidden dalmatian" image shown here, in which the information is ambiguous enough that you have to look rather carefully to grasp what you are looking at. What makes such an illusion all the more fascinating is when it starts out with an even more ambiguous still image that then breaks into animation as the dog walks. The dog jumps right out of ambiguity. (Unfortunately, I couldn't find a video of it online to show.) I'm convinced that the reason the animated version is so much easier to process is that the dog as a whole and its parts move consistently along their own paths from moment to moment while the background moves along its own path, and we see the regions as separate. What's more, I'm confident we also instantly grasp that the dog is in front of the background and not the other way around, because we see parts of the background disappearing behind parts of the dog, while the dog's parts never get occluded by the background's.

Motion-based segmentation of this sort seems more computationally complicated than, say, using just edges or color regions, but it carries with it this very powerful value of clearly placing layers in front of or behind one another. What's more, it seems it should be fairly straightforward to take parts that get covered up in subsequent frames and others that get revealed to actually build more complete images of parts of a scene that are occasionally covered by other things.

For another way of looking at why motion-based segmentation of this sort is special, consider that it lets something that might otherwise be very hard to segment out using current techniques, such as a child against a graffiti-covered wall, stand out in a striking fashion as it moves in some way different from its background.

Now consider suppression of uninteresting stimuli. In humans, our gaze is generally drawn to rapid or sudden motions in our fields of view. It's easy to see this by standing in a field as birds fly about, or on a busy street. What's more, unexpected rapid motions appearing in even the farthest periphery of your visual field are likely to draw your attention away from otherwise static views in front of you. If you wanted to implement this in a computer, it would be pretty easy if the camera were stationary. You simply make it so each pixel slowly gets used to the ambient color and gets painted black; only pixels that vary dramatically from the ambient color get painted some other color. Then you would use fairly conventional techniques to measure the size and central position of such moving blobs. But what if the camera were in the front windshield, watching ahead as you drive? If you could identify the different segments that are moving in their own ways, you could probably fairly quickly get around to ignoring the ambient background. Things like a car changing lanes in front of you or a street sign passing overhead would be more likely to stand out because of their differing relative motions.
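The stationary-camera scheme just described can be sketched quite compactly: each pixel adapts toward the ambient color via a slow running average, and only pixels that depart sharply from that average survive into a motion mask. Grayscale pixels, the blend rate, and the threshold below are all my illustrative assumptions.

```python
# Minimal background-subtraction sketch: pixels "get used to" the ambient
# value, and only dramatic departures from it are kept as moving.

def update_background(background, frame, rate=0.05):
    """Slowly blend each background pixel toward the current frame."""
    return [[(1 - rate) * b + rate * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]

def moving_mask(background, frame, threshold=30):
    """1 where the frame departs sharply from the ambient value, else 0."""
    return [[1 if abs(f - b) > threshold else 0 for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]

# A static gray scene with one bright moving dot:
bg = [[50.0] * 4 for _ in range(3)]
frame = [[50, 50, 50, 50], [50, 255, 50, 50], [50, 50, 50, 50]]
mask = moving_mask(bg, frame)      # only the dot survives in the mask
bg = update_background(bg, frame)  # the dot slowly fades into ambient
```

From there, conventional blob measurement on the mask gives the size and central position of each moving thing, as described above.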

I'm in the process of trying to create algorithms to implement this concept of motion-based visual segmentation. To be honest, I'm not having much luck. This may be in part because I haven't much time to devote to it, but it's surely also because it's not easy. So far, I've experimented a little with the idea of searching the entire after image for candidate locations where a pixel in the before image might have gone, in the hope of narrowing down the possibilities by considering that pixel's neighbors' own candidate locations. Each candidate location would be expressed as an offset vector, which means that neighboring candidates' vectors can easily be compared to see how different they are from one another. When neighboring pixels all move together, they will have identical offset vectors, for instance. I haven't completed such an algorithm, though, because it's not apparent to me that this would be enough without a significant amount of crafty optimization. The number of candidates seems to be quite large, especially if all the pixels in the after image are potential candidates for the movement of each pixel in the before image.
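The candidate-search idea above can be sketched as exhaustive block matching over a small search window, with the answer reported as an offset vector. The sum-of-absolute-differences cost and the window sizes are my assumptions; the combinatorial concern the paragraph raises is exactly why a whole-image search would be impractical.

```python
# Block-matching sketch: for a small patch in the "before" frame, search a
# limited window in the "after" frame for the best match, and report the
# winning offset vector (dx, dy). Grayscale pixels assumed.

def best_offset(before, after, px, py, patch=3, search=4):
    """Offset minimizing sum-of-absolute-differences for the patch of
    `before` whose top-left corner is (px, py)."""
    h, w = len(after), len(after[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            qx, qy = px + dx, py + dy
            if not (0 <= qx <= w - patch and 0 <= qy <= h - patch):
                continue
            cost = sum(abs(before[py + j][px + i] - after[qy + j][qx + i])
                       for j in range(patch) for i in range(patch))
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best[1]

# An 8x8 frame pair in which a 3x3 bright patch moves right 2, down 1:
before = [[0] * 8 for _ in range(8)]
after = [[0] * 8 for _ in range(8)]
for j in range(3):
    for i in range(3):
        before[2 + j][2 + i] = 100
        after[3 + j][4 + i] = 100
```

When neighboring patches move together, their offset vectors come out identical, which is the consistency signal the neighboring-candidates idea would exploit.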

One other observation I've made that could improve performance: most objects that can be segmented out using this technique probably have fairly strongly defined edges around them anyway. Hence, it may make sense to assume that the pixels around one pixel will probably be in the same patch as that one unless they are along edge boundaries. Then, it's up for grabs. Conversely, it may be worthwhile to consider only edge pixels' motions. This seems like it would garner more dubious results, but it may be faster because it could require consideration of fewer pixels. One related fact: the side of an edge that lies on the nearer region should remain fairly constant, while the side on the farther region will change over time as parts of the background are occluded or revealed. This fact may help in identifying which apparent edges represent actual boundaries between foreground and background regions, and particularly in determining which side of the edge is foreground and which background.

I'm encouraged, actually, by a somewhat related technology that may apply to this problem. I suspect that this same technique is used in our own eyes for binocular vision. That is, the left and right eye images in a given moment are a lot like adjacent frames in an animation: subtly shifted versions of one another. Much hard research has gone into making practical 3D imaging systems that use two or more ordinary video cameras in a radial array around a subject, such as a person sitting in a chair. Although the basic premise seems pretty straightforward, I've often wondered how they actually figure out how a given pixel maps to another one. The constraint that pixel shifting is offset in only the horizontal direction probably helps a lot, but I strongly suspect that people with experience developing these techniques would be well placed to engineer a decent motion-based segmentation algorithm.

One feature that will probably confound the attempt to engineer a good solution is the fact that parts of an image may rotate as well as shift and change apparent size (moving forward and backward). Rotation means that the offset vectors for pixels will not simply be nearly identical, as in simple shifting. Instead, the vectors should vary subtly from pixel to pixel, rather like a color gradient in a region shifting subtly from green to blue. The same should go for changes in apparent size. The good news is that once the pixels are mapped from before to after frames, the "offset gradients" should tell a lot about the nature of the relative motion. It should, for example, be fairly straightforward to tell if rotation is occurring and find its central axis. And it should be similarly straightforward to tell whether the object is apparently getting larger or smaller and hence moving toward or away from the viewer.
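Once before/after positions are matched, the "offset gradients" can indeed be turned into rotation and scale estimates. The sketch below uses a standard least-squares similarity (Procrustes-style) fit about the centroids, which is my choice of technique rather than something proposed in the text.

```python
import math

def fit_rotation_scale(before_pts, after_pts):
    """Return (scale, angle_radians) of the best similarity transform,
    about the centroids, mapping before_pts onto after_pts."""
    n = len(before_pts)
    bcx = sum(p[0] for p in before_pts) / n
    bcy = sum(p[1] for p in before_pts) / n
    acx = sum(p[0] for p in after_pts) / n
    acy = sum(p[1] for p in after_pts) / n
    # Accumulate cross terms of the centered coordinates.
    a = b = d = 0.0
    for (bx, by), (ax, ay) in zip(before_pts, after_pts):
        ux, uy = bx - bcx, by - bcy
        vx, vy = ax - acx, ay - acy
        a += ux * vx + uy * vy   # "cosine" component
        b += ux * vy - uy * vx   # "sine" component
        d += ux * ux + uy * uy
    return math.hypot(a, b) / d, math.atan2(b, a)

# Four matched points, scaled by 2 and rotated 90 degrees:
before_pts = [(1, 0), (0, 1), (-1, 0), (0, -1)]
after_pts = [(0, 2), (-2, 0), (0, -2), (2, 0)]
scale, angle = fit_rotation_scale(before_pts, after_pts)
```

A scale above 1 suggests the object is approaching; a nonzero angle reveals rotation, with the centroid serving as an estimate of its central axis.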

Monday, June 20, 2005

Machine vision: spindles

[Audio Version]

Following is another in my series of ad hoc journal entries I've been keeping of my thoughts on machine vision.

Maybe I'm just grasping at straws, but I recently realized one can separate out a new kind of primitive visual element. Every day, we're surrounded by thin, linear structures. Power lines, picture frames, pin-striped shirts, and trunks of tall trees are all great examples of what I mean. A line drawing is often nothing more than these thin, linear structures, and most written human languages are dominated by them.

The first word that comes to mind when I think about these things is "spindles".

On one hand, it seems hard to imagine that we have some built-in way to recognize and deal with spindles as a primitive kind of shape like we might with, say, basic geometric shapes (e.g., squares and circles) or features like edges or regions. But something about them seems tempting from the perspective of machine vision goals. Spindly structures in an image are obviously at least two dimensional, technically, yet ask a human to draw them and he'll most likely just draw thin lines. They're not just edges; not simply where one surface ends and a new one begins. They have their own colors and hence thickness.

Perhaps what makes spindles interesting to me is that it seems as though one could come up with a practical way of segregating spindles out of an image that may be easier than picking out, say, broad regions based on color blobs, texture spans, or edges. Finding blobs is hard in large part because it's hard to describe in simple terms what a given blob's shape is. A few blurry pixels along an otherwise sharp edge can bring a basic region-growing technique to its knees and leave the researcher frustrated into hand-adjusting cutoff thresholds to get the results he desires.

But spindles might be easier. Characterizing and recognizing a thin structure should be easier than doing so for an arbitrarily shaped blob. Even if the spindle is curved, branching, or somewhat jagged, it may still be easier to deal with than a blob. What's more, it's possible to compare the various spindles in an image to search for patterns that might give hints about 3D structures. Look down a brick wall and you might pick out the horizontal white mortar lines as spindles, note that they all have a common vanishing point, and thus hypothesize a 3D interpretation.
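A toy illustration of how a spindle differs from an edge at the pixel level: a spindle pixel stands apart from both of its perpendicular neighbors while those neighbors match each other, whereas at an edge the two sides differ from each other. This one-dimensional scan and its threshold are purely my own sketch.

```python
# Find single-pixel-wide "spindles" in a 1-D slice of a grayscale image:
# a pixel that stands out from BOTH sides while the sides agree.

def spindle_columns(row, threshold=40):
    """Indices of thin, spindle-like pixels in a row of an image."""
    found = []
    for x in range(1, len(row) - 1):
        left, mid, right = row[x - 1], row[x], row[x + 1]
        stands_out = (abs(mid - left) > threshold and
                      abs(mid - right) > threshold)
        sides_agree = abs(left - right) <= threshold
        if stands_out and sides_agree:
            found.append(x)
    return found

# A dark power line (index 2) against uniform sky, then an ordinary
# sky-to-wall edge at index 6. Only the power line is a spindle; the
# edge's two sides differ, so it is excluded.
row = [200, 200, 20, 200, 200, 200, 90, 90, 90]
```

Note how this directly encodes the earlier observation that spindles aren't just edges: they have their own color, with comparable material on both sides.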

Spindles seem to come in two basic 3D flavors: colored edges and floating structures. The distinction, from a low-level perspective, seems to lie in whether what's on either side of a spindle is the same color or pattern. An overhead power line divides the sky, which is the same on both sides. A picture frame enhances the boundary between a picture and the wall. Perhaps the similarity of the colors and textures on either side of a spindle also provides some basic suggestions about whether a given spindle is attached to one or both sides or is otherwise free-floating. The concept of "generic views" says it's hard to imagine that a frame floating freely in space would just happen to line up exactly with a picture, so the most plausible explanation is that it's no coincidence: the frame really is in the same place as the picture. Whether it's attached to the wall or floating in space is a different question. So spindles can be helpful 3D cues.

I don't know whether to suggest that the human visual system sees spindles as a somehow separate sort of primitive, but it seems plausible. The very fact that printed characters in almost all human languages are composed of spindles seems suggestive. Maybe it's because it's economical to write in strokes instead of blobs, but maybe it's more fundamental than that. It's also interesting that we have little trouble understanding technical "some assembly required" line drawings, even when they have no color, shading, or other 3D visual cues.

Perhaps spindles provide a way to explain how it is that a line drawing of a circle can be interpreted just as easily as a hollow hoop or a solid disk. That is, perhaps spindles are considered interchangeable with edges by our visual systems. Yet perhaps it's also that spindles stand out better than edges do.

Thursday, June 16, 2005

Machine vision: smoothing out textures

[Audio Version]

Following is another in my series of ad hoc journal entries I've been keeping of my thoughts on machine vision.

While reading up more on how texture analysis has been dealt with in recent years, I realized yesterday that there may be a straightforward way to do basic region segmentation based on textures. I consider texture-based segmentation to be one of two major obstacles to stepping above the trivial level of machine vision today toward general purpose machine vision. The other regards recognizing and ignoring illumination effects.

Something struck me a few hours after reading how one researcher chose to define textures. Among other things, he made two interesting points. First, a smooth texture must, when sampled at a sufficiently low resolution, cease to be a texture and instead become a homogeneous color field. Second, the size of a texture must be at least several times larger in its dimensions (e.g., width and height) than a single textural unit. A rectangular texture composed of river rocks, for example, must be several rocks high and several rocks wide to actually be a texture.

Later, when I was trying to figure out what characteristics are worth consideration for texture-based segmentation that don't require me to engage in complicated mathematics, I remembered the concept I had been pursuing recently when I started playing with video image processing. I thought I could kill two birds with one stone by processing very low-res versions of source images: eliminating fine textures and reducing the number of pixels to process. I was disappointed, though, by the fact that low-res meant little information of value.

I realized that there was another way to get the benefit of blurring textures into smooth color fields without actually blurring the images (or lowering their resolution, which amounts to the same thing). The principle is as follows.

Imagine an image that includes a sizable sampling of a texture. Perhaps it has a brick wall in the picture with no shadows, graffiti, or other significant confounding inclusions on the wall. The core principle is that there is a circle with a smallest radius CR (critical radius) that is large enough to be "representative" of at least one unit of that texture. In this case, what determines if it is representative is whether the circle can be placed anywhere within the bounds of the texture - the wall, in our example - and the average color of all the pixels within it will be almost exactly the same as if the circle were set anywhere else in that single-texture region.

If we want to identify the brick wall's texture as standing apart from the rest of the image, then, we have to do two things in this context. One, we need to find that critical radius (CR). Two, we need to populate the wall-textured region with enough CR circles so no part is left untested, yet no CR circle extends beyond the region. The enclosed region, then, is the candidate region.

I suppose this could work with squares, too. It doesn't have to be circles, but there may be some curious symmetric effects that come into play that I'm not aware of. Let's limit the discussion to circles, though.

So how does one determine the critical radius? A single random test won't do, because we don't know in advance that the test circle actually falls within a texture without some a priori knowledge. Our goal is to discover, not just validate.

I propose a dynamically varying grid of test circles that looks for local consistencies. Picture a grid in which at each node, there is centered a circle. The circles should overlap in such a way that there are no gaps. That is, the radius should be at least half the distance between one node and the node one unit down and across from it. In the first step, the CR (radius) chosen and hence the grid spacing would be small - two pixels, for example. As the test progresses, CR might grow by a simple doubling process or by some other multiplier. The grid would cover the entire image under consideration. The process would continue upward until the CR values chosen no longer allow for a sufficient number of sample circles to be created within the image.

The result of each pass of this process would be a new "image", with one pixel per grid node in the source image. That pixel's color would be the average of the colors of all the pixels within the test circle at that node. We would then search the new image for smooth color blobs using traditional techniques. Any significantly large blobs would be considered candidates for homogeneous textures.
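One pass of this process might look like the following sketch, using square windows for simplicity (as allowed above). Each grid node becomes one pixel of the new, smaller image, colored with the average of the window around it; grayscale values and the spacing choices are my assumptions.

```python
# One pass of the grid-averaging process: average the pixels in an
# overlapping window around every grid node, producing a smaller "image"
# whose smooth blobs mark candidate homogeneous textures.

def average_pass(image, radius):
    """Average color in a (2*radius+1)-wide square window at each grid
    node, with nodes spaced `radius` apart so the windows overlap."""
    h, w = len(image), len(image[0])
    out = []
    for cy in range(radius, h - radius, radius):
        row = []
        for cx in range(radius, w - radius, radius):
            window = [image[cy + dy][cx + dx]
                      for dy in range(-radius, radius + 1)
                      for dx in range(-radius, radius + 1)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

# A 9x9 checkerboard "texture" of alternating 0 and 100 pixels:
checker = [[100 if (x + y) % 2 else 0 for x in range(9)] for y in range(9)]
tier = average_pass(checker, radius=1)
```

On the checkerboard, every value in the resulting 7x7 tier lands near 50: once the window exceeds the critical radius for this texture, the busy pattern collapses into a nearly homogeneous color field, which is exactly what makes the region a candidate blob. Repeating with a doubled radius gives the next tier up.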

I'm not entirely sure exactly how to make use of this information, but there's something intuitively satisfying about it. I've been thinking for a while now that we note the average colors of things and that that seems to be an important part of our way of categorizing and recognizing things. A recent illustration of this for me is a billboard I see on my way to work. It has a large region of cloudy sky, but the image is washed to an orangish-tan. From a distance, it looks to me just like the surface of a cheese pizza. So even though I know better, my first impression whenever I see this billboard - before I think about it - is of a cheese pizza. The pattern is obviously of sky and bears only modest resemblance to a pizza, but the overall color is very right.

Perhaps one way to use the resulting tiers of color blobs is to break down and analyze textures. Let's say I have one uniform color blob at tier N. I can look at the pixels of the N - 1, higher resolution version of this same region. One question I might ask is whether those pixels too are consistent. If so, maybe the texture is really just a smooth color region. If not, then maybe I really did capture a rough but consistent texture. I might then try to see how much variation there is in that higher resolution level. Maybe I can identify the two or three most prominent colors. In my sky-as-cheese-pizza example, it's clear that I see the dusty orange and white blobs collectively as appearing pizza-like; it's not just the average of the two colors. I could also use other conventional texture analysis techniques like co-occurrence matrices. Once I have the smoothness point (resolution) for a given color blob, I can perhaps double or quadruple the resolution to get it sufficiently rough for single-pixel-distances common in such analysis instead of having resolutions so high that such techniques don't work well.
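The "two or three most prominent colors" step could start as simply as tallying the colors in the higher-resolution region. A real version would need to bucket similar colors together first; the exact-match counting below is a simplifying assumption.

```python
from collections import Counter

def prominent_colors(region_pixels, k=2):
    """The k most frequent colors in a region, most frequent first."""
    return [color for color, _ in Counter(region_pixels).most_common(k)]

# The pizza-billboard intuition: dusty orange and white blobs dominate
# the region, and it is that pair, not their average, that reads "pizza".
pixels = ["orange"] * 6 + ["white"] * 3 + ["gray"] * 1
```

Here `prominent_colors(pixels)` yields `["orange", "white"]`, the two-color summary that the average alone would have blended away.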

Critics will be quick to point out that all I'm capturing in this algorithm is the ambient color of a texture. I might have a picture of tightly packed oak trees adjacent to tightly packed pine trees. The ambient color of the two kinds of trees' foliage might be identical, and so I would see them as a single grouping. To that I say the criticism is valid, but probably irrelevant. I think it's reasonable to hypothesize that our own eyes probably deal with ambient texture color "before" they get into details like discriminating patterns. Further, I think a system that can successfully discriminate purely based on ambient texture color would probably be much farther ahead than the alternatives I've seen to date. That is, it seems very practical.

Besides, the math is very simple, which to me is a compelling reason for believing it's something like how human vision might work. I can imagine the co-occurrence concept playing a role, but the combinatorics for a neural network that doesn't regularly change its physical structure seem staggering. By contrast, it may take a long time for a linear processor to go through all these calculations, but the function is so simple and repetitive that it's pretty easy to imagine a few cortical layers implementing it all in parallel and getting results very quickly.

As a side note, I'm pretty well convinced that outside the fovea, our peripheral vision is doing most of its work using simple color blobs. Once we know what an object is, we just assume it's there as its color blobs move gradually around the periphery until the group of them moves out of view. It seems we track movement there, not details. The rest is just an internal model of what we assume is there. This strengthens my sense that within the fovea, there may be a more detail-oriented version of this same principle at work.

What I haven't figured out yet is how to deal with illumination effects. I suspect the same tricks that would be used for dealing with an untextured surface that has illumination effects on it would also be used on the lower resolution images generated by this technique. That is, the two problems would have to be processed in parallel. They could not be dealt with one before the other, I think.

Wednesday, June 15, 2005

Machine vision: studying surface textures

[Audio Version]

Following is another in my series of ad hoc journal entries I've been keeping of my thoughts on machine vision.

It seems that one can't escape the complexities that come with texture. Previously, I had experimented with very low resolution images because they can blur a texture into a homogeneous color blob. There's a terrible tradeoff, though. The texture smooths out while the edges get blocky and less linear. Too much information is lost. What's more, a lower resolution image will likely have a more uneven distribution of similar but differently colored pixels. A ball goes from having a texture with lots of local color similarity to a small number of pixels with unique colors.

Moreover, it's a struggle for me with my own excellent visual capabilities to really understand what's in such low resolution images. It can't be that good of a technique if the source images aren't intelligible to human eyes.

I think I will have to revisit the subject of studying textures. An appropriate venue would be a scene with a simple white or black backdrop and uniform-texture objects moving around in close proximity to the video camera. Objects might include a tennis ball, various rocks, pieces of fabric, plastic sheets, and so on. The goal would be to get an agent to "understand" such textures. One critical aspect of understanding would be that it could later identify a texture it has studied. The moving around of an object with a given texture is important. It's not enough to use a still image of a texture to really understand it. Textured surfaces tend to have wide variation in their appearances as they are moved about and reshaped. To recognize a texture requires that it be abstracted in a way that can overcome such variations.

Friday, June 10, 2005

Machine vision: pixel morphing

[Audio Version]

Following is another in my series of ad hoc journal entries I've been keeping of my thoughts on machine vision.

I'm entertaining the idea that our vision works using a sort of "pixel morphing" technique. To illustrate what I mean in the context of a computer program, imagine a black scene with a small white dot in it. We'll call this dot the "target". With each frame in time, the dot moves a little, smoothly tracing a square over the course of, say, forty frames. That means the target is on each of the four edges for ten time steps.

The target starts at the top left corner and travels rightward. The agent watching this should be able to infer that the dot seen in the first frame is the same as in the second frame, even though it has moved, say, 50 pixels away. Let's take this "magic" step as a given. The agent hence infers that the target is moving at a rate of 50 pixels per step. In the third frame, it expects the target to be 50 pixels further to the right and looks for it there.

Eventually, the target reaches the right edge of the square and starts traversing downward along that edge. Our agent is expecting the target to be 50 pixels to the right in the next step and so looks for it there. It doesn't find it. Using an assumption that things don't usually just appear and disappear from view, the agent looks around for the target until it finds it. It now has a new estimate of where it will be in the next frame: 50 pixels below its position in the current frame.
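The predict-then-search behavior of these two paragraphs can be sketched as a tiny constant-velocity tracker: predict from the last observed velocity, check the prediction, and fall back to "re-acquiring" the target when the prediction misses (as happens at each corner of the square). The structure and names are mine.

```python
# Constant-velocity tracker sketch: predict the next position, confirm or
# re-acquire, and update the velocity estimate from what was observed.

def track(observations):
    """Given successive observed positions, log each step's prediction and
    whether it was confirmed or the target had to be re-acquired."""
    log = []
    prev, velocity = None, (0, 0)
    for pos in observations:
        if prev is not None:
            predicted = (prev[0] + velocity[0], prev[1] + velocity[1])
            if predicted == pos:
                log.append(("confirmed", predicted, pos))
            else:
                # Prediction failed; the wider search finds the target.
                log.append(("reacquired", predicted, pos))
            velocity = (pos[0] - prev[0], pos[1] - prev[1])
        prev = pos
    return log

# The target moves right at 50 px/frame, then turns the corner and moves
# down at 50 px/frame:
path = [(0, 0), (50, 0), (100, 0), (100, 50), (100, 100)]
```

Running `track(path)` shows a re-acquisition at the very first step (no velocity yet), a confirmation while the motion is steady, another re-acquisition at the corner, and a confirmation once the new downward velocity is learned.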

Now, since the target is the only thing breaking up the black backdrop, it leaves something ambiguous. Is the target moving or is the scene moving, as might happen if a robot were falling over? We'll prefer to assume the entire scene is moving because there's nothing to suggest otherwise. So now let's draw a solid brown square around the invisible square the target traverses. The result looks like a white ball moving around inside a brown box. Starting from the first frame, again, we magically notice the target has moved 50 pixels to the right in the second frame. The brown square has not moved. We could interpret this as the ball moving in a stationary scene or as a moving scene with the brown box moving so as to perfectly offset the scene's motion. This latter interpretation seems absurd, so we conclude the target is what is moving. Incidentally, this helps explain why one prefers to think of the world as moving while the train he is on is "stationary". The train provides the main frame of reference in the scene, unless one presses his face against the window and so only sees the outside scene.

Now to explain the magic step. The target's position has changed from frame N to frame N + 1. It's now 50 pixels to the right. How can we programmatically infer that these two things are the same? For the second frame, we don't have any way to predict that the target will be 50 pixels to the right of its original position. What should happen is that the agent should assume that objects don't just disappear. Seeing the circle is missing, it should go looking for it. One approach might be to note that now there's a "new" blob in the scene that wasn't there before. The two are roughly the same size and color, so it seems reasonable to assume they are the same and to go from there. It becomes a collaboration between the interpretations of two separate frames.

But there's still some magic behind this approach. We simplified the world by just having flat-colored objects like the brown rectangle and the white circle. There are no shadows or other lighting effects and only a small number of objects to deal with. The magic part is that we were able to identify objects before we bothered to match them up. A vision system should probably be able to do matching up of parts of a picture before recognizing objects. But how?

One answer might be to treat a single frame as though it were made of rubber. Each pixel in it can be pushed in some direction, but one should expect the neighboring pixels to move in similar directions and distances, with that expectation falling off with distance from any given pixel that moves. Imagine a picture of a 3D cube, each side a different color, rotating about an up/down axis, for example. You see the top of the cube and the sides nearest you. And the lighting is such that the front faces change color subtly as the cube rotates. Looking down the axis, the cube is rotating clockwise, which means you see the front faces moving from right to left.

Imagine the pixels around the top corner nearest you. You see color from three faces: the two front faces and the top face. Let's talk about frames 1 and 2, where frame 2 comes right after frame 1. In frame 1, the corner we're looking at is a little to the right of the center of the frame, and in frame 2, it's a little to the left of center. We want the agent considering these frames to intuit from them that the corner under consideration has moved and where it is. Now think of frame 1 as a picture made of rubber. Imagine stretching it so that it looks like frame 2. With your finger, you push the corner we're considering an inch to the left so it lines up with the same corner in frame 2. Other pixels nearby go with it. Now you do the same with the bottom corner just below it, and it's starting to look a little more like frame 2. You do the same along the edge between these two corners until the edge is pretty straight and lines up with the same edge in frame 2. And you do the same with each of the edges and corners you find in the image.

Interestingly, you can do this with frame 3, too. You can keep doing this with each frame, but eventually things "break". The left front face eventually is rotated out of view. All those pixels in that face can't be pushed anywhere that will fit. They have to be cut out and the gap closed, somehow. Likewise, a new face eventually appears on the right, and there has to be a way to explain its appearance. Still much of the scene is pretty stable in this model. Most pixels are just being pushed around.

How would such a mechanism be implemented in code? When the color of a pixel changes, the agent can't just look randomly for another pixel in the image that is the same color and claim it's the same one. Even the closest match might not be the same one. But what if each pixel were treated as a free agent that has to bargain with its nearby neighbors to come up with a consistent explanation that would, collectively, result in morphing of the sort described above? Strength in numbers would matter. Those pixels whose colors don't change would largely be non-participants. Only those that change from one frame to another would. From frame 1 to 2, pixel P changes color. In frame 1, pixel P was in color blob B1. In frame 2, P searches all the color blobs for the one whose center is closest to P that is strongly similar in color. It tries to optimize its choice on closeness in both distance and color. In the meantime, every other P that changes from frame 1 to 2 is doing the same. When it's all done, every changed-pixel P is reconsidered by reference to its neighbors. What to do next is not clear, though.

One thing that should come out of the collaborative process, though, is a kind of optimization. Once some pixels in the image have been solidly matched by morphing, they should give helpful hints to nearby pixels as to where to begin their searches as well. If pixel P has moved 50 pixels to the right and 10 down, the pixel next to P has probably also moved about 50 pixels to the right and 10 down.

In the case of the white circle moving around, it should be clear. But what if a white border were added around the brown square? The brown hole created as the white circle moves from that position to the next might result in all changed pixels P guessing that the white pixels in the border nearest the new brown hole are actually where the white circle went, but this doesn't make sense. Similarly, the new white circle in frame 2 could be thought to have come from out of the white border; again, this doesn't make sense.

One answer would be a sort of conservation of "mass" concept, where mass is the number of pixels in some color blob. The white circle in frame 2 could have come from the wall, but that would require creating a bunch of new white pixels. And the white circle in frame 1 could have disappeared into the white border, but this would require a complete loss of those pixels. Perhaps the very fact that we have a mass of pixels in one place in frame 1 and the same mass of the same color of pixels in another place in frame 2 should lead us to conclude that they are the same.
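The conservation-of-"mass" idea can be sketched as a matching rule that pairs blobs across frames only when both color and pixel count are close, so the moving white circle is explained by the white circle, not by pixels borrowed from a much more massive white border. The blob tuples and the tolerance value are my illustrative assumptions.

```python
# Match blobs across frames by color AND "mass" (pixel count), so blobs
# of wildly different size are never confused, even with equal colors.

def match_blobs(before_blobs, after_blobs, mass_tolerance=0.2):
    """Pair each before-blob (name, color, mass) with an after-blob of the
    same color and similar mass; unmatched blobs are left out, flagged as
    having appeared or disappeared."""
    matches = []
    unused = list(after_blobs)
    for name_b, color_b, mass_b in before_blobs:
        for cand in unused:
            name_a, color_a, mass_a = cand
            if (color_a == color_b and
                    abs(mass_a - mass_b) <= mass_tolerance * mass_b):
                matches.append((name_b, name_a))
                unused.remove(cand)
                break
    return matches

before = [("circle@frame1", "white", 120), ("border", "white", 2000)]
after = [("border", "white", 2000), ("circle@frame2", "white", 118)]
```

With these inputs, the circle pairs with the circle and the border with the border, despite all four blobs being white: the 120-pixel circle simply cannot be explained by creating or destroying most of a 2000-pixel border.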

There's a lot of ground to cover with this concept. I think there's value to the idea of bitmap morphing. I think a great illustration of how this could be used by our own vision is how we deal with driving. Looking forward, the whole scene is constantly changing, but only subtly. Only the occasional bird darting by or other fast-moving objects screw up the impression of a single image that's subtly morphing.