
Friday, August 26, 2005

Introduction to machine vision

[Audio Version]

Recently, I completely forgot to mention that I published a brief introduction to machine vision (click here) on August 14th. It's tailored to people who want to better understand the subject but haven't had much experience beyond the popular media's thin portrayal of it. By contrast, much of what's written that gets into the nuts and bolts is difficult to read because it requires complex math skills or otherwise expects you to have a fairly strong background in the subject already.

I'm especially fond of demystifying subjects that look exceptionally complex. Machine vision often seems like a perfect example of pinheads making the world seem too complicated and their work more impressive than it really is. Sometimes it comes down to pure hucksterism, as cheap tricks and carefully groomed examples are employed in pursuit of funding or publicity. Then again, there's an awful lot of very good and creative work out there. It's fun to show that much of the subject can be approachable even to novice programmers and non-programmers.

I spent a few months putting the introduction together. I'm not entirely happy with the final result, as I imagined it would have a much broader scope. Ultimately, a lack of sufficient time to devote to it meant I had to leave out interesting applications of the basics like optical character recognition (OCR) and face recognition for fear that it would never be done.

I am, however, starting work on a less ambitious project to address more esoteric topics in machine vision. I should begin publishing drafts of early material within it very soon.

Sunday, August 14, 2005

Bob Mottram, crafty fellow

[Audio Version]

I sometimes use my rickety platform here to review new technologies and web sites, but I haven't done enough to give kudos to the unusual people in AI who dot the world and sometimes find their way online. Bob Mottram is one such person who deserves mention.

Bob Mottram with his creation, Rodney

Who is Bob Mottram? He's a 33-ish year old British programmer who has found a keen interest in the field of Artificial Intelligence. He seems to be fairly well read on a variety of studies and technologies that are around. What starts to make him stand out is his active participation in the efforts. Like me, he finds that many of the documents out there that describe AI technologies sound tantalizingly detailed, but are actually very opaque when it comes to the details. Unlike most, however, he takes this simply as a challenge to surpass. He designs and codes and experiments until his results start to look like what is described in the literature.

The next thing that sets Mottram apart is his willingness to step outside the bounds of simply duplicating other people's work. He applies what he learns and hypothesizes about new ways of solving problems, going so far as to envision tackling the high goal of duplicating the inner workings of the brain in software.

Perhaps what really sets Bob Mottram apart, for me, is his willingness to take his work public. His web site is chock-full not only of listings of projects he's worked on, but also of keen and easy-to-read insights on what he's learned along the way. He also has the venerable habit of peppering his material with links to related content as background and credit.

Mottram's web site has a fascinating smattering of content about various projects he's worked on. The one that first got my attention was his "Rodney" project. Named after Rodney Brooks, creator of the famous Genghis and Cog robots, Rodney is Mottram's low-budget answer to Cog.

Rodney the robot

Through a set of successive iterations, Mottram has built Rodney into an ever more sophisticated piece of hardware but, more importantly, has continued to experiment with a variety of different sensing and control techniques. His project web site documents many of these experiments. He also makes available much of his source code.

What got my attention in the first place was his page on Rodney's vision system. Do a Google search on "robot stereo vision" or a variety of related terms and you're likely to find Bob Mottram's page on his research. It's not necessarily that his work is really groundbreaking; it's just that he's one of the only people to really document his work. As I was doing background research for an upcoming introduction to machine vision, I found his site over and over again in relation to certain kinds of techniques he's implemented and documented.

Seeing the general utility of the vision system he was creating for Rodney, Mottram moved on to his Sentience project. The primary goal was to extract and make open-source a software component that can use input from two cameras to construct a 3D model of what the eyes see.

Other Mottram stuff

Mottram's web site includes plenty of other interesting and arcane experiments. Many are whimsical applications of his experiments with stereo vision and with detecting motion and change in images, like a Space Invaders-type game where the player's image is transposed with the aliens, or a program that detects people moving within a stationary webcam's field of view. Some delve deeper into new research, like his face detection and imitation work or his Robocore project.

Finally, Mottram has his very own blog. It's not specifically for AI, but does include various insights into the subject from time to time.

In all, I give Bob Mottram a good heap of credit for being a crafty fellow who is sincere in his belief in and pursuit of the goals of Artificial Intelligence. And he gets major kudos for sharing his work online for geeks like me. Do check out his web site.

Thursday, August 11, 2005

Stereo vision: measuring object distance using pixel offset

[Audio Version]

I've had some scraps of time here and there to put toward progress on my stereo vision experiments. My previous blog entry described how to calibrate a stereo camera pair to find the X and Y offsets at which a position in the right camera's image corresponds to the same position in the left camera's when both are looking at a far-off "infinity point". Once I had that, I knew it was only a small step to use the same basic algorithm to dynamically get the two cameras "looking" at the same thing even when the subject matter is close enough for the two cameras to actually register a difference. And since I had the vertical offset already calculated, I was happy to see that the algorithm, running along this single horizontal "rail", is faster.

The next logical step, then, was to see if I could work out a formula for how far away the thing the cameras are looking at is from them. This is one of the core reasons for using a pair of cameras instead of one. I looked around the web for some useful explanation or diagrams. I found lots of inadequate diagrams that show the blatantly obvious, but nothing complete enough to develop a solution from.

So I decided to develop my own solution using some basic trigonometry. It took a while and a few headaches, but I finally got it down. I was actually surprised at how well it worked. I thought I should publish the method I used in detail so other developers can get past this more quickly. The following diagram graphically illustrates the concept and the math, which I explain further below.

Using pixel offset to measure distance

I suppose I'm counting on you knowing what the inputs are. If you do, skip the next paragraph. Otherwise, this may help.

Combined image of a hand from left and right cameras

The pink star represents some object that the two cameras are looking at. Let's assume the cameras are perfectly aligned with each other. That is, when the object is sufficiently far away -- say, 30 feet or more for cameras that are 2.5 inches apart -- and you blend the images from both cameras, the result looks the same as if you just looked at the left or right camera image. But if you stick your hand in front of the camera pair at, say, 5 feet away and look at the combined image, you see two "hands" partly overlapping. Let's say you measured the X (horizontal) offset of one version of the hand from the other as being about 20 pixels. Now, you change the code to overlap the pictures so that the right-hand one is offset by 20 pixels. Now the two hands perfectly overlap and it's the background scene that's doubled up.

The diagram above suggests this in the middle section, where the pink star is in different places in the left and right camera "projections". These projections are really just the images that are output. Now that you grasp the idea that the object seen by the two cameras is the same, but simply offset to a different position in each image, we can move on. Assume for now that we already have code that can measure the offset in pixels as I describe above.

Once I got through the math, I made a proof of concept rig to calculate distance. I simply tweaked the "factor" constant by hand until I started getting distances to things in the room that jibed with what my tape measure said. Then I went on to work the math backward so that I could enter a measured distance and have it calculate the factor, instead. I packaged that up into a calibration tool.
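The post doesn't spell out the final formula, but for a parallel camera pair the distance falls off as the inverse of the pixel offset, so the whole calibration collapses into a single constant, much like the "factor" described above. Here's a minimal Python sketch of that idea; the function names and the numbers are my own invention, not taken from the original tool:

```python
def calibrate_factor(known_distance, measured_offset_px):
    """Work the math backward: one measurement at a known, tape-measured
    distance recovers the camera-pair constant."""
    return known_distance * measured_offset_px

def estimate_distance(offset_px, factor):
    """Distance varies as 1/offset: a nearby object shows a large pixel
    offset between the two images, a far one shows almost none."""
    if offset_px <= 0:
        return float("inf")  # at or beyond the infinity point
    return factor / offset_px

# Hypothetical numbers: a hand 5 feet away showed a 20-pixel offset.
factor = calibrate_factor(5.0, 20)      # 100.0
print(estimate_distance(20, factor))    # 5.0 feet
print(estimate_distance(10, factor))    # 10.0 feet: half the offset, twice as far
```

The inverse relationship is why accuracy degrades with range: at ten feet a whole pixel of offset error moves the estimate much further than it does at two feet.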

I expected it would work fairly well, but I was truly surprised at how accurate it is, given the cheap cameras I have and the low resolution of the images they output. I found with objects I tested from two to ten feet away, the estimated distance was within two inches of what I measured using a tape measure. That's accurate enough, in my opinion, to build a crude model of a room for purposes of navigating through it, a common task in AI for stereo vision systems.

I haven't yet seen how good this mechanism is at distinguishing the distances of objects that are very close to one another. Human eyes can easily discriminate depth offsets of a millimeter on objects within two feet. These cameras are not that good, so I doubt they'll be as competent.

So now I have a mechanism that does pretty well at taking a rectangular portion of a scene, finding the best match it can for that portion in the other eye, and using it to calculate the distance based on the estimated offset. The next step, then, is to repeat this over the entire image and with ever smaller rectangular regions. I can already see some important challenges, like what to do when the rectangle contains just a solid color or a repeating pattern, but these seem like modest complications to an otherwise fairly simple technique. Cool.

[See also Automatic Alignment of Stereo Cameras.]

Sunday, August 7, 2005

Automatic alignment of stereo cameras

[Audio Version]

I'm currently working on developing a low-level stereo vision component tentatively called "Binoculus". It builds on the DualCameras component, which provides basic access to two attached cameras. To it, Binoculus already adds calibration and will hopefully add some basic ability to segment parts of the scene by perceived depth.

For now, I've only worked on calibrating the images from the cameras so they both "point" in the same direction. The basic question here is: once the cameras point roughly in the same direction, how many horizontal and vertical pixels off is the left one from the right? I had previously pursued answering this using a somewhat complicated printed graphic and a somewhat annoying process, because I was expecting to have to deal with spherical warping, differing camera sizes, differing colors, and so on. I've come to the conclusion that this probably won't be necessary, and that all that's probably needed is getting the cameras to agree on where an "infinity point" is.

This is almost identical to the question posed by a typical camera with auto-focus, except that I have to deal with vertical alignment in addition to the typical horizontal alignment. I thought it worthwhile to describe the technique here because I have had such good success with it and it doesn't require any special tools or machine intelligence.

We begin with a premise that if you take the images from the left and right cameras and subtract them, pixel for pixel, the closer the two images are to pointing at the same thing, the lower will be the sum of all pixel differences. To see what I mean, consider the following figure, which shows four versions of the same pair of images with their pixel values subtracted out:

Subtracting two images at different overlap offsets

From left to right, each shows the difference between the two images as they get closer to best alignment. See how they get progressively darker? As we survey each combined pixel, we're adding up the combined difference of red, green, and blue values. The ideal match would have a difference value of zero. The worst case would have a difference value of Width * Height * 3 * 255.
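As a sketch of that premise (in Python rather than the author's C#, with images as simple nested lists of RGB tuples rather than real camera frames):

```python
def image_difference(left, right):
    """Sum of absolute differences between two same-sized RGB images,
    each given as rows of (r, g, b) tuples.  0 means a perfect match;
    the worst case is width * height * 3 * 255."""
    total = 0
    for row_l, row_r in zip(left, right):
        for (r1, g1, b1), (r2, g2, b2) in zip(row_l, row_r):
            total += abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
    return total

# Tiny 2x2 test images.
identical = [[(10, 20, 30)] * 2] * 2
shifted   = [[(11, 20, 28)] * 2] * 2
print(image_difference(identical, identical))  # 0
print(image_difference(identical, shifted))    # 4 pixels * (1 + 0 + 2) = 12
```

This "sum of absolute differences" score is what gets minimized in the alignment search below; the darker the subtracted image, the smaller the score.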

Now let's start with the assumption that we have the cameras perfectly aligned vertically, so we only have to adjust the horizontal alignment. We start by aiming our camera pair at some distant scenery. My algorithm then takes a rectangular subsection - about 2/5 of the total width and height - from the very center of the left eye's image. For the right eye, it takes another sample rectangle of the exact same size and moves it from the far left to the far right in small increments (e.g., 4 pixels). The following figure shows the difference values calculated for different horizontal offsets:

Notice how there's a very clear downward spike in one part of the graph? At the very tip of that is the lowest difference value and hence the horizontal offset for the right-hand sample box. That offset is, more generally, the horizontal offset for the two cameras and can be used as the standard against which to estimate distances to objects from now on.
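A toy version of that scan might look like the following Python sketch. It works on single grayscale rows instead of full rectangles, and the pixel values are hypothetical, but the structure is the same: slide a sample window across the other image and keep the offset with the smallest summed difference.

```python
def best_offset(left_row, right_row, window, step=4):
    """Take a sample window from the center of the left row, slide it
    across the right row, and return the offset (relative to the
    window's home position) with the lowest difference score."""
    start = (len(left_row) - window) // 2
    sample = left_row[start:start + window]
    best, best_diff = 0, float("inf")
    for pos in range(0, len(right_row) - window + 1, step):
        diff = sum(abs(a - b) for a, b in zip(sample, right_row[pos:pos + window]))
        if diff < best_diff:
            best, best_diff = pos - start, diff
    return best

# Hypothetical rows: the right view is the left view shifted 8 pixels.
left  = [0] * 20 + [200, 220, 210, 190] + [0] * 20
right = [0] * 12 + [200, 220, 210, 190] + [0] * 28
print(best_offset(left, right, window=8, step=1))  # -8
```

The real thing sums differences over a 2D rectangle and three color channels, but the downward spike in the graph corresponds exactly to the argmin this loop returns.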

As a side note, you may notice that there is a somewhat higher sample density near the point where the best match is. That's a simple optimization I added in to speed up processing. With each iteration, we take the best offset position calculated previously and have a gradually higher density of tests around that point, on the assumption that it will still be near there with the next iteration. Near the previous guessed position, we're moving our sampling rectangle over one pixel at a time, whereas we're moving it about 10 pixels at the periphery.
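That variable-density sampling pattern is easy to sketch on its own. Here's one Python rendering of the idea; the parameter names and defaults are mine, not the original code's:

```python
def scan_positions(width, window, prev_best, near=6, fine=1, coarse=10):
    """Positions at which to test the sample rectangle: one pixel at a
    time near last iteration's best offset, roughly ten pixels apart
    everywhere else."""
    last = width - window  # rightmost legal position
    positions = set(range(0, last + 1, coarse))
    lo = max(0, prev_best - near)
    hi = min(last, prev_best + near)
    positions.update(range(lo, hi + 1, fine))
    return sorted(positions)

# Dense from 19 to 31 (around the previous best of 25), sparse elsewhere.
print(scan_positions(width=60, window=10, prev_best=25))
```

The payoff is the one described above: most of the fine-grained work happens where the match was last time, on the bet that it hasn't moved far.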

What about the vertical alignment? Technically speaking, we should probably do the same thing I've just described over a 2D web covering the entire right-hand image, moving the rectangle throughout it. That would involve a high amount of calculation. I used a cheat, however. I start with the assumption that the vertical alignment starts out pretty close to what it should be because the operator is careful about alignment. So with each calibration iteration, my algorithm starts by finding the optimal horizontal position. It then runs the same test vertically, moving the sample rectangle from top to bottom along the line prescribed by the best-fitting horizontal offset. If the outcome says the best position is below the current vertical offset value, we add one to it to push it one pixel downward. Conversely, if the best position seems to be above, we subtract one from the current offset value and so push it upward. The result is a gradual sliding up or down, whereas the horizontal offset calculated is instantly applied. You can see the effects of this in the animation to the right. Notice how you don't see significant horizontal adjustments with each iteration, but you do see vertical ones?

Why do I gradually adjust the vertical offset? When I tried letting the vertical and horizontal alignments "fly free" from moment to moment, I was getting bad results. The vertical alignment might be way off because the horizontal was way off. Then the horizontal alignment, which is along the bad vertical offset, would perform badly and the cycle of bad results would continue. This is simply because I'm using a sort of vertical cross pattern to my scanning, instead of scanning in a wider grid pattern. This tweak, however, is quite satisfactory, and seems to work well in most of my tests so far.
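The asymmetry described above, snapping the horizontal offset while damping the vertical one, can be sketched like this in Python. The find_best_x and find_best_y callables stand in for the real image scans and are pure invention for the example:

```python
def calibrate_step(find_best_x, find_best_y, state):
    """One calibration iteration: the horizontal offset is adopted
    outright; the vertical offset is only nudged one pixel toward the
    best vertical match, so it slides gradually into place."""
    state["x"] = find_best_x(state["y"])  # scan along the current vertical offset
    best_y = find_best_y(state["x"])      # scan along the new horizontal offset
    if best_y > state["y"]:
        state["y"] += 1
    elif best_y < state["y"]:
        state["y"] -= 1
    return state

# Hypothetical stand-in scans: suppose the true offsets are (14, 3).
state = {"x": 0, "y": 0}
for _ in range(5):
    calibrate_step(lambda y: 14, lambda x: 3, state)
print(state)  # {'x': 14, 'y': 3}: x snapped immediately, y crept one pixel per pass
```

The damping is what breaks the feedback loop: a briefly wrong horizontal guess can only pull the vertical offset one pixel off course before the next iteration corrects it.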

I wish I could tell you that this works perfectly every time, but there is one bad behavior worth noting. Watch the animation above carefully. Notice how, as the vertical adjustments occur, there is a subtle horizontal correction? Once the vertical offset is basically set, the horizontal offset switches back and forth one pixel about three times, too, before it settles down. I noticed this sort of vacillation in both the vertical and the horizontal in many of my test runs. I didn't spend much time investigating the cause, but I believe it has to do with oscillations between the largely independent vertical and horizontal offset calculations. When one changes, it can cause the other to change, which in turn can cause the first to change back, ad infinitum. The solution generally appears to be to bump the camera assembly a little so it sees something that may agree with the algorithm a little better. I also found that using a sharply contrasting image, like the big, black dot I printed out, seems to work a little better than softer, more naturalistic objects like the picture frame you see above the dot.

It's also worth noting that it's possible that the vertical alignment could be so far off and the nature of the scene be such that the horizontal scanning might actually pick the wrong place to align with. In that case, the vertical offset adjustments could potentially head off in the opposite direction from what you expect. I saw this in a few odd cases, especially with dull or repeating patterned backdrops.

Finally, I did notice that there were some rare close-up scenes I tried to calibrate with in which the horizontal offset estimate was very good, but the vertical offset would move in the opposite direction from that desired. I never discovered the cause, but a minor adjustment of the cameras' direction would fix it.

When I started making this algorithm, it was to experiment with ways to segment out different objects based on distance from the camera. It quickly turned into a simple infinity-point calibration technique. What I like most about it is how basically autonomous it is. Just aim the cameras at some distant scenery, start the process, and let it go until it's satisfied that there's a consistent pair of offset values. When it's done, you can save the offset values in the registry or some other persistent storage and continue using them in subsequent sessions.

DualCameras component

[Audio Version]

I have been getting more involved in stereo, or "binocular", vision research. So far, most of my actual development efforts have been on finding a pair of cameras that will work together on my computer, an annoying challenge, to be sure. Recently, I found a good pair, so I was able to move on to the next logical step: creating an API for dealing with two cameras.

Using C#, I created a Windows control component that taps into the Windows Video Capture API and provides a very simple interface. Consumer code needs only to start capturing, grab frames from time to time when it's ready, and eventually (optionally) stop capturing. There's no question of synchronizing or worrying about a flood of events. I dubbed the component DualCameras and have made it freely available for download, including all source code and full documentation.

DualCameras component

I've already been using the component for a while now and have made some minor enhancements, but I'm happy to say it has just worked this whole time; no real bugs to speak of. It's especially nice to know that all the wacky window creation and messaging that goes on under the surface is quietly encapsulated and that the developer need not understand any of it to use the component. Just ask for a pair of images and it will wait until it has them both. Simple. I certainly can't say that of all the programs I've made.

The home page I made for the component also has advice about how to select a pair of cameras. I went through a bunch of different kinds before I found one that worked, so I thought I'd share my experience to help save others some headaches.