Automatic alignment of stereo cameras
I'm currently working on developing a low-level stereo vision component tentatively called "Binoculus". It builds on the DualCameras component, which provides basic access to two attached cameras. Binoculus already adds calibration on top of that and will hopefully add some basic ability to segment parts of the scene by perceived depth.
For now, I've only worked on getting the images from the cameras to be calibrated so they both "point" in the same direction. The basic question here is: once the cameras point roughly in the same direction, how many horizontal and vertical pixels off is the left one from the right? I had previously pursued answering this using a somewhat complicated printed graphic and a somewhat annoying process, because I was expecting I would have to deal with spherical warping, differing camera sizes, differing colors, and so on. I've come to the conclusion that this probably won't be necessary, and that all I probably need is to get the cameras to agree on where an "infinity point" is.
This is almost identical to the question posed by a typical camera with auto-focus, except that I have to deal with vertical alignment in addition to the typical horizontal alignment. I thought it worthwhile to describe the technique here because I have had such good success with it and it doesn't require any special tools or machine intelligence.
We begin with the premise that if you take the images from the left and right cameras and subtract them, pixel for pixel, then the closer the two images are to pointing at the same thing, the lower the sum of all pixel differences will be. To see what I mean, consider the following figure, which shows four versions of the same pair of images with their pixel values subtracted out:
From left to right, each shows the difference between the two images as they get closer to best alignment. See how they get progressively darker? As we survey each combined pixel, we're adding up the combined difference of red, green, and blue values. The ideal match would have a difference value of zero. The worst case would have a difference value of Width * Height * 3 * 255.
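As a rough illustration, the difference value might be computed like the sketch below. The image layout (a list of rows of (r, g, b) tuples), the function names, and the use of absolute differences per channel are my assumptions for the example, not the actual Binoculus code.

```python
def difference_value(left, right):
    """Sum of absolute R, G, B differences between two same-sized images.

    Assumed layout: each image is a list of rows, and each row is a list
    of (r, g, b) tuples with channel values in 0..255.
    """
    total = 0
    for left_row, right_row in zip(left, right):
        for (lr, lg, lb), (rr, rg, rb) in zip(left_row, right_row):
            total += abs(lr - rr) + abs(lg - rg) + abs(lb - rb)
    return total


def worst_case(width, height):
    """Largest possible difference value: every channel off by 255."""
    return width * height * 3 * 255


# Two tiny 4 x 2 test images: one all black, one all white.
black = [[(0, 0, 0)] * 4 for _ in range(2)]
white = [[(255, 255, 255)] * 4 for _ in range(2)]

same = difference_value(black, black)      # identical images: ideal match
opposite = difference_value(black, white)  # every channel differs by 255
print(same, opposite, worst_case(4, 2))
```

Identical images score zero, and the black-versus-white pair hits the Width * Height * 3 * 255 ceiling.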
Now let's start with the assumption that we have the cameras perfectly aligned vertically, so we only have to adjust the horizontal alignment. We start by aiming our camera pair at some distant scenery. My algorithm then takes a rectangular subsection, about 2/5 of the total width and height, from the very center of the left eye's image. For the right eye, it takes another sample rectangle of exactly the same size and moves it from the far left to the far right in small increments (e.g., 4 pixels). The following figure shows the difference values calculated for different horizontal offsets:
Notice how there's a very clear downward spike in one part of the graph? At the very tip of that is the lowest difference value and hence the horizontal offset for the right-hand sample box. That offset is, more generally, the horizontal offset for the two cameras and can be used as the standard against which to estimate distances to objects from now on.
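Here is a minimal sketch of that horizontal scan. To keep it short it uses grayscale images (lists of rows of integers) rather than RGB, and every name in it is hypothetical. The demo at the bottom crops two views from one random "scene" so the right answer is known in advance.

```python
import random


def window_difference(left, right, lx, ly, rx, ry, w, h):
    """Sum of absolute differences between a w x h window in each image."""
    total = 0
    for dy in range(h):
        left_row = left[ly + dy]
        right_row = right[ry + dy]
        for dx in range(w):
            total += abs(left_row[lx + dx] - right_row[rx + dx])
    return total


def best_horizontal_offset(left, right, step=4):
    """Slide a window across the right image; return the best x offset.

    The offset is the right window's x position minus the left window's.
    """
    height, width = len(left), len(left[0])
    w, h = width * 2 // 5, height * 2 // 5        # about 2/5 of each dimension
    lx, ly = (width - w) // 2, (height - h) // 2  # centered in the left image
    best_diff, best_offset = None, 0
    for rx in range(0, width - w + 1, step):
        diff = window_difference(left, right, lx, ly, rx, ly, w, h)
        if best_diff is None or diff < best_diff:
            best_diff, best_offset = diff, rx - lx
    return best_offset


# Synthetic check: two 40 x 20 views of one random scene, cropped 3 pixels apart.
rng = random.Random(0)
scene = [[rng.randrange(256) for _ in range(60)] for _ in range(20)]
left_view = [row[10:50] for row in scene]
right_view = [row[7:47] for row in scene]

# step=1 here so the scan can land exactly on the 3-pixel shift.
offset = best_horizontal_offset(left_view, right_view, step=1)
print(offset)
```

With a coarse step such as 4 the scan can miss the exact minimum by a pixel or two, which is one reason a finer pass near the best guess is useful.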
As a side note, you may notice that there is a somewhat higher sample density near the point where the best match is. That's a simple optimization I added in to speed up processing. With each iteration, we take the best offset position calculated previously and have a gradually higher density of tests around that point, on the assumption that it will still be near there with the next iteration. Near the previous guessed position, we're moving our sampling rectangle over one pixel at a time, whereas we're moving it about 10 pixels at the periphery.
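One way to generate that kind of uneven sampling is sketched below. The `ramp` parameter and the exact way the step grows with distance are assumptions I made for the example; the only things taken from the description above are the one-pixel step near the previous best and the roughly ten-pixel step at the periphery.

```python
def sample_positions(lo, hi, previous_best, fine=1, coarse=10, ramp=0.25):
    """Return sorted positions in [lo, hi], densest around previous_best.

    The step starts at `fine` next to the previous best offset and grows
    with distance (by `ramp` pixels per pixel) until it reaches `coarse`.
    """
    previous_best = max(lo, min(hi, previous_best))  # keep the seed in range
    positions = {previous_best}
    for direction in (-1, 1):          # walk left, then right
        x = previous_best
        while True:
            distance = abs(x - previous_best)
            step = min(coarse, fine + int(distance * ramp))
            x += direction * step
            if x < lo or x > hi:
                break
            positions.add(x)
    return sorted(positions)


positions = sample_positions(0, 200, previous_best=100)
gaps = [b - a for a, b in zip(positions, positions[1:])]
print(len(positions), min(gaps), max(gaps))
```

In this example the scan visits far fewer than the 201 possible positions while still testing every pixel right around the previous best.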
What about the vertical alignment? Technically speaking, we should probably do the same thing I've just described over a 2D web covering the entire right-hand image, moving the rectangle throughout it. That would involve a high amount of calculation. I used a cheat, however. I start with the assumption that the vertical alignment starts out pretty close to what it should be because the operator is careful about alignment. So with each calibration iteration, my algorithm starts by finding the optimal horizontal position. It then runs the same test vertically, moving the sample rectangle from top to bottom along the line prescribed by the best-fitting horizontal offset. If the outcome says the best position is below the current vertical offset value, we add one to it to push it one pixel downward. Conversely, if the best position seems to be above, we subtract one from the current offset value and so push it upward. The result is a gradual sliding up or down, whereas the calculated horizontal offset is applied instantly. You can see the effects of this in the animation to the right. Notice how you don't see significant horizontal adjustments with each iteration, but you do see vertical ones?
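The per-iteration update rule can be sketched as follows. The two scan functions are passed in as parameters, and the ones used in the demo are fake stand-ins that always report the same "true" offsets, so this only illustrates the update logic, not the scanning itself. All names are hypothetical.

```python
def calibration_step(vertical, find_best_horizontal, find_best_vertical):
    """One calibration iteration.

    The horizontal offset is adopted immediately; the vertical offset is
    only nudged one pixel toward wherever the vertical scan points.
    Larger vertical values mean "further down".
    """
    horizontal = find_best_horizontal(vertical)   # scan along current row
    best_vertical = find_best_vertical(horizontal)  # scan along that column
    if best_vertical > vertical:
        vertical += 1   # best match is below: slide down one pixel
    elif best_vertical < vertical:
        vertical -= 1   # best match is above: slide up one pixel
    return horizontal, vertical


# Stand-in scanners for the demo: pretend the true offsets are (12, 5).
TRUE_H, TRUE_V = 12, 5


def fake_horizontal_scan(vertical):
    return TRUE_H


def fake_vertical_scan(horizontal):
    return TRUE_V


vertical = 0
trace = []
for _ in range(7):
    horizontal, vertical = calibration_step(
        vertical, fake_horizontal_scan, fake_vertical_scan)
    trace.append((horizontal, vertical))
print(trace)
```

The horizontal value is right from the first iteration, while the vertical value climbs one pixel per iteration until it reaches the target and then stays put.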
Why do I gradually adjust the vertical offset? When I tried letting the vertical and horizontal alignments "fly free" from moment to moment, I was getting bad results. The vertical alignment might be way off because the horizontal was way off. Then the horizontal alignment, which is along the bad vertical offset, would perform badly and the cycle of bad results would continue. This is simply because I'm using a sort of vertical cross pattern to my scanning, instead of scanning in a wider grid pattern. This tweak, however, is quite satisfactory, and seems to work well in most of my tests so far.
I wish I could tell you that this works perfectly every time, but there is one bad behavior worth noting. Watch the animation above carefully. Notice how, as the vertical adjustments occur, there is a subtle horizontal correction? Once the vertical offset is basically set, the horizontal offset switches back and forth by one pixel about three times, too, before it settles down. I noticed this sort of vacillation in both the vertical and horizontal offsets in many of my test runs. I didn't spend much time investigating the cause, but I believe it has to do with oscillations between the largely independent vertical and horizontal offset calculations. When one changes, it can cause the other to change, which in turn can cause the first to change back, ad infinitum. The solution generally appears to be to bump the camera assembly a little so it sees something that may agree with the algorithm a little better. I also found that using a sharply contrasting image, like the big, black dot I printed out, seems to work a little better than softer, more naturalistic objects like the picture frame you see above the dot.
It's also worth noting that it's possible that the vertical alignment could be so far off and the nature of the scene be such that the horizontal scanning might actually pick the wrong place to align with. In that case, the vertical offset adjustments could potentially head off in the opposite direction from what you expect. I saw this in a few odd cases, especially with dull or repeating patterned backdrops.
Finally, I did notice that there were some rare close-up scenes I tried to calibrate with in which the horizontal offset estimate was very good, but the vertical offset would move in the opposite direction from that desired. I never discovered the cause, but a minor adjustment of the cameras' direction would fix it.
When I started making this algorithm, it was to experiment with ways to segment out different objects based on distance from the camera. It quickly turned into a simple infinity-point calibration technique. What I like most about it is how basically autonomous it is. Just aim the cameras at some distant scenery, start the process, and let it go until it's satisfied that there's a consistent pair of offset values. When it's done, you can save the offset values in the registry or some other persistent storage and continue using them in subsequent sessions.
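For illustration, here is one possible way to decide that the offsets have become "consistent" and then persist them. The settling rule (the last few offset pairs stay within one pixel of each other, to tolerate the small back-and-forth described earlier), the JSON file format, and all the names are assumptions for this sketch; a registry key or any other store would do just as well.

```python
import json
import os
import tempfile


def is_settled(history, window=5, tolerance=1):
    """True if the last `window` (h, v) pairs vary by at most `tolerance`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    hs = [h for h, _ in recent]
    vs = [v for _, v in recent]
    return max(hs) - min(hs) <= tolerance and max(vs) - min(vs) <= tolerance


def save_offsets(path, horizontal, vertical):
    """Write the offset pair to a small JSON file."""
    with open(path, "w") as f:
        json.dump({"horizontal": horizontal, "vertical": vertical}, f)


def load_offsets(path):
    """Read the offset pair back as a (horizontal, vertical) tuple."""
    with open(path) as f:
        data = json.load(f)
    return data["horizontal"], data["vertical"]


# Made-up calibration history: (horizontal, vertical) after each iteration.
history = [(9, 2), (12, 3), (12, 4), (12, 5),
           (11, 5), (12, 5), (12, 5), (12, 5)]
settled = is_settled(history)

# Save to a temporary location for the demo, then read the values back.
path = os.path.join(tempfile.mkdtemp(), "offsets.json")
save_offsets(path, *history[-1])
restored = load_offsets(path)
print(settled, restored)
```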