Machine vision: motion-based segmentation
I've been experimenting, with limited success, with different ways of finding objects in images using what some vision researchers would call "preattentive" techniques, meaning not involving special knowledge of the nature of the objects to be seen. The work is frustrating in large part because of how confounding real-world images can be to simple analyses and because it's hard to nail down exactly what the goals for a preattentive-level vision system should be. In machine vision circles, this is generally called "segmentation", and usually refers more specifically to segmentation of regions of color, texture, or depth.
Jeff Hawkins (On Intelligence) would say that there's a general-purpose "cortical algorithm" that starts out naive and simply learns to predict how pixel patterns will change from moment to moment. Appealingly simple as that sounds, I find it nearly impossible to square with all I've been learning about the human visual system. From all the literature I've been trying to absorb, it's become quite clear that we still don't know much at all about the mechanisms of human vision. We have a wealth of tidbits of knowledge, but still no comprehensive theory that can be tested by emulation in computers. And it's equally clear nobody in the machine vision realm has found an alternative pathway to general purpose vision, either.
Segmentation seems a practical research goal for now. There has already been quite a bit of research into segmentation based on edges, on smoothly continuous color areas, on textures, and based on binocular disparity. I'm choosing to pursue something I can't seem to find literature on: segmentation of "layers" in animated, three dimensional scenes. Donald D. Hoffman (Visual Intelligence) makes the very strong point that our eyes favor "generic views". If we see two lines meeting at a point in a line drawing, we'll interpret the scene as representing two lines that meet at a point in 3D space, for example. The lines could be interpreted as having their endpoints coincidentally meeting, even though in the Z axis, they may be very far apart, but the concept of generic views says that that sort of coincidence would be so statistically unlikely that we can assume it just doesn't happen.
The principle of generic views seems to apply in animations as well. Picture yourself walking along a path through a park. Things around you are not moving much. Imagine you take a picture once for every step you take in which the center of the picture is always fixed on some distant point and you are keeping the camera level. Later, you study the sequence of pictures. For each pair of adjacent pictures in the sequence, you visually notice that very little seems to change. Yet when you inspect each pixel of the image, a great many of them do change color. You wonder why, but you quickly realize what's happening is that the color of one pixel in the before picture has more or less moved to another location in the after picture. As you study more images in the sequence, you notice a consistent pattern emerging. Near the center point in each image, the pixels don't move very much from frame to frame and the ones farther from the center tend to move in ever larger increments and almost always in a direction that radiates away from the center point.
You're tempted to conclude that you could create a simple algorithm to track the components of sequences captured in this way by simply "smearing" the previous image's pixels outward using a fairly simple mathematical equation based on each pixel's position with respect to the center, but something about the math doesn't seem to work out quite right. With more observation, you notice that trees and rocks alongside the path that are nearer to you than, say, the bushes behind them act a little differently. Their pixels move outward slightly faster than those of the bushes behind them. In fact, the closer an object is to you as you pass it, the faster its pixels seem to morph their way outward. The pixels in the far off hills and sky don't move much at all, for example.
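The depth dependence described above falls out of basic perspective geometry. Here is a minimal sketch, assuming an idealized pinhole camera that moves straight toward the fixation point (a simplification of the walk in the example). The function name `forward_step_offset` and its parameters are hypothetical, chosen only for illustration.

```python
def forward_step_offset(x, y, depth, step):
    """Image-plane offset (dx, dy) of a static point after the camera moves forward.

    (x, y) : pixel position measured from the image center (the fixation point)
    depth  : distance of the point along the viewing axis
    step   : distance the camera moves forward, in the same units as depth

    Assumes a pinhole camera: a point at depth Z projects to x = f * X / Z,
    so after the step it projects to x * Z / (Z - step).
    """
    if depth <= step:
        raise ValueError("point would be at or behind the camera after the step")
    scale = step / (depth - step)
    # The offset points directly away from the center and grows with
    # distance from the center and with nearness of the object.
    return (x * scale, y * scale)


near_rock = forward_step_offset(100, 0, depth=5, step=1)      # 5 steps away
far_hill = forward_step_offset(100, 0, depth=1000, step=1)    # 1,000 steps away
```

Under these assumptions, a pixel 100 units right of center moves outward by 25 units if it belongs to a rock 5 steps away, but only by about 0.1 units if it belongs to a hill 1,000 steps away. That is why a single "smearing" equation based on pixel position alone cannot work: the offset also depends on each pixel's depth.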
At one point during the walk, you took a 90° left turn in the path and began fixating the camera on a new point. The turn took about 40 frames. In that time, you lost that fixed central point, but the intermediate frames seem to act in the same sort of way. This time, though, instead of smearing radially outward from a central point, the pixels appear to be shoved rapidly to the right of the field of view. It's almost as though you had a very large bitmap image and could see only a small rectangle of it at a time, with that rectangle sliding across the larger image.
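To get a feel for how large that sideways shove is, here is a rough sketch, again assuming a pinhole camera that rotates in place. The image width and field of view are made-up illustrative values, and `pan_shift_pixels` is a hypothetical helper name.

```python
import math


def pan_shift_pixels(image_width, horizontal_fov_deg, turn_per_frame_deg):
    """Approximate sideways shift, in pixels, of content near the image center
    when the camera rotates (pans) by turn_per_frame_deg between frames."""
    # Focal length expressed in pixels, derived from the assumed field of view.
    focal_px = (image_width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    return focal_px * math.tan(math.radians(turn_per_frame_deg))


# A 90 degree turn over about 40 frames is 2.25 degrees per frame.
shift = pan_shift_pixels(image_width=640, horizontal_fov_deg=60,
                         turn_per_frame_deg=90 / 40)
```

Unlike the forward-walking case, the shift caused by a pure rotation does not depend on how far away things are, which fits the impression of a window sliding over one large flat image.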
By now, I hope I've impressed on you the idea that in a video stream of typical events in life, much of what is happening from frame to frame is largely a subtle shifting of regions of pixels. Although I've been struggling lately to figure out an effective algorithm to take advantage of this, I am fairly convinced this is likely one of the basic operations that may be going on in our own visual systems. And even if it's not, it seems to be a very valuable technique to employ in pursuit of general purpose machine vision. There seem to be at least two significant benefits that can be gained from application of this principle: segmentation and suppression of uninteresting stimuli.
Consider segmentation. You've probably seen a variant of the "hidden Dalmatian" image at right, in which the information is ambiguous enough that you have to look rather carefully to grasp what you are looking at. What makes such an illusion all the more fascinating is when it starts out with an even more ambiguous still image that then breaks into animation as the dog walks. The dog jumps right out of ambiguity. (Unfortunately, I couldn't find a video of it online to show.) I'm convinced that the reason the animated version is so much easier to process is that the dog as a whole and its parts move consistently along their own paths from moment to moment while the background moves along its own path, so we see the regions as separate. What's more, I'm confident we also instantly grasp that the dog is in front of the background and not the other way around because we see parts of the background disappearing behind the parts of the dog, which don't get occluded by the background parts.
Motion-based segmentation of this sort seems more computationally complicated than, say, using just edges or color regions, but it carries with it this very powerful value of clearly placing layers in front of or behind one another. What's more, it seems it should be fairly straightforward to take parts that get covered up in subsequent frames and others that get revealed to actually build more complete images of parts of a scene that are occasionally covered by other things.
For another way of looking at why motion-based segmentation of this sort is special, consider the fact that it lets something that might otherwise be very hard to segment out using current techniques, such as a child against a graffiti-covered wall, stand out in a striking fashion as it moves in some way different from its background.
Now consider suppression of uninteresting stimuli. It seems in humans that our gaze is generally drawn to rapid or sudden motions in our fields of view. It's easy to see this by just standing around in a field as birds fly about, for instance, or on a busy street, for another. What's more, unexpected rapid motions that appear in even the farthest periphery of your visual field are likely to draw your attention away from otherwise static views in front of you. If you wanted to implement this in a computer, it would be pretty easy if the camera were stationary. You simply make it so each pixel slowly gets used to the ambient color and gets painted black. Only pixels that vary dramatically from the ambient color get painted some other color. Then you would use fairly conventional techniques to measure the size and central position of such moving blobs. But what if the camera were in the front windshield watching ahead as you drive? If you could identify the different segments that are moving in their own ways, you could probably fairly quickly get around to ignoring the ambient background. Things like a car changing lanes in front of you or a street sign passing overhead would be more likely to stand out because of their differing relative motions.
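The stationary-camera case can be sketched in a few lines. This is only an illustration, assuming grayscale frames stored as lists of rows of numbers. The function names, the adaptation `rate`, and the `threshold` are all hypothetical choices, and for simplicity every changed pixel is treated as part of a single blob.

```python
def update_background(background, frame, rate=0.05):
    """Let each background pixel slowly 'get used to' the current frame
    by nudging it a fraction of the way toward the new value."""
    return [
        [b + rate * (p - b) for b, p in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30):
    """Mark pixels that differ dramatically from the ambient background.
    False corresponds to 'painted black' in the description above."""
    return [
        [abs(p - b) > threshold for b, p in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def blob_size_and_center(mask):
    """Return (pixel count, (row, col) centroid) of the marked pixels,
    or (0, None) if nothing is marked."""
    points = [(r, c) for r, row in enumerate(mask)
              for c, on in enumerate(row) if on]
    if not points:
        return 0, None
    n = len(points)
    return n, (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)
```

A typical loop would compute `foreground_mask` for each new frame, measure the blob, and then call `update_background` so that anything that stops moving gradually fades into the background. None of this helps once the camera itself moves, which is the harder case raised above.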
I'm in the process of trying to create algorithms to implement this concept of motion-based visual segmentation. To be honest, I'm not having much luck. This may be in part because I haven't much time to devote to it, but it's surely also because it's not easy. So far, I've experimented a little with the idea of searching the entire after image for candidates where a pixel in the before image might have gone in the hopes of narrowing down the possibilities by considering that pixel's neighbors' own candidate locations. Each candidate location would be expressed as an offset vector, which means that neighboring candidates' vectors can easily be compared to see how different they are from one another. When neighboring pixels all move together, they will have identical offset vectors, for instance. I haven't completed such an algorithm, though, because it's not apparent to me that this would be enough without a significant amount of crafty optimization. The number of candidates seems to be quite large, especially if all the pixels in the after image are potential candidates for movement of each pixel in the before image.
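To make the candidate-search idea concrete, here is a minimal sketch of one common way to cut the candidate count down. It is not the unfinished algorithm described above. It makes two assumptions the description does not: candidates are limited to a small search `radius` around the original position, and each candidate is scored by comparing a small patch of neighbors rather than a single pixel. Images are grayscale lists of rows of equal size, and all names are hypothetical.

```python
def patch_cost(before, after, r, c, dr, dc, half=1):
    """Sum of absolute differences between the patch around (r, c) in `before`
    and the patch around (r + dr, c + dc) in `after`.
    Returns None if either patch falls outside the image."""
    rows, cols = len(before), len(before[0])
    total = 0
    for i in range(-half, half + 1):
        for j in range(-half, half + 1):
            r0, c0 = r + i, c + j
            r1, c1 = r0 + dr, c0 + dc
            if not (0 <= r0 < rows and 0 <= c0 < cols
                    and 0 <= r1 < rows and 0 <= c1 < cols):
                return None
            total += abs(before[r0][c0] - after[r1][c1])
    return total


def best_offset(before, after, r, c, radius=2, half=1):
    """Return the offset vector (dr, dc) within `radius` whose patch in `after`
    best matches the patch around (r, c) in `before`, or None if no
    candidate patch fits inside the image."""
    best = None
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            cost = patch_cost(before, after, r, c, dr, dc, half)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, (dr, dc))
    return None if best is None else best[1]
```

With a search radius of 2, each pixel has at most 25 candidates instead of one per pixel in the after image. The price is that motions larger than the radius are missed, and flat, featureless regions still produce many equally good candidates.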
One other observation I've made that could have an impact on improving performance is that it seems that most objects that can be segmented out using this technique probably have fairly strongly defined edges around them, anyway. Hence, it may make sense to assume that the pixels around one pixel will probably be in the same patch as that one unless they are along edge boundaries. Then, it's up for grabs. Conversely, it may be worthwhile considering only edge pixels' motions. This seems like it would garner more dubious results, but may be faster because it could require consideration of fewer pixels. One related fact is that the side of an edge which is on the nearer region should remain fairly constant, while the side in the farther region will be changing over time as parts of the background are occluded or revealed. This fact may help in identifying which apparent edges represent actual boundaries between foreground and background regions and particularly in determining which side of the edge is foreground and which background.
I'm encouraged, actually, by a somewhat related technology that might be applicable to this problem. I suspect that this same technique is used in our own eyes for binocular vision. That is, the left and right eye images in a given moment are a lot like adjacent frames in an animation: subtly shifted versions of one another. Much hard research has gone into making practical 3D imaging systems that use two or more ordinary video cameras in a radial array around a subject, such as a person sitting in a chair. Although the basic premise seems pretty straightforward, I've often wondered how they actually figure out how a given pixel maps to another one. The constraint of pixel shifting being offset in only the horizontal direction probably helps a lot, but I strongly suspect that people with experience developing these techniques would be well placed to engineer a decent motion-based segmentation algorithm.
One feature that will probably confound the attempt to engineer a good solution is the fact that parts of an image may rotate as well as shift and change apparent size (moving forward and backward). Rotation means that the offset vectors for pixels will not simply be nearly identical, as in simple shifting. It should mean that the vectors vary subtly from pixel to pixel, rather like a color gradient in a region shifting subtly from green to blue. The same should go for changes in apparent size. The good news is that once the pixels are mapped from before to after frames, the "offset gradients" should tell a lot about the nature of the relative motion. It should, for example, be fairly straightforward to tell if rotation is occurring and find its central axis. And it should be similarly straightforward to tell if the object is apparently getting larger or smaller and hence moving towards or away from the viewer.
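One way to read rotation and size change out of such "offset gradients" is to fit the offset vectors of a region to a simple motion model. The sketch below assumes the region moves as a rigid flat patch: a shift plus an in-plane rotation plus a uniform scale. That is a simplification, and the name `fit_rotation_and_scale` is hypothetical. Coordinates follow the usual math convention, so with image rows increasing downward the sign of the angle flips.

```python
import math


def fit_rotation_and_scale(points, offsets):
    """Least-squares fit of offset vectors to shift + rotation + uniform scale.

    points  : list of (x, y) positions in the before frame
    offsets : list of (dx, dy) offset vectors for those positions
    Model   : dx = a*x - b*y + tx,  dy = b*x + a*y + ty
              where 1 + a = scale*cos(angle) and b = scale*sin(angle).
    Returns (scale, angle_in_radians, fixed_point); fixed_point is the (x, y)
    that does not move (the rotation/zoom center), or None for a pure shift.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    mu = sum(dx for dx, _ in offsets) / n
    mv = sum(dy for _, dy in offsets) / n

    num_a = num_b = den = 0.0
    for (x, y), (dx, dy) in zip(points, offsets):
        xc, yc, uc, vc = x - mx, y - my, dx - mu, dy - mv
        num_a += xc * uc + yc * vc   # how offsets grow along the position
        num_b += xc * vc - yc * uc   # how offsets grow across the position
        den += xc * xc + yc * yc
    if den == 0:
        raise ValueError("need at least two distinct points")
    a, b = num_a / den, num_b / den
    tx = mu - a * mx + b * my
    ty = mv - b * mx - a * my

    scale = math.hypot(1 + a, b)
    angle = math.atan2(b, 1 + a)
    det = a * a + b * b
    fixed = None if det < 1e-12 else (-(a * tx + b * ty) / det,
                                      (b * tx - a * ty) / det)
    return scale, angle, fixed
```

A scale above 1 suggests the region is growing and so approaching the viewer, a nonzero angle suggests rotation, and the fixed point estimates the center the region turns or zooms about. Real offset fields are noisy, so in practice the fit would be only as good as the pixel mapping that feeds it.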