Machine vision: pixel morphing
Following is another in my series of ad hoc journal entries I've been keeping of my thoughts on machine vision.
I'm entertaining the idea that our vision works using a sort of "pixel morphing" technique. To illustrate what I mean in the context of a computer program, imagine a black scene with a small white dot in it. We'll call this dot the "target". With each frame in time, the target moves a little, smoothly tracing out a square over the course of, say, forty frames. That means the target is on each of the four edges for ten time steps.
The target starts at the top left corner and travels rightward. The agent watching this should be able to infer that the dot seen in the first frame is the same as in the second frame, even though it has moved, say, 50 pixels away. Let's take this "magic" step as a given. The agent hence infers that the target is moving at a rate of 50 pixels per step. In the third frame, it expects the target to be 50 pixels further to the right and looks for it there.
Eventually, the target reaches the right edge of the square and starts traversing downward along that edge. Our agent is expecting the target to be 50 pixels to the right in the next step and so looks for it there. It doesn't find it. Using an assumption that things don't usually just appear and disappear from view, the agent looks around for the target until it finds it. It now has a new estimate of where it will be in the next frame: 50 pixels below its position in the current frame.
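As a rough illustration of the predict-then-search loop just described, here is a minimal Python sketch. It is only one way to write it, under simplifying assumptions of my own: each frame is a set of lit (x, y) pixels, the target is a single pixel that is always present, and the names `locate` and `track` and the `radius` parameter are hypothetical.

```python
def locate(frame, guess, radius):
    """Return the lit pixel closest to `guess` within `radius`, or None."""
    gx, gy = guess
    best, best_d2 = None, radius * radius
    for (x, y) in frame:
        d2 = (x - gx) ** 2 + (y - gy) ** 2
        if d2 <= best_d2:
            best, best_d2 = (x, y), d2
    return best


def track(frames, radius=3):
    """Follow one dot through `frames`; each frame is a set of lit (x, y) pixels.

    Returns the path and the indices of frames where the prediction missed.
    Assumes (for this sketch) that every frame contains the dot.
    """
    pos = next(iter(frames[0]))
    velocity = (0, 0)          # no motion estimate yet
    path, misses = [pos], []
    for i, frame in enumerate(frames[1:], start=1):
        guess = (pos[0] + velocity[0], pos[1] + velocity[1])
        found = locate(frame, guess, radius)
        if found is None:
            # Things don't usually just disappear, so look everywhere.
            misses.append(i)
            found = locate(frame, guess, float("inf"))
        velocity = (found[0] - pos[0], found[1] - pos[1])
        pos = found
        path.append(pos)
    return path, misses
```

With the dot at (0, 0), (50, 0), (100, 0) and then (100, 50), the prediction misses on the very first step (the "magic" step, since there is no velocity yet) and again at the corner, where the agent has to look around and re-estimate.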
Now, since the target is the only thing breaking up the black backdrop, it leaves something ambiguous. Is the target moving or is the scene moving, as might happen if a robot were falling over? We'll prefer to assume the entire scene is moving because there's nothing to suggest otherwise. So now let's draw a solid brown square around the invisible square the target traverses. The result looks like a white ball moving around inside a brown box. Starting from the first frame, again, we magically notice the target has moved 50 pixels to the right in the second frame. The brown square has not moved. We could interpret this as the ball moving in a stationary scene or as a moving scene with the brown box moving so as to perfectly offset the scene's motion. This latter interpretation seems absurd, so we conclude the target is what is moving. Incidentally, this helps explain why one prefers to think of the world as moving while the train he is on is "stationary". The train provides the main frame of reference in the scene, unless one presses his face against the window and so only sees the outside scene.
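One way to make this frame-of-reference choice concrete is to let the largest mass of pixels define the scene's motion and treat anything that disagrees with it as a moving object. The sketch below assumes we already know each blob's displacement (the magic step again); the `split_motion` name and the (name, mass, displacement) tuples are my own illustrative choices, not an established method.

```python
from collections import Counter


def split_motion(blobs):
    """Split motion into scene motion and object motion.

    `blobs` is a list of (name, mass_in_pixels, (dx, dy)) tuples.
    The displacement backed by the most pixels is taken to be the scene's
    motion; every other blob is reported with its motion relative to that.
    """
    mass_by_motion = Counter()
    for _, mass, disp in blobs:
        mass_by_motion[disp] += mass
    scene_motion = mass_by_motion.most_common(1)[0][0]
    movers = {
        name: (disp[0] - scene_motion[0], disp[1] - scene_motion[1])
        for name, _, disp in blobs
        if disp != scene_motion
    }
    return scene_motion, movers
```

With only the dot in view, the dot is the whole scene, so the scene itself is judged to be moving. Add a large stationary brown box and the box becomes the frame of reference, leaving the dot as the one thing that moves.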
Now to explain the magic step. The target's position has changed from frame N to frame N + 1. It's now 50 pixels to the right. How can we programmatically infer that these two things are the same? For the second frame, we don't have any way to predict that the target will be 50 pixels to the right of its original position. Instead, the agent should assume that objects don't just disappear. Seeing that the circle is missing from where it was, it should go looking for it. One approach might be to note that there's now a "new" blob in the scene that wasn't there before. The missing blob and the new one are roughly the same size and color, so it seems reasonable to assume they are the same and to go from there. It becomes a collaboration between the interpretations of two separate frames.
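Here is a minimal sketch of that blob-pairing idea in Python. It assumes the blobs have already been extracted from each frame, which is itself a big assumption; the `Blob` class, the `similar` test and its tolerances are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    center: tuple   # (x, y)
    size: int       # pixel count
    color: tuple    # (r, g, b)


def similar(a, b, size_tol=0.2, color_tol=30):
    """Roughly the same size and color? Tolerances are arbitrary guesses."""
    size_ok = abs(a.size - b.size) <= size_tol * max(a.size, b.size)
    color_ok = all(abs(p - q) <= color_tol for p, q in zip(a.color, b.color))
    return size_ok and color_ok


def match_moved(prev_blobs, curr_blobs):
    """Pair blobs that vanished from the previous frame with new look-alikes."""
    missing = [b for b in prev_blobs if b not in curr_blobs]
    new = [b for b in curr_blobs if b not in prev_blobs]
    pairs = []
    for old in missing:
        candidates = [n for n in new if similar(old, n)]
        if candidates:
            # Prefer the nearest look-alike.
            best = min(
                candidates,
                key=lambda n: (n.center[0] - old.center[0]) ** 2
                + (n.center[1] - old.center[1]) ** 2,
            )
            new.remove(best)
            pairs.append((old, best))
    return pairs
```

Blobs that appear unchanged in both frames drop out of consideration, so only the "missing" and "new" blobs take part in the matching.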
But there's still some magic behind this approach. We simplified the world by just having flat-colored objects like the brown square and the white circle. There are no shadows or other lighting effects and only a small number of objects to deal with. The magic part is that we were able to identify objects before we bothered to match them up. A vision system should probably be able to match up parts of a picture before recognizing objects. But how?
One answer might be to treat a single frame as though it were made of rubber. Each pixel in it can be pushed in some direction, but one should expect the neighboring pixels to move in similar directions and distances, with that expectation falling off with distance from any given pixel that moves. Imagine a picture of a 3D cube, with each side a different color, rotating about a vertical axis, for example. You see the top of the cube and the sides nearest you. And the lighting is such that the front faces change color subtly as the cube rotates. Looking down the axis from above, the cube is rotating clockwise, which means you see the front faces moving from right to left.
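To make the rubber-sheet expectation a little more concrete, here is a small Python sketch. The Gaussian falloff and the `sigma` parameter are my own assumptions; the idea only requires that the influence of a push fades with distance, not any particular curve.

```python
import math


def rubber_push(points, anchor, push, sigma=10.0):
    """Push `anchor` by `push` and drag nearby points along with it.

    Each point moves by the push scaled by a Gaussian weight of its
    distance from the anchor, so near neighbors move almost as far as
    the anchor and distant points barely move at all.
    """
    ax, ay = anchor
    dx, dy = push
    moved = []
    for (x, y) in points:
        d2 = (x - ax) ** 2 + (y - ay) ** 2
        w = math.exp(-d2 / (2 * sigma ** 2))
        moved.append((x + w * dx, y + w * dy))
    return moved
```

Pushing the point at (0, 0) five pixels to the left moves it the full distance, moves a point ten pixels away by roughly three pixels, and leaves a point a hundred pixels away essentially where it was.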
Imagine the pixels around the top corner nearest you. You see color from three faces: the two front faces and the top face. Let's talk about frames 1 and 2, where frame 2 comes right after frame 1. In frame 1, the corner we're looking at is a little off to the right of the center of the frame, and in frame 2, it's a little left of the center. We want the agent considering these frames to intuit from frames 1 and 2 that the corner under consideration has moved and where it is now. Now think of frame 1 as a picture made of rubber. Imagine stretching it so that it looks like frame 2. With your finger, you push the corner we're considering an inch to the left so it lines up with the same corner in frame 2. Other pixels nearby go with it. Now you do the same with the bottom corner just below it, and it's starting to look a little more like frame 2. You do the same along the edge between these two corners until the edge is pretty straight and lines up with the same edge in frame 2. And you do the same with each of the edges and corners you find in the image.
Interestingly, you can do this with frame 3, too. You can keep doing this with each frame, but eventually things "break". The left front face is eventually rotated out of view. The pixels in that face can't be pushed anywhere that will fit. They have to be cut out and the gap closed, somehow. Likewise, a new face eventually appears on the right, and there has to be a way to explain its appearance. Still, much of the scene is pretty stable in this model. Most pixels are just being pushed around.
How would such a mechanism be implemented in code? When the color of a pixel changes, the agent can't just look randomly for another pixel in the image that is the same color and claim it's the same one. Even the closest match might not be the same one. But what if each pixel were treated as a free agent that has to bargain with its nearby neighbors to come up with a consistent explanation that would, collectively, result in morphing of the sort described above? Strength in numbers would matter. Those pixels whose colors don't change would largely be non-participants. Only those that change from one frame to another would. From frame 1 to 2, pixel P changes color. In frame 1, pixel P was in color blob B1. In frame 2, P searches all the color blobs for the one whose center is closest to P that is strongly similar in color. It tries to optimize its choice on closeness in both distance and color. In the meantime, every other P that changes from frame 1 to 2 is doing the same. When it's all done, every changed-pixel P is reconsidered by reference to its neighbors. What to do next is not clear, though.
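The per-pixel choice described above, closeness in both distance and color, could be sketched like this in Python. The cost function is just a guess at a reasonable trade-off: `color_weight` and the sum-of-channel-differences color measure are assumptions, and the blobs are assumed to be given as (center, color) pairs. The bargaining with neighbors that would follow is left out, since that part remains unclear.

```python
import math


def color_distance(c1, c2):
    """Sum of absolute per-channel differences between two RGB colors."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


def best_blob(pixel_xy, old_color, blobs, color_weight=1.0):
    """Pick the frame-2 blob that best explains where a changed pixel went.

    `blobs` is a list of ((cx, cy), (r, g, b)) pairs. The cost mixes the
    distance to the blob's center with how different its color is from
    the color the pixel had in frame 1.
    """
    px, py = pixel_xy

    def cost(blob):
        (cx, cy), color = blob
        dist = math.hypot(cx - px, cy - py)
        return dist + color_weight * color_distance(old_color, color)

    return min(blobs, key=cost)
```

A pixel that was white in frame 1 will, under this cost, prefer a white blob 50 pixels away over a brown blob 5 pixels away, because the color mismatch outweighs the extra distance.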
One thing that should come out of the collaborative process, though, is a kind of optimization. Once some pixels in the image have been solidly matched by morphing, they should give helpful hints to nearby pixels as to where to begin their searches as well. If pixel P has moved 50 pixels to the right and 10 down, the pixel next to P has probably also moved about 50 pixels to the right and 10 down.
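The hinting could be sketched as a search that starts at a neighbor's known displacement and works outward from there. In this Python sketch, frames are assumed to be dictionaries mapping (x, y) to a color, a match is simply an identical color, and the names `search_order` and `find_displacement` are hypothetical.

```python
def search_order(hint, radius):
    """Candidate displacements within `radius` of `hint`, nearest first."""
    hx, hy = hint
    offsets = [
        (hx + dx, hy + dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
    ]
    offsets.sort(key=lambda d: (d[0] - hx) ** 2 + (d[1] - hy) ** 2)
    return offsets


def find_displacement(prev, curr, pixel, hint, radius=2):
    """Find where `pixel` went, trying displacements near `hint` first.

    `prev` and `curr` map (x, y) -> color. Returns the displacement found
    (or None) and how many candidates were tried.
    """
    color = prev[pixel]
    candidates = search_order(hint, radius)
    for tried, (dx, dy) in enumerate(candidates, start=1):
        if curr.get((pixel[0] + dx, pixel[1] + dy)) == color:
            return (dx, dy), tried
    return None, len(candidates)
```

If the pixel next door is already known to have moved 50 right and 10 down, passing that as the hint finds this pixel's match on the first try; a hint of (0, 0) with the same small radius finds nothing at all.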
In the case of the white circle moving around on its own, the match should be clear. But what if a white border were added around the brown square? The brown hole left behind as the white circle moves from one position to the next might result in all changed pixels P guessing that the white pixels in the border nearest the new brown hole are actually where the white circle went, but this doesn't make sense. Similarly, the new white circle in frame 2 could be thought to have come out of the white border; again, this doesn't make sense.
One answer would be a sort of conservation of "mass" concept, where mass is the number of pixels in some color blob. The white circle in frame 2 could have come from the border, but that would require creating a bunch of new white pixels. And the white circle in frame 1 could have disappeared into the white border, but this would require a complete loss of those pixels. Perhaps the very fact that we have a mass of pixels in one place in frame 1 and the same mass of the same color of pixels in another place in frame 2 should lead us to conclude that they are the same.
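A very small Python sketch of this bookkeeping follows. It scores each explanation by how many pixels it would have to create or destroy and prefers the one that conserves mass best. The function names and the example masses (a 20-pixel circle, an 800-pixel border) are made up for illustration.

```python
def created_or_destroyed(source_mass, dest_before, dest_after):
    """Pixels that must appear or vanish if the source moved into the destination.

    If the explanation is right, the destination should have grown by
    exactly `source_mass` pixels between the two frames.
    """
    return abs(dest_before + source_mass - dest_after)


def most_conservative(source_mass, candidates):
    """Pick the destination whose mass change best accounts for the source.

    `candidates` maps a destination name to (mass in frame 1, mass in frame 2).
    """
    return min(
        candidates,
        key=lambda name: created_or_destroyed(source_mass, *candidates[name]),
    )
```

For a 20-pixel circle, a border that stays at 800 pixels in both frames would mean 20 pixels simply vanished, while a new 20-pixel blob that was not there before accounts for every pixel, so the new blob wins.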
There's a lot of ground to cover with this concept. I think there's value to the idea of bitmap morphing. I think a great illustration of how this could be used by our own vision is how we deal with driving. Looking forward, the whole scene is constantly changing, but only subtly. Only the occasional bird darting by, or some other fast-moving object, disrupts the impression of a single image that's subtly morphing.