Wednesday, July 4, 2007

Plan for video patch analysis study

I've done a lot of thinking about this idea of making a program that can characterize the motions of all parts of a video scene. Not surprisingly, I've concluded it's going to be a hard problem. But unlike other cases where I've smacked up against a brick wall, I can see what seems a clear path from here to there. It's just going to take a long time and a lot of steps. Here's an overview of my plan.

First, the goal. The most basic purpose is to, as I said above, make a program that can characterize the motions of all parts of a video scene. The program should be able to fill an entire scene with "patches". Each patch will lock onto the content found in that frame and follow it throughout the video or until it can no longer be tracked. So if one patch is planted over the eye of a person walking through the scene, the patch should be able to follow that eye for at least as long as it's visible. Achieving this goal will be valuable because it will provide a sort of representation of the contents of the scene as fluidly moving but persistent objects. This seems a cornerstone of generalized visual perception, which has been entirely lacking in the history of AI research.

One key principle for all of this research will be the goal of constructing stable, generic views, elaborated by Donald D. Hoffman in Visual Intelligence. The dynamics of individual patches will be very ambiguous. Favoring stable interpretations of the world will help patches to make smarter guesses, especially when some lines of evidence strongly suggest non-stable ones.

One obvious challenge is when a patch falls on a linear edge, like the side of a house, instead of a sharp point, like a roof peak. Even more challenging will be patches that fall on homogeneous textures, like grass, where independent tracking will be very difficult. It seems clear that an important key to the success of any single patch tracking its subject matter will be cooperating with its neighboring patches to get clues about what its own motion should be. Patches that follow sharp corners will have a high degree of confidence in their ability to follow their target content. Patches that follow edges will be less certain and will rely on higher confidence patches nearby to help them make good guesses. Patches that follow homogeneous textures will have very low confidence and will rely almost exclusively on higher confidence patches nearby to make reasonable guesses about how to follow their target content.
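To make the cooperation idea concrete, here is a minimal sketch of one way a patch could lean on its neighbors in proportion to its own uncertainty. Everything in it is an assumption for illustration: the function name `blend_motion`, the 0-to-1 confidence scale, and the simple linear blend are hypothetical, not a worked-out design.

```python
def blend_motion(own_motion, own_confidence, neighbors):
    """Blend a patch's own motion guess with its neighbors' guesses.

    own_motion     -- (dx, dy) the patch estimated on its own
    own_confidence -- 0.0 (homogeneous texture) to 1.0 (sharp corner)
    neighbors      -- list of ((dx, dy), confidence) pairs
    """
    total = sum(conf for _, conf in neighbors)
    if total == 0:
        # No neighbor has any confidence, so the patch is on its own.
        return own_motion
    # Confidence-weighted average of the neighbors' motions.
    nx = sum(m[0] * conf for m, conf in neighbors) / total
    ny = sum(m[1] * conf for m, conf in neighbors) / total
    # A confident patch keeps its own guess; an unsure one defers.
    w = own_confidence
    return (w * own_motion[0] + (1 - w) * nx,
            w * own_motion[1] + (1 - w) * ny)
```

With this blend, a corner patch with confidence 1.0 ignores its neighbors entirely, while a grass patch with confidence 0.0 simply adopts the weighted average of whatever the more confident patches around it report.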

The algorithms for getting patches to cooperate will be a big challenge as it is. If the patches themselves aren't any good at following even strong points of interest, working on fabrics of patches will be a waste of time. Before any significant amount of time is spent on patch fabrics, I intend to focus attention on individual patches. A patch should be able to at least follow sharp points of interest. It should also be able to follow smooth edges laterally along the edge, like a buoy bobbing on water. Even this is a difficult challenge, though. Video of 3D scenes will include objects that move toward and away from the camera, so individual patches' target contents will sometimes shrink or expand. Nearby points of interest that look similar can confuse a patch if the target content is moving a lot. Changes in lighting, shadows cast by overhanging trees, rotation, and so on will pose a huge challenge. Some of the strongest points of interest lie on outer edges of 3D objects. As such an object moves against its background, part of the patch's pattern will naturally change. The patch needs to be able to detect its content as an object edge and learn quickly to ignore the background movements.

It's apparent that solving each of these problems will require a lot of thought, coding, and testing. It's also apparent that these components may well work against each other. It's going to be important for the patch to be able to arbitrate differing opinions among the components about where to go with each moment. How best to arbitrate is a mystery to me at present. It seems logical, then, to begin my study by creating and testing the various analysis components of a single patch.

Once I have some better definition of the analysis tools a patch will have at its disposal for independent behavior, I should then have a tool kit of black-boxes that an arbitration (and probably learning) algorithm can work with. Once I have a patch component that can do many analyses and come up with good guesses about the dynamics of its target content, then I can move on to constructing "fabrics" of patches so the patches can rely on their neighbors for additional evidence. The individual patches, if they have a generic arbitration mechanism, can use additional information from neighbors as just more evidence to arbitrate with.

I have made a conscious choice this time not to worry about performance. If it takes a day to analyze a single frame of a video, that's fine. *shudder* Well, I probably will try to at least make my research tolerable, but the result of this will almost certainly not be practical for real-time processing of video using the equipment I have on hand. However, I believe that if I am successful at least in proving the concept I'm striving for and thus advancing research into visual perception in machines, other programmers will pick apart the algorithms and reproduce them in more efficient ways. Further, it is very clear to me that individual patches are so wonderfully self-contained that it will be possible to divvy out all the patches in a scene to as many processors as we can throw at the problem. This means that if one can make a patch fabric engine that processes one frame per second using a single processor, it should be fairly easy to make it process 30 frames per second with 30 processors.

I am also dispensing somewhat with the goal of mimicking human vision with this project. I do believe a lot of what I'm trying to do does go on in our visual systems. I don't have strong reason to believe, though, that we have little parts of our brains devoted to following patches wherever they will go as time passes. That doesn't seem to fit the fixed wiring of our brains very well. It may well be that we do patch following of a sort that lets the patch slide from neural patch to neural patch, which may imply some means of passing state information along those internal paths. I can hypothesize about that, but really, I don't know enough yet to say that this is literally what happens in the human visual system. I think it's enough to say that it could.

So that's my current plan of research for a while. I have to do this in such small bites that it's going to be a challenge keeping momentum. I just hope that I've broken the project up into small enough bites to make significant progress over the longer term.

Sunday, July 1, 2007

Patch mapping in video

Over the weekend, I had one of them epiphany thingies. Sometime last week, I had started up a new vision project involving patch matching. In the past, I've explored this idea with stereo vision and discovering textures. Also, I opined a bit on motion-based segmentation here a couple of years ago.

My goal in this new experiment was fairly modest: plant a point of interest (POI) on a video scene and see how well the program can track that POI from frame to frame. I took a snippet of a music video and captured 55 frames into separate JPEG files and made a simple engine with a Sequence class to cache the video frames in memory and a PointOfInterest class, of which the Sequence object would have a list, all busy following POIs. The algorithm for finding the same patch in the next frame is really simple and only involves summing up the red, green, and blue pixel value differences in candidate patches and accepting the candidate with the lowest difference total; trivial, really. When I ran the algorithm with a carefully picked POI, I was stunned at how well it worked on the first try. I experimented with various POIs and different parameters and got a good sense for its limits and potentials. It got me really thinking a lot about how far this idea can be taken, though. Following is a sample video that illustrates what I experimented with. I explain more below. You may want to stop the video here and open it in a separate media player while you read on in the text.
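For readers who want to see the matching step spelled out, here is a minimal sketch of it. It is not the code from the experiment. It assumes a frame is a list of rows of (r, g, b) tuples, that "difference" means the absolute difference per color channel, and that candidates are searched within a fixed radius of the old position; the names `patch_difference`, `find_best_match`, and `search_radius` are made up for illustration.

```python
def patch_difference(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute R, G, B differences between two size x size patches."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            pa = frame_a[ay + dy][ax + dx]
            pb = frame_b[by + dy][bx + dx]
            total += abs(pa[0] - pb[0]) + abs(pa[1] - pb[1]) + abs(pa[2] - pb[2])
    return total


def find_best_match(prev_frame, next_frame, x, y, size, search_radius):
    """Find where the patch at (x, y) in prev_frame went in next_frame.

    Returns (best_x, best_y, best_score); the lowest difference total wins.
    """
    height, width = len(next_frame), len(next_frame[0])
    best = None
    # Clamp the search window so every candidate patch fits inside the frame.
    for cy in range(max(0, y - search_radius),
                    min(height - size, y + search_radius) + 1):
        for cx in range(max(0, x - search_radius),
                        min(width - size, x + search_radius) + 1):
            score = patch_difference(prev_frame, next_frame, x, y, cx, cy, size)
            if best is None or score < best[2]:
                best = (cx, cy, score)
    return best
```

The returned score doubles as a rough quality measure: a tracker can treat a score above some threshold as "target lost" rather than blindly accepting the least bad candidate.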

Click here to open this WMV file

I specifically wanted to show both the bad and the good of my algorithm with the above video. After I played a lot with hand-selected POIs, I let the program pick POIs based on how "sharp" regions in the image are. I was impressed at how my simple algorithm for that worked, too. As you can see, in the first frame, 20 POIs (green squares) are found at some fairly high contrast parts of the image, like the runner's neck and the boulders near the horizon. As you watch the video loop, start by watching how well the POIs on the right follow along with the video. The ones that start on the runner quickly go all over the place and "die" because they can no longer find their intended targets. Note the POIs in the rocks that get obscured by the runner's arm, though. They flash red as the arm goes by, but they pick up again as the arm uncovers them. Once a POI loses its target, it gets 3 more frames to try, during which it continues forward at the same velocity as before, and then it dies if it doesn't pick it up again. Once the man's leg covers these POIs, you can see them fly off in a vain search for where their targets might have gone before they die.
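The lose, coast, and die behavior described above can be sketched as a tiny state machine. This is an illustration rather than the actual PointOfInterest class: the class name `TrackedPoint`, its fields, and the exact counting (the POI dies on the fourth consecutive miss, meaning the initial loss plus three more tries) are assumptions.

```python
from dataclasses import dataclass

MAX_LOST_FRAMES = 3  # from the text: a lost POI gets 3 more frames to try


@dataclass
class TrackedPoint:
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    lost_frames: int = 0
    alive: bool = True

    def update(self, match):
        """Advance one frame; `match` is an (x, y) position, or None if lost."""
        if not self.alive:
            return
        if match is not None:
            # Target found: remember the velocity and reset the lost counter.
            self.vx, self.vy = match[0] - self.x, match[1] - self.y
            self.x, self.y = match
            self.lost_frames = 0
        else:
            self.lost_frames += 1
            if self.lost_frames > MAX_LOST_FRAMES:
                self.alive = False
                return
            # Target lost: coast forward at the last known velocity.
            self.x += self.vx
            self.y += self.vy
```

A drawing loop could color a point red whenever `lost_frames` is greater than zero, which would reproduce the red flashes as the runner's arm passes over the rocks.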

I don't want to go into all the details of this particular program because I intend to take this to the next logical level and will make code available for that. I thought it useful just to show a cute video and perhaps mark this as a starting point with much bigger potential.

Although I thought of a bunch of ways in which I could use this, I want to indicate one in particular. First, my general goal in AI these days is to engender what I refer to as "perceptual level intelligence". I want to make it so machines can meaningfully and generally perceive the world. In this case, I'd like to build up software that can construct a 2D-ish perception of the contents of a video stream. My view is that typical real video contains enough information to discern foreground from background and whole objects and their parts as though they were layers drawn separately and layered together, as with an old fashioned cel-type animation. In fact, I think it's possible to do this without meaningfully recognizing the objects as people, rocks, etc.

I propose filling the first frame of a video with POI trackers like the ones in this video. The ones that have clearly distinguished targets would act like anchor points. Other neighbors that would be in more ambiguous areas -- like the sky or gravel in this example -- would rely more on those anchors, but would also "talk" to their neighbors to help correct themselves when errors creep in. In fact, it should be possible for POIs that become obscured by foreground objects to continue to be projected forward. In the example above, it should actually be possible, then, to take the resulting patches that are tagged as belonging to the background and actually reproduce a new video that does not include the runner! And then another video that, by subtracting out the established background, contains only the runner. This would be a good demonstration of segmenting background and foreground.

It should also be possible for these POIs to get better and better at predicting where they will go by introducing certain learning algorithms. In fact, it's possible the POI algorithm could actually start off naive and come to learn how to properly behave on its own.

The key to both this latter dramatic feat and the other earlier goals is an idea I gleaned from Donald D. Hoffman's Visual Intelligence. One idea he promotes repeatedly in this book is the importance of "stable" interpretations of visual scenes. His book dealt primarily in static images, but this idea is powerful. Here's an example of what I mean. Watch the gravel in the video above. Naturally, gravel that is lower in the video is closer to you and thus slides by faster than the gravel higher up and thus farther away. Ideally, POI patches following this gravel would move smoothly so that higher up levels would slide slowly and lower down would slide more quickly. (To be sure, this video would have to be normalized to correct for the camera being so jumpy.) If one patch in this "stream" of flow were to think it should suddenly jut up several pixels while its neighbors are all slowly drifting to the left, this would not seem to fit a "stable" interpretation of this one patch being part of a larger whole or of it following a smooth path at a fairly consistent pace. We assume the world rarely has sudden changes and thus prefer these smooth continuations.
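One simple way to put a number on "does not fit a stable interpretation" is to compare a patch's velocity with the average velocity of its immediate neighbors. The sketch below assumes velocities are (vx, vy) pairs in pixels per frame; the function names and the 2-pixel tolerance are invented for the example, and a real version would need to allow for the gradual speed change between near and far gravel.

```python
import math


def neighbor_deviation(velocity, neighbor_velocities):
    """Distance between a patch's velocity and its neighbors' mean velocity."""
    n = len(neighbor_velocities)
    if n == 0:
        return 0.0  # nothing to compare against
    mean_vx = sum(v[0] for v in neighbor_velocities) / n
    mean_vy = sum(v[1] for v in neighbor_velocities) / n
    return math.hypot(velocity[0] - mean_vx, velocity[1] - mean_vy)


def is_unstable(velocity, neighbor_velocities, tolerance=2.0):
    """True if the patch moves too differently from the patches around it."""
    return neighbor_deviation(velocity, neighbor_velocities) > tolerance
```

For example, with neighbors all drifting left at (-1, 0), a patch that suddenly jumps up five pixels, (0, -5), is flagged as unstable, while one drifting slightly faster at (-1.5, 0) is not.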

In chapter 6 of Visual Intelligence, Hoffman addresses motion specifically and, while he doesn't talk about patch processing like this, does introduce a bunch of interesting rules for perception. Here are some of them that relate here:

  • Rule 29. Create the simplest possible motions.
  • Rule 30. When making motion, construct as few objects as possible, and conserve them as much as possible.
  • Rule 31. Construct motion to be as uniform over space as possible.
  • Rule 32. Construct the smoothest velocity field.

The idea of stable interpretations can come into play with POIs that are following boundaries of foreground objects, like the runner in this example. My POIs failed to follow in part because, while the "inside" part of the patch was associated with the man's head, for example, the "outside" would be associated with the background, which might be constantly changing as the head moves forward in space. In fact, the "outside" (background) part of such a POI should generally be "unstable", while the "inside" (foreground) stays stable. That assumption of instability of background as it constantly is obscured or uncovered by the foreground is a rule that should be helpful both in getting POIs to track these edges and in detecting these edges in the first place, and thus in segmenting foreground objects from background ones.
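One possible way to act on this, sketched under heavy assumptions: keep a running per-pixel "instability" score for each patch and let unstable pixels count for less when comparing candidates. The patches here are grayscale grids (lists of rows of numbers) for brevity, and the function names, the blending `rate`, and the `1 / (1 + instability)` weighting are all hypothetical choices rather than anything tested.

```python
def update_instability(instability, template, matched, rate=0.5):
    """Fold each pixel's latest absolute difference into a running average."""
    return [
        [(1 - rate) * inst + rate * abs(t - m)
         for inst, t, m in zip(inst_row, t_row, m_row)]
        for inst_row, t_row, m_row in zip(instability, template, matched)
    ]


def weighted_difference(template, candidate, instability):
    """Patch difference in which historically unstable pixels count for less."""
    total = 0.0
    for t_row, c_row, inst_row in zip(template, candidate, instability):
        for t, c, inst in zip(t_row, c_row, inst_row):
            total += abs(t - c) / (1.0 + inst)
    return total
```

After a few frames, pixels over the changing background would accumulate high instability and nearly drop out of the comparison. The pattern of stable versus unstable pixels within the patch is then itself a hint about where the object edge lies.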

As far as patches learning how to make predictions autonomously, here's where the concept of stable interpretations really shines. The goal of the learning process should be to make a POI algorithm that forms the most stable interpretations of the world. Therefore, when comparing two possible algorithmic changes -- perhaps using a genetic algorithm -- the fitness function would be stability itself. That is, the fitness function would measure the fidelity of the matches, how well each POI sticks with its neighbors, how well it finds foreground / background interfaces (against human-defined standards, perhaps), and so on.
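As a rough sketch of what such a fitness function could look like, the example below folds two of the measures into one score: how well the patches matched and how far each strayed from its neighbors. The names, the weights, and the choice to negate a weighted sum so that higher means more stable are all assumptions; the comparison against human-defined edge standards is left out.

```python
def stability_fitness(match_errors, neighbor_deviations,
                      w_match=1.0, w_neighbor=1.0):
    """Score one candidate algorithm; higher (closer to zero) is more stable.

    match_errors        -- per-POI patch difference totals (lower is better)
    neighbor_deviations -- per-POI velocity deviation from its neighbors
    """
    if not match_errors or not neighbor_deviations:
        raise ValueError("need at least one measurement of each kind")
    mean_match = sum(match_errors) / len(match_errors)
    mean_deviation = sum(neighbor_deviations) / len(neighbor_deviations)
    # Negate so that a genetic algorithm can simply maximize the score.
    return -(w_match * mean_match + w_neighbor * mean_deviation)
```

Two candidate tracker variants run over the same clip could then be ranked by this score, with the weights controlling how much smooth motion matters relative to raw match quality.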

There's so much more that could be said on this topic, but my blogging hand needs a break.