Patch matching in video

Over the weekend, I had one of them epiphany thingies. Sometime last week, I had started up a new vision project involving patch matching. In the past, I've explored this idea with stereo vision and discovering textures. Also, I opined a bit on motion-based segmentation here a couple of years ago.

My goal in this new experiment was fairly modest: plant a point of interest (POI) on a video scene and see how well the program can track that POI from frame to frame. I took a snippet of a music video and captured 55 frames into separate JPEG files. Then I made a simple engine with a Sequence class that caches the video frames in memory and a PointOfInterest class; the Sequence object keeps a list of PointOfInterest objects, each busy following its own POI. The algorithm for finding the same patch in the next frame is really simple: it sums up the red, green, and blue pixel value differences in candidate patches and accepts the candidate with the lowest difference total; trivial, really. When I ran the algorithm with a carefully picked POI, I was stunned at how well it worked on the first try. I experimented with various POIs and different parameters and got a good sense of its limits and potential. It really got me thinking about how far this idea can be taken. Following is a sample video that illustrates what I experimented with. I explain more below. You may want to stop the video here and open it in a separate media player while you read on in the text.
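In case that description is too terse, here is a minimal Python sketch of the matching step. The function names, patch size, and search radius are my own choices for illustration, not the program's actual parameters:

```python
def patch_difference(frame_a, frame_b, ya, xa, yb, xb, size=8):
    """Sum of absolute red, green, and blue differences between two patches.
    Frames are lists of rows; each pixel is an (r, g, b) tuple."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            pa = frame_a[ya + dy][xa + dx]
            pb = frame_b[yb + dy][xb + dx]
            total += abs(pa[0] - pb[0]) + abs(pa[1] - pb[1]) + abs(pa[2] - pb[2])
    return total

def find_best_match(prev_frame, next_frame, y, x, size=8, radius=6):
    """Try every candidate position within `radius` of (y, x) in the next
    frame and accept the one with the lowest difference total."""
    h, w = len(next_frame), len(next_frame[0])
    best_score, best_pos = None, (y, x)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = y + dy, x + dx
            if ny < 0 or nx < 0 or ny + size > h or nx + size > w:
                continue
            score = patch_difference(prev_frame, next_frame, y, x, ny, nx, size)
            if best_score is None or score < best_score:
                best_score, best_pos = score, (ny, nx)
    return best_pos, best_score
```

An exhaustive search like this is slow but honest about what "accept the lowest difference total" means; a real implementation would likely cap the search window based on the POI's recent velocity.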

Click here to open this WMV file

I specifically wanted to show both the bad and the good of my algorithm in the above video. After I played a lot with hand-selected POIs, I let the program pick POIs based on how "sharp" regions of the image are, and I was impressed at how well my simple algorithm for that worked, too. As you can see, in the first frame, 20 POIs (green squares) are found at some fairly high-contrast parts of the image, like the runner's neck and the boulders near the horizon. As you watch the video loop, start by noticing how brilliantly the POIs on the right follow the video. The ones that start on the runner quickly go all over the place and "die" because they can no longer find their intended targets. Note the POIs in the rocks that get obscured by the runner's arm, though. They flash red as the arm goes by, but they pick up again as the arm uncovers them. Once a POI loses its target, it gets 3 more frames to try, during which it continues forward at the same velocity as before; then it dies if it doesn't pick its target up again. Once the man's leg covers these POIs, you can see them fly off in a vain search for where their targets might be going before they die.
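That lose-and-coast behavior might look something like the following sketch. The class layout and match-quality threshold here are invented for illustration; they are not the program's actual PointOfInterest class:

```python
class PointOfInterest:
    """Tracks one patch; coasts on its last velocity for a few frames
    after losing its target, then dies (illustrative sketch)."""
    GRACE_FRAMES = 3  # extra frames to keep searching after a lost match

    def __init__(self, y, x):
        self.y, self.x = y, x
        self.vy, self.vx = 0, 0   # velocity from the last good match
        self.lost_for = 0         # consecutive frames without a match
        self.alive = True

    def update(self, match, threshold=1000):
        """`match` is ((y, x), score) from the patch search, or None
        when no candidate patch was convincing."""
        if not self.alive:
            return
        if match is not None and match[1] <= threshold:
            (ny, nx), _ = match
            self.vy, self.vx = ny - self.y, nx - self.x
            self.y, self.x = ny, nx
            self.lost_for = 0
        else:
            # No convincing match: keep moving at the old velocity.
            self.lost_for += 1
            self.y += self.vy
            self.x += self.vx
            if self.lost_for > self.GRACE_FRAMES:
                self.alive = False
```

Coasting at the last known velocity is what produces the "fly off in a vain search" behavior visible in the video once the runner's leg covers a patch.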

I don't want to go into all the details of this particular program because I intend to take this to the next logical level and will make code available for that. I thought it useful just to show a cute video and perhaps mark this as a starting point with much bigger potential.

Although I thought of a bunch of ways in which I could use this, I want to point out one in particular. My general goal in AI these days is to engender what I refer to as "perceptual-level intelligence": I want to make it so machines can meaningfully and generally perceive the world. In this case, I'd like to build software that can construct a 2D-ish perception of the contents of a video stream. My view is that typical real video contains enough information to discern foreground from background, and whole objects and their parts, as though they were layers drawn separately and composited together, as in old-fashioned cel animation. In fact, I think it's possible to do this without meaningfully recognizing the objects as people, rocks, and so on.

I propose filling the first frame of a video with POI trackers like the ones in this video. The ones that have clearly distinguished targets would act as anchor points. Neighbors in more ambiguous areas -- like the sky or gravel in this example -- would rely more on those anchors, but would also "talk" to their neighbors to help correct themselves when errors creep in. In fact, it should be possible for POIs that become obscured by foreground objects to continue to be projected forward. In the example above, it should then be possible to take the resulting patches tagged as belonging to the background and reproduce a new video that does not include the runner! And then another video that, by subtracting out the established background, contains only the runner. This would be a good demonstration of segmenting background from foreground.
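Assuming each patch could be tagged as foreground or background, the background-only video might be composited along these lines. This is only a sketch: the masks are assumed given, and any foreground baked into the first frame simply gets replaced as later frames uncover the background behind it:

```python
def rebuild_background(frames, fg_masks):
    """Composite a background-only video. Keep a running background image;
    wherever a pixel is tagged background, refresh it from the current
    frame, and wherever it is tagged foreground, keep the last known
    background value. Pixels can be any value type."""
    h, w = len(frames[0]), len(frames[0][0])
    background = [row[:] for row in frames[0]]
    output = []
    for frame, mask in zip(frames, fg_masks):
        for y in range(h):
            for x in range(w):
                if not mask[y][x]:          # background pixel: refresh it
                    background[y][x] = frame[y][x]
        output.append([row[:] for row in background])
    return output
```

The runner-only video would then come from the opposite test: keep only the pixels that differ from this established background.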

It should also be possible for these POIs to get better and better at predicting where they will go by introducing certain learning algorithms. In fact, it's possible the POI algorithm could actually start off naive and come to learn how to properly behave on its own.

The key to both this latter dramatic feat and the earlier goals is an idea I gleaned from Donald D. Hoffman's Visual Intelligence. One idea he promotes repeatedly in that book is the importance of "stable" interpretations of visual scenes. His book deals primarily with static images, but the idea is powerful. Here's an example of what I mean. Watch the gravel in the video above. Naturally, gravel lower in the frame is closer to you and thus slides by faster than gravel higher up, which is farther away. Ideally, POI patches following this gravel would move smoothly, with patches higher up sliding slowly and those lower down sliding more quickly. (To be sure, the video would have to be normalized to correct for the jumpy camera.) If one patch in this "stream" of flow were to decide it should suddenly jut up several pixels while its neighbors are all slowly drifting to the left, that would not fit a "stable" interpretation of the patch being part of a larger whole, or of it following a smooth path at a fairly consistent pace. We assume the world rarely changes suddenly and thus prefer smooth continuations.

In chapter 6 of Visual Intelligence, Hoffman addresses motion specifically and, while he doesn't talk about patch processing like this, introduces a number of interesting rules for perception. Here are a few that are relevant:

  • Rule 29. Create the simplest possible motions.
  • Rule 30. When making motion, construct as few objects as possible, and conserve them as much as possible.
  • Rule 31. Construct motion to be as uniform over space as possible.
  • Rule 32. Construct the smoothest velocity field.
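Rule 32's preference for a smooth velocity field suggests a simple scoring tweak for the tracker: penalize candidate moves that disagree with neighboring POIs. Here is a hedged sketch of the idea; the weight and the use of a plain mean are arbitrary placeholders:

```python
def stability_penalty(candidate_v, neighbor_vs):
    """Penalty for a candidate displacement (dy, dx) that deviates from
    the mean displacement of neighboring POIs -- a crude stand-in for
    constructing the smoothest velocity field."""
    if not neighbor_vs:
        return 0.0
    mean_y = sum(v[0] for v in neighbor_vs) / len(neighbor_vs)
    mean_x = sum(v[1] for v in neighbor_vs) / len(neighbor_vs)
    return abs(candidate_v[0] - mean_y) + abs(candidate_v[1] - mean_x)

def combined_score(match_score, candidate_v, neighbor_vs, weight=50.0):
    """Blend the raw patch-difference score with the smoothness penalty,
    so an ambiguous patch (sky, gravel) leans on its neighbors."""
    return match_score + weight * stability_penalty(candidate_v, neighbor_vs)
```

With this combined score, a patch in the gravel that suddenly wants to jut upward would lose to a candidate that drifts left with its neighbors, even if the latter's raw pixel match is slightly worse.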

The idea of stable interpretations also comes into play with POIs that follow the boundaries of foreground objects, like the runner in this example. My POIs failed to follow in part because, while the "inside" part of a patch was associated with, say, the man's head, the "outside" was associated with the background, which keeps changing as the head moves forward in space. In fact, the "outside" (background) part of such a POI should generally be "unstable", while the "inside" (foreground) stays stable. That assumption -- that the background is unstable as it is constantly obscured and uncovered by the foreground -- is a rule that should help both in getting POIs to track these edges and in detecting those edges in the first place, and thus in segmenting foreground objects from background ones.

As far as patches learning how to make predictions autonomously, here's where the concept of stable interpretations really shines. The goal of the learning process should be to make a POI algorithm that forms the most stable interpretations of the world. Therefore, when comparing two possible algorithmic changes -- perhaps using a genetic algorithm -- the fitness function would be stability itself. That is, the fitness function would measure the fidelity of the matches, how well each POI sticks with its neighbors, how well it finds foreground / background interfaces (against human-defined standards, perhaps), and so on.
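As a sketch of what such a fitness function might measure, here is one crude stability score: reward sets of POI trajectories that have few sudden velocity changes. The penalty formula is a placeholder, and a real fitness function would fold in the other measures mentioned above (neighbor agreement, match fidelity, edge finding):

```python
def path_smoothness(path):
    """Penalty for sudden velocity changes along one POI's path of
    (y, x) positions; a perfectly steady drift scores zero."""
    penalty = 0
    for i in range(2, len(path)):
        vy1, vx1 = path[i-1][0] - path[i-2][0], path[i-1][1] - path[i-2][1]
        vy2, vx2 = path[i][0] - path[i-1][0], path[i][1] - path[i-1][1]
        penalty += abs(vy2 - vy1) + abs(vx2 - vx1)
    return penalty

def stability_fitness(paths):
    """Fitness of one candidate algorithm's output: the negated total
    smoothness penalty, so stabler interpretations score higher."""
    return -sum(path_smoothness(p) for p in paths)
```

A genetic algorithm could then run competing tracker variants over the same footage and keep the ones whose trajectories score highest.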

There's so much more that could be said on this topic, but my blogging hand needs a break.

