Plan for video patch analysis study

I've done a lot of thinking about this idea of making a program that can characterize the motions of all parts of a video scene. Not surprisingly, I've concluded it's going to be a hard problem. But unlike other cases where I've smacked up against a brick wall, I can see what seems a clear path from here to there. It's just going to take a long time and a lot of steps. Here's an overview of my plan.

First, the goal. The most basic purpose is, as I said above, to make a program that can characterize the motions of all parts of a video scene. The program should be able to fill an entire scene with "patches". Each patch will lock onto the content found in that frame and follow it throughout the video, or until it can no longer be tracked. So if one patch is planted over the eye of a person walking through the scene, the patch should be able to follow that eye for at least as long as it's visible. Achieving this goal will be valuable because it will provide a sort of representation of the contents of the scene as fluidly moving but persistent objects. This seems a cornerstone of generalized visual perception, something that has been largely absent from the history of AI research.

One key principle for all of this research will be the goal of constructing stable, generic views, an idea elaborated by Donald D. Hoffman in *Visual Intelligence*. The dynamics of individual patches will be very ambiguous. Favoring stable interpretations of the world will help patches to make smarter guesses, especially when some lines of evidence would otherwise suggest non-stable ones.

One obvious challenge is when a patch falls on a linear edge, like the side of a house, instead of a sharp point, like a roof peak. Even more challenging will be patches that fall on homogeneous textures, like grass, where independent tracking will be very difficult. It seems clear that an important key to any single patch's success in tracking its subject matter will be cooperation with its neighboring patches for clues about what its own motion should be. Patches that follow sharp corners will have a high degree of confidence in their ability to follow their target content. Patches that follow edges will be less certain and will rely on higher-confidence patches nearby to help them make good guesses. Patches that follow homogeneous textures will have very low confidence and will rely almost exclusively on higher-confidence patches nearby to make reasonable guesses about how to follow their target content.
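To make the corner/edge/texture idea concrete, here is a minimal sketch of what confidence-weighted cooperation might look like. All the names and numbers here are my own illustrations, not anything already built: each patch holds a motion guess and a confidence (high for corners, low for grass), and a low-confidence patch simply lets its neighbors' weighted guesses dominate its own.

```python
# Sketch: blend a patch's own motion estimate with its neighbors',
# weighted by confidence. Hypothetical names throughout.

def blend_motion(own_motion, own_conf, neighbors):
    """own_motion: (dx, dy); neighbors: list of ((dx, dy), conf) pairs.
    Returns a confidence-weighted average motion estimate."""
    total = own_conf
    dx = own_motion[0] * own_conf
    dy = own_motion[1] * own_conf
    for (ndx, ndy), conf in neighbors:
        dx += ndx * conf
        dy += ndy * conf
        total += conf
    if total == 0:
        return own_motion  # no evidence at all; keep the old guess
    return (dx / total, dy / total)

# A low-confidence grass patch (conf 0.05) defers almost entirely to a
# nearby corner patch (conf 0.9):
print(blend_motion((0.0, 0.0), 0.05, [((2.0, 1.0), 0.9)]))
```

A plain weighted average is the simplest possible choice; it mainly shows how "confidence" can be the single currency that corners, edges, and textures all trade in.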

The algorithms for getting patches to cooperate will be a big challenge in their own right. But if the patches themselves aren't any good at following even strong points of interest, working on fabrics of patches will be a waste of time. So before any significant amount of time is spent on patch fabrics, I intend to focus attention on individual patches. A patch should at least be able to follow sharp points of interest. It should also be able to slide laterally along a smooth edge, like a buoy bobbing on water. Even this is a difficult challenge, though. Video of 3D scenes will include objects that move toward and away from the camera, so individual patches' target contents will sometimes shrink or expand. Nearby points of interest that look similar can confuse a patch if the target content is moving a lot. Changes in lighting and shadow (from passing clouds, overhanging trees, and so on), as well as rotation, will pose a huge challenge. Some of the strongest points of interest lie on the outer edges of 3D objects. As such an object moves against its background, part of the patch's pattern will naturally change. The patch needs to be able to detect that its content is an object edge and learn quickly to ignore the background's movements.
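The crudest possible baseline for a single patch is exhaustive template search: hold onto a small snapshot of the target content and, in each new frame, look for the best match within a small window around the last known position. This sketch (all names are illustrative) scores candidates by sum of squared differences, and it deliberately ignores everything hard, like scale, rotation, and lighting changes, which is exactly why those problems need their own components later.

```python
# Minimal single-patch tracker sketch: exhaustive search around the last
# position, scored by sum of squared differences (SSD). Frames are plain
# 2D lists of brightness values.

def ssd(frame, template, top, left):
    """Sum of squared differences between template and a frame region."""
    h, w = len(template), len(template[0])
    return sum((frame[top + r][left + c] - template[r][c]) ** 2
               for r in range(h) for c in range(w))

def track_patch(frame, template, last_top, last_left, radius=3):
    """Return the (top, left) position within `radius` of the last
    position that best matches the template."""
    h, w = len(template), len(template[0])
    best = None
    for top in range(max(0, last_top - radius),
                     min(len(frame) - h, last_top + radius) + 1):
        for left in range(max(0, last_left - radius),
                          min(len(frame[0]) - w, last_left + radius) + 1):
            score = ssd(frame, template, top, left)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best[1], best[2]

# A 2x2 bright blob that moved one pixel down-right is re-found:
frame = [[0] * 6 for _ in range(6)]
for r, c in ((2, 2), (2, 3), (3, 2), (3, 3)):
    frame[r][c] = 9
print(track_patch(frame, [[9, 9], [9, 9]], 1, 1))  # → (2, 2)
```

Even this toy version hints at the failure modes above: on a uniform edge or texture, many candidate positions tie at nearly the same SSD score, which is where confidence and neighbor evidence would have to come in.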

It's apparent that solving each of these problems will require a lot of thought, coding, and testing, and that these components may well work against each other. It's going to be important for the patch to be able to arbitrate differing opinions among the components about where to go at each moment. How best to arbitrate is a mystery to me at present. It seems logical, then, to begin my study by creating and testing the various analysis components of a single patch.
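Since how best to arbitrate is still an open question, here is just one speculative scheme to make the problem concrete: each component proposes a motion with a weight, proposals are grouped into rough clusters, and the heaviest cluster wins. Unlike a plain weighted average, one badly confused component can't drag the final answer off on its own. Everything here is a hypothetical illustration.

```python
# Speculative arbitration sketch: cluster the components' motion
# proposals and return the weighted mean of the heaviest cluster.

def arbitrate(proposals, tolerance=1.0):
    """proposals: list of ((dx, dy), weight) pairs from the analysis
    components. Returns a single arbitrated (dx, dy)."""
    def close(a, b):
        return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance

    best_cluster, best_weight = None, -1.0
    for anchor, _ in proposals:
        cluster = [(m, w) for (m, w) in proposals if close(m, anchor)]
        weight = sum(w for _, w in cluster)
        if weight > best_weight:
            best_cluster, best_weight = cluster, weight
    dx = sum(m[0] * w for m, w in best_cluster) / best_weight
    dy = sum(m[1] * w for m, w in best_cluster) / best_weight
    return (dx, dy)

# Two components roughly agree on (2, 0); a confused third component
# votes (-5, 3) and is outvoted by the agreeing cluster:
print(arbitrate([((2.0, 0.0), 0.9), ((1.8, 0.2), 0.6), ((-5.0, 3.0), 0.3)]))
```

The appeal of keeping arbitration this generic is that it doesn't care what the components are; a new analysis tool just becomes one more weighted voice.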

Once I have a better definition of the analysis tools a patch will have at its disposal for independent behavior, I should have a tool kit of black boxes that an arbitration (and probably learning) algorithm can work with. Once I have a patch component that can run many analyses and come up with good guesses about the dynamics of its target content, I can move on to constructing "fabrics" of patches so that patches can rely on their neighbors for additional evidence. If the individual patches have a generic arbitration mechanism, they can treat information from neighbors as just more evidence to arbitrate with.
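Structurally, a fabric might be nothing more than patches laid out on a grid, each able to gather its neighbors' current guesses and hand them to its own arbitration step as extra evidence. A sketch, with the grid layout and field names purely assumed for illustration:

```python
# Sketch of a patch "fabric": patches on a grid, each able to collect
# (motion, confidence) evidence from its 8 surrounding neighbors.

class Patch:
    def __init__(self, row, col):
        self.row, self.col = row, col
        self.motion = (0.0, 0.0)   # current guess at (dx, dy)
        self.confidence = 0.0

def build_fabric(rows, cols):
    return {(r, c): Patch(r, c) for r in range(rows) for c in range(cols)}

def neighbor_evidence(fabric, patch):
    """Collect (motion, confidence) pairs from surrounding patches —
    just more evidence for the patch's own arbitration step."""
    evidence = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbor = fabric.get((patch.row + dr, patch.col + dc))
            if neighbor is not None:
                evidence.append((neighbor.motion, neighbor.confidence))
    return evidence
```

The point of the dictionary lookup is that edge and corner patches of the scene simply get fewer neighbors, with no special-case code.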

I have made a conscious choice this time not to worry about performance. If it takes a day to analyze a single frame of a video, that's fine. *shudder* Well, I probably will try to at least make my research tolerable, but the result of this will almost certainly not be practical for real-time processing of video using the equipment I have on hand. However, I believe that if I am successful at least in proving the concept I'm striving for and thus advancing research into visual perception in machines, other programmers will pick apart the algorithms and reproduce them in more efficient ways. Further, it is very clear to me that individual patches are so wonderfully self-contained that it will be possible to divvy out all the patches in a scene to as many processors as we can throw at the problem. This means that if one can make a patch fabric engine that processes one frame per second using a single processor, it should be fairly easy to make it process 30 frames per second with 30 processors.
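Because each patch's per-frame update depends only on its own state and the new frame, the work divides cleanly across processors. A sketch using Python's standard multiprocessing module; `update_patch` here is a trivial stand-in for the real per-patch analysis, which is the assumption that makes the example small.

```python
# Sketch: divvying the patches of one frame out to a pool of workers.

from multiprocessing import Pool

def update_patch(args):
    """Pure per-patch step: takes ((x, y), frame), returns the new
    position. A placeholder that applies a fixed motion of (1, 0)."""
    (x, y), frame = args
    dx, dy = 1, 0  # a real patch would estimate this from the frame
    return (x + dx, y + dy)

def update_fabric_parallel(patches, frame, workers=4):
    """Update every patch for one frame, split across `workers` processes."""
    with Pool(workers) as pool:
        return pool.map(update_patch, [(p, frame) for p in patches])

if __name__ == "__main__":
    patches = [(r, c) for r in range(4) for c in range(4)]
    print(update_fabric_parallel(patches, frame=None)[:3])
```

Since `pool.map` preserves order and each call is independent, thirty processors really can each take a share of the patches, which is the back-of-the-envelope argument for 1 frame/second becoming 30.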

I am also dispensing somewhat with the goal of mimicking human vision with this project. I do believe a lot of what I'm trying to do does go on in our visual systems. I don't have strong reason to believe, though, that we have little parts of our brains devoted to following patches wherever they will go as time passes. That doesn't seem to fit the fixed wiring of our brains very well. It may well be that we do patch following of a sort that lets the patch slide from neural patch to neural patch, which may imply some means of passing state information along those internal paths. I can hypothesize about that, but really, I don't know enough yet to say that this is literally what happens in the human visual system. I think it's enough to say that it could.

So that's my current plan of research for a while. I have to do this in such small bites that keeping momentum is going to be a challenge. I just hope I've broken the project into bites small enough to allow significant progress over the longer term.
