Video stabilizer

I haven't had much chance to do coding for my AI research of late. My most recent experiment dealt with patch matching in video streams. Here's a source video, taken from a hot air balloon, with a run of what I'll call a "video stabilizer" applied:

Full video with "follower" frame.
Click here to open this WMV file

Contents of the follower frame.
Click here to open this WMV file

The colored "follower" frame in the left video does its best to lock onto the subject it first sees when it appears. As the follower moves off center, a new frame is created in the center to take over. The right video is of the contents of the colored frame. (If the two videos appear out of sync, try refreshing this page once the videos are totally loaded.)

This algorithm does a surprisingly good job of tracking the ambient movement in this particular video. That was the point, though. I wondered how well a visual system could learn to identify stable patterns in a video if the video was not stable in the first place. I reasoned that an algorithm like this could help a machine vision system make the world a little more stable for second-level processing of source video.

The algorithm for this feat is unbelievably simple. I have a class representing a single "follower" object. A follower has a center point, relative to the source video, and a width and height. We'll call this a "patch" of the video frame. With each passing frame, it does a pixel-by-pixel comparison of what's inside the current patch against candidate patches in the next video frame, in search of a good match.
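To make the "patch" idea concrete, here's a minimal sketch of the state a follower might carry. This is not my actual class. The names CenterX, CenterY, RadiusX and RadiusY match the comparison code in this post, but FitsInside and MoveBy are hypothetical helpers added purely for illustration, and storing the size as radii rather than width and height is an assumption.

```csharp
using System;

// Hypothetical sketch of a follower's state; not the original class.
public class Follower
{
    public int CenterX, CenterY;   // patch center, in source-video pixel coordinates
    public int RadiusX, RadiusY;   // half-width and half-height of the patch

    public Follower(int centerX, int centerY, int radiusX, int radiusY)
    {
        CenterX = centerX; CenterY = centerY;
        RadiusX = radiusX; RadiusY = radiusY;
    }

    // Patch bounds, inclusive on both ends.
    public int Left   => CenterX - RadiusX;
    public int Right  => CenterX + RadiusX;
    public int Top    => CenterY - RadiusY;
    public int Bottom => CenterY + RadiusY;

    public int Width  => 2 * RadiusX + 1;
    public int Height => 2 * RadiusY + 1;

    // True when the whole patch lies inside a frame of the given size.
    public bool FitsInside(int frameWidth, int frameHeight) =>
        Left >= 0 && Top >= 0 && Right < frameWidth && Bottom < frameHeight;

    // Move the patch by the best-matching offset found for the next frame.
    public void MoveBy(int offsetX, int offsetY)
    {
        CenterX += offsetX;
        CenterY += offsetY;
    }
}
```

A check like FitsInside is one plausible way to notice that a follower has drifted too far and should hand over to a fresh one in the center, though the real hand-off rule may differ.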

For each patch considered in the next frame, a very simple difference calculation is performed. For each pair of corresponding pixels in the two patches under consideration (current-frame and next-frame), the absolute differences in the red, green, and blue values are added to a running difference total. The candidate patch that has the lowest total difference is considered the best match and is thus where the follower goes in this next frame. Here's the code for comparing the current patch against a candidate patch in the next frame:

private int CompareRegions(int OffsetX, int OffsetY) {
    int X, Y, Diff;
    Color A, B;

    // Sample only every ScanSpacing-th pixel in each direction.
    const int ScanSpacing = 10;

    Diff = 0;

    for (Y = CenterY - RadiusY; Y <= CenterY + RadiusY; Y += ScanSpacing) {
        for (X = CenterX - RadiusX; X <= CenterX + RadiusX; X += ScanSpacing) {
            // Pixel in the current patch and its counterpart in the shifted candidate patch.
            A = GetPixel(CurrentBmp, X, Y);
            B = GetPixel(NextBmp, X + OffsetX, Y + OffsetY);
            // Add the absolute per-channel differences to the running total.
            Diff +=
                Math.Abs(A.R - B.R) +
                Math.Abs(A.G - B.G) +
                Math.Abs(A.B - B.B);
        }
    }

    return Diff;
}

Assuming the above gibberish makes any sense, you may notice "Y += ScanSpacing" and the same for X. That's an optimization. In fact, the program does include a number of performance optimizations that help make the run-time on these processes more bearable. First, a follower doesn't consider all possible patches in the next frame to decide where to move. It only considers patches within a certain radius of the current location. OffsetX, for example, may only be +/- 50 pixels, which means if the subject matter in the video slides horizontally more than 50 pixels between frames, the algorithm won't work right. Still, this can increase frame processing rates 10-fold, with smaller search radii yielding shorter run-times.
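Here's a rough, self-contained sketch of how that limited search might look. It is not my actual code. To keep it short it assumes grayscale frames stored as byte[,] arrays indexed [y, x] (the real program compares red, green, and blue), a square patch, and hypothetical names PatchSearch, PatchDifference and FindBestOffset. The caller is assumed to keep the current patch inside the current frame.

```csharp
using System;

public static class PatchSearch
{
    // Sum of absolute differences between the patch centered at (cx, cy) in
    // `current` and the same patch shifted by (dx, dy) in `next`.
    public static int PatchDifference(byte[,] current, byte[,] next,
        int cx, int cy, int radius, int dx, int dy, int spacing)
    {
        int diff = 0;
        for (int y = cy - radius; y <= cy + radius; y += spacing)
            for (int x = cx - radius; x <= cx + radius; x += spacing)
                diff += Math.Abs(current[y, x] - next[y + dy, x + dx]);
        return diff;
    }

    // Try every offset within +/- searchRadius and return the one with the
    // lowest total difference. Ties keep the first offset found.
    public static (int Dx, int Dy) FindBestOffset(byte[,] current, byte[,] next,
        int cx, int cy, int radius, int searchRadius, int spacing)
    {
        int height = next.GetLength(0), width = next.GetLength(1);
        int bestDx = 0, bestDy = 0, bestDiff = int.MaxValue;

        for (int dy = -searchRadius; dy <= searchRadius; dy++)
        {
            for (int dx = -searchRadius; dx <= searchRadius; dx++)
            {
                // Skip candidates that would read outside the next frame.
                if (cx + dx - radius < 0 || cx + dx + radius >= width ||
                    cy + dy - radius < 0 || cy + dy + radius >= height)
                    continue;

                int diff = PatchDifference(current, next, cx, cy, radius, dx, dy, spacing);
                if (diff < bestDiff)
                {
                    bestDiff = diff;
                    bestDx = dx;
                    bestDy = dy;
                }
            }
        }
        return (bestDx, bestDy);
    }
}
```

The number of candidates grows with the square of the search radius: a radius of 50 means 101 x 101 = 10,201 patch comparisons per frame, which is why shrinking the radius pays off so quickly.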

As for "Y += ScanSpacing", that was a shot in the dark for me. I was finding frame processing was still taking a very long time. So I figured, why not sample only every Nth pixel in the patches during the patch comparison operation? I was surprised to find that even with ScanSpacing of 10 (with a patch at least 60 pixels wide or tall), the follower didn't lose much of its ability to track the subject matter. Not surprisingly, the higher the scan spacing, the lower the fidelity, but the faster the run. Because the spacing applies in both X and Y, doubling ScanSpacing cuts the number of pixels compared by roughly a factor of 4, with a similar increase in the frame processing rate.
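For the curious, the arithmetic behind that speed-up can be sketched like this. ScanCost and its two methods are hypothetical names. They just count how many pixel pairs a comparison loop of the form "from center minus radius to center plus radius, stepping by the spacing" would visit.

```csharp
public static class ScanCost
{
    // Sample positions along one axis: -radius to +radius inclusive, stepping by spacing.
    public static int SamplesPerAxis(int radius, int spacing) =>
        (2 * radius) / spacing + 1;

    // Total pixel pairs compared for one candidate patch.
    public static int SamplesPerPatch(int radiusX, int radiusY, int spacing) =>
        SamplesPerAxis(radiusX, spacing) * SamplesPerAxis(radiusY, spacing);
}
```

For a patch with a radius of 30 in each direction, a spacing of 1 visits 61 x 61 = 3,721 pixel pairs, a spacing of 10 visits 7 x 7 = 49, and a spacing of 20 visits 4 x 4 = 16. So for a small patch like this, doubling the spacing from 10 to 20 gives closer to a 3-fold saving. The 4-fold figure is the limit for patches that are large relative to the spacing.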

I am inclined to think the process demonstrated in the above video is analogous to what our own eyes do. In any busy motion scene, I think your eyes engage in a very muscle-intensive process of locking in, moment by moment, on stable points of interest. In this case, the follower's fixation is chosen at random, essentially. Whatever is in the center becomes the fixation point. Still, the result is that our eyes can see the video, frame by frame, as part of a continuous, stable world. By fixating on some point while the view is in motion, whether on a television or looking out a car window, we get that more stable view.

Finally, one thought that kinda drives this research, but is really secondary to it, is that this could be a practical algorithm for video stabilization. In fact, I suspect the makers of video cameras are using it in their digital stabilization. It would be interesting to see someone create a freeware product or plug-in for video editing software because the value seems pretty obvious.

