Machine vision: smoothing out textures

Following is another in my series of ad hoc journal entries I've been keeping of my thoughts on machine vision.

While reading up more on how texture analysis has been dealt with in recent years, I realized yesterday that there may be a straightforward way to do basic region segmentation based on textures. I consider texture-based segmentation to be one of two major obstacles to stepping above the trivial level of machine vision today toward general purpose machine vision. The other regards recognizing and ignoring illumination effects.

Something struck me a few hours after reading how one researcher chose to define textures. Among others, he made two interesting points. First, that a smooth texture must, when sampled at a sufficiently low resolution, cease to be a texture and instead become a homogeneous color field. Second, the size of a texture must be at least several times larger in dimensions (e.g., width and height) than a single textural unit. A rectangular texture composed of river rocks, for example, must be several rocks high and several rocks wide to actually be a texture.

Later, when I was trying to figure out what characteristics are worth consideration for texture-based segmentation that don't require me to engage in complicated mathematics, I remembered the concept I had been pursuing recently when I started playing with video image processing. I thought I could kill two birds with one stone by processing very low-res versions of source images: eliminating fine textures and reducing the number of pixels to process. I was disappointed, though, by the fact that low-res meant little information of value.

I realized that there was another way to get the benefit of blurring textures into smooth color fields without actually blurring the images (or lowering their resolution, which amounts to the same thing). The principle is as follows.

Imagine an image that includes a sizable sampling of a texture. Perhaps it has a brick wall in the picture with no shadows, graffiti, or other significant confounding inclusions on the wall. The core principle is that there is a circle with a smallest radius CR (critical radius) that is large enough to be "representative" of at least one unit of that texture. In this case, what determines if it is representative is whether the circle can be placed anywhere within the bounds of the texture - the wall, in our example - and the average color of all the pixels within it will be almost exactly the same as if the circle were set anywhere else in that single-texture region.
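As a rough sketch of this "representative circle" idea, here's one way the test could look in code, assuming the image is an H×W×3 NumPy array; the tolerance value and the random-sampling strategy are my own placeholders, not anything specified above:

```python
import numpy as np

def mean_color_in_circle(img, cy, cx, r):
    """Average color of all pixels within radius r of (cy, cx)."""
    ys, xs = np.ogrid[:img.shape[0], :img.shape[1]]
    mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= r * r
    return img[mask].mean(axis=0)

def is_representative(img, r, trials=20, tol=4.0, rng=None):
    """Test whether a circle of radius r is 'representative' of the
    image's texture: every placement should yield nearly the same
    average color (within tol, in color-channel units)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    means = []
    for _ in range(trials):
        cy = rng.integers(r, h - r)
        cx = rng.integers(r, w - r)
        means.append(mean_color_in_circle(img, cy, cx, r))
    means = np.array(means)
    spread = np.abs(means - means.mean(axis=0)).max()
    return spread <= tol
```

The smallest r for which this test passes over a region would be that region's critical radius CR.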

If we want to identify the brick wall's texture as standing apart from the rest of the image, then, we have to do two things in this context. One, we need to find that critical radius (CR). Two, we need to populate the wall-textured region with enough CR circles so no part is left untested, yet no CR circle extends beyond the region. The enclosed region, then, is the candidate region.

I suppose this could work with squares, too. It doesn't have to be circles, but there may be some curious symmetric effects that come into play that I'm not aware of. Let's limit the discussion to circles, though.

So how does one determine the critical radius? A single random test won't do, because we don't know in advance that the test circle actually falls within a texture without some a priori knowledge. Our goal is to discover, not just validate.

I propose a dynamically varying grid of test circles that looks for local consistencies. Picture a grid with a circle centered at each node. The circles should overlap in such a way that there are no gaps; that is, the radius should be at least half the distance between one node and the node one unit down and across from it. In the first step, the chosen CR (radius), and hence the grid spacing, would be small - two pixels, for example. As the test progresses, CR might grow by a simple doubling process or by some other multiplier. The grid would cover the entire image under consideration, and the process would continue upward until the CR values chosen no longer allow a sufficient number of sample circles to fit within the image.

The result of each pass of this process would be a new "image", with one pixel per grid node in the source image. That pixel's color would be the average of the colors of all the pixels within the test circle at that node. We would then search the new image for smooth color blobs using traditional techniques. Any significantly large blobs would be considered candidates for homogeneous textures.
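The grid-of-circles passes and the coarse "images" they produce might be sketched like this. The node spacing and doubling schedule follow the description above; `min_nodes` is my own stand-in for "a sufficient number of sample circles":

```python
import numpy as np

def circle_average_pyramid(img, r0=2, min_nodes=4):
    """Build a series of coarse 'images', one pixel per grid node,
    where each pixel is the mean color inside a test circle of
    radius r centered on that node.  r doubles each pass until too
    few circles fit (fewer than min_nodes per side)."""
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    tiers = []
    r = r0
    while True:
        step = r  # spacing <= r*sqrt(2) keeps the circles overlapping
        ys = np.arange(r, h - r, step)
        xs = np.arange(r, w - r, step)
        if len(ys) < min_nodes or len(xs) < min_nodes:
            break
        coarse = np.empty((len(ys), len(xs), img.shape[2]))
        for i, cy in enumerate(ys):
            for j, cx in enumerate(xs):
                mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= r * r
                coarse[i, j] = img[mask].mean(axis=0)
        tiers.append((r, coarse))
        r *= 2
    return tiers
```

Each tier in the result could then be fed to an ordinary smooth-blob search, as described above.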

I'm not entirely sure exactly how to make use of this information, but there's something intuitively satisfying about it. I've been thinking for a while now that we note the average colors of things and that that seems to be an important part of our way of categorizing and recognizing things. A recent illustration of this for me is a billboard I see on my way to work. It has a large region of cloudy sky, but the image is washed to an orangish-tan. From a distance, it looks to me just like the surface of a cheese pizza. So even though I know better, my first impression whenever I see this billboard - before I think about it - is of a cheese pizza. The pattern is obviously of sky and bears only modest resemblance to a pizza, but the overall color is very right.

Perhaps one way to use the resulting tiers of color blobs is to break down and analyze textures. Let's say I have one uniform color blob at tier N. I can look at the pixels of the tier N - 1 (higher resolution) version of this same region. One question I might ask is whether those pixels, too, are consistent. If so, maybe the texture is really just a smooth color region. If not, then maybe I really did capture a rough but consistent texture. I might then try to see how much variation there is at that higher resolution level. Maybe I can identify the two or three most prominent colors. In my sky-as-cheese-pizza example, it's clear that I see the dusty orange and white blobs collectively as appearing pizza-like; it's not just the average of the two colors. I could also use other conventional texture analysis techniques, like co-occurrence matrices. Once I have the smoothness point (resolution) for a given color blob, I could double or quadruple the resolution to make the texture rough enough for the single-pixel distances common in such analyses, rather than working at resolutions so high that those techniques don't work well.
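A minimal sketch of that per-blob follow-up analysis might look as follows; the smoothness threshold and the crude color-cube quantization used to find "prominent colors" are my own stand-ins, not part of the scheme described above:

```python
import numpy as np

def blob_texture_stats(finer, k=3):
    """Given the finer-tier pixels covering one smooth blob, report
    whether the region is genuinely smooth, plus its k most prominent
    colors (approximated by quantizing each channel to 8 levels and
    counting the most frequent cells of that coarse color cube)."""
    px = finer.reshape(-1, finer.shape[-1])
    spread = px.std(axis=0).max()
    smooth = spread < 8.0  # threshold in channel units; tune per source
    quant = (px // 32).astype(int)                  # 8 levels per channel
    keys = quant[:, 0] * 64 + quant[:, 1] * 8 + quant[:, 2]
    vals, counts = np.unique(keys, return_counts=True)
    top = vals[np.argsort(counts)[::-1][:k]]
    # decode cell indices back to approximate cell-center colors
    colors = np.stack([top // 64, (top // 8) % 8, top % 8], axis=1) * 32 + 16
    return smooth, colors
```

If `smooth` comes back true, the blob was probably just a flat color region; if not, the returned colors are a first guess at the texture's palette (the orange and white of the pizza-like sky, say).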

Critics will be quick to point out that all I'm capturing in this algorithm is the ambient color of a texture. I might have a picture of oak trees tightly packed and adjacent to tightly packed pine trees. The ambient color of the two kinds of trees' foliage might be identical, and so I would see them as a single grouping. To that I say the criticism is valid, but probably irrelevant. I think it's reasonable to hypothesize that our own eyes deal with ambient texture color "before" they get into details like discriminating patterns. Further, I think a system that can successfully discriminate purely on the basis of ambient texture color would be much farther ahead than the alternatives I've seen to date. That is, it seems very practical.

Besides, the math is very simple, which to me is a compelling reason to believe it's something like how human vision might work. I can imagine the co-occurrence concept playing a role, but the combinatorics for a neural network that doesn't regularly change its physical structure seem staggering. By contrast, while it may take a long time for a linear processor to go through all these calculations, the function is so simple and repetitive that it's easy to imagine a few cortical layers implementing it all in parallel and getting results very quickly.

As a side note, I'm pretty well convinced that outside the fovea, our peripheral vision is doing most of its work using simple color blobs. Once we know what an object is, we just assume it's there as its color blobs move gradually around the periphery until the group of them moves out of view. It seems we track movement there, not details. The rest is just an internal model of what we assume is there. This strengthens my sense that within the fovea, there may be a more detail-oriented version of this same principle at work.

What I haven't figured out yet is how to deal with illumination effects. I suspect the same tricks that would be used for dealing with an untextured surface that has illumination effects on it would also be used on the lower resolution images generated by this technique. That is, the two problems would have to be processed in parallel. They could not be dealt with one before the other, I think.

