Machine vision of GUIs

I just completed a brief foray into machine vision with a project focused on seeing and, to some degree, "understanding" windowed graphical user interfaces (GUIs) like Microsoft Windows. I wrote a test program and an essay on the subject, so rather than repeat their contents here, I'd suggest you visit the project's home page. But I'll summarize briefly.

The base premise of my explorations is that most GUIs are composed of rectangular blocks within blocks. The core of the concept I was experimenting with is a pair of algorithms I call "expansion" and "contraction". "Expansion" means starting with a test rectangle inside a block and, like a balloon, growing it outward until it finds the outer bounds of that block. Similarly, "contraction" means starting with a rectangle just inside a block's outer bounds and gradually shrinking it inward until it wraps snugly around the one or more inner blocks that punctuate the otherwise smooth interior of the first block, like water filling a dry stream bed to reveal the islands within it.
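To make the two operations concrete, here is a minimal Python sketch. It is not the code from the demonstration program. It assumes the screen is a small grid of integer color codes, that a block's interior is mostly one flat color, and that a rectangle is an inclusive (left, top, right, bottom) tuple. The names expand, contract, and SCREEN are made up for illustration, and so is the stopping rule for expansion: a side keeps growing while the next line of pixels still contains the seed color.

```python
def expand(grid, x, y):
    """Grow a 1x1 rectangle at (x, y) outward, balloon-style.

    Assumed rule: a side keeps growing while the next row or column
    still contains at least one pixel of the seed color.
    """
    height, width = len(grid), len(grid[0])
    color = grid[y][x]
    left, top, right, bottom = x, y, x, y  # inclusive bounds

    def row_has(r):
        return any(grid[r][c] == color for c in range(left, right + 1))

    def col_has(c):
        return any(grid[r][c] == color for r in range(top, bottom + 1))

    grew = True
    while grew:
        grew = False
        if top > 0 and row_has(top - 1):
            top, grew = top - 1, True
        if bottom < height - 1 and row_has(bottom + 1):
            bottom, grew = bottom + 1, True
        if left > 0 and col_has(left - 1):
            left, grew = left - 1, True
        if right < width - 1 and col_has(right + 1):
            right, grew = right + 1, True
    return left, top, right, bottom


def contract(grid, rect):
    """Shrink rect inward until it hugs whatever is not background.

    Assumed rule: the background is the color at rect's top-left corner.
    Returns None when the rectangle holds nothing but background.
    """
    left, top, right, bottom = rect
    background = grid[top][left]

    def row_blank(r):
        return all(grid[r][c] == background for c in range(left, right + 1))

    def col_blank(c):
        return all(grid[r][c] == background for r in range(top, bottom + 1))

    while top <= bottom and row_blank(top):
        top += 1
    while bottom >= top and row_blank(bottom):
        bottom -= 1
    if top > bottom:
        return None  # nothing inside but background
    while col_blank(left):
        left += 1
    while col_blank(right):
        right -= 1
    return left, top, right, bottom


# Toy screen: 0 = desktop, 1 = window interior, 2 = a button in the window.
SCREEN = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 2, 2, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

window = expand(SCREEN, 1, 1)      # (1, 1, 6, 4): the whole window
button = contract(SCREEN, window)  # (3, 3, 4, 3): the button inside it
```

Expanding from a pixel inside the window finds the window's outer bounds, and contracting from those bounds closes in on the button. On a real screen, gradients, borders, and overlapping windows would defeat rules this simple, so treat this as an illustration of the idea rather than a working detector.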

The main point of analyzing a user's screen with expansion and contraction is to find the boundaries of the UI blocks and so carve a complex screen into smaller units that other, more traditional vision systems can process. An optical character recognition (OCR) system, for example, might be able to read the text on a button or in a text box. A neural network might be used to recognize an icon on a button. A neural net or classifier system could be used to draw conclusions about what a particular arrangement of blocks within blocks represents. It might, for example, be able to distinguish a word processor from a web browser.

Ultimately, there could be all sorts of applications for a system that can reasonably grasp most of the basic elements of a windowed GUI. I had fun writing a simple demonstration system that illustrates some of the strengths and weaknesses of the concept, which I describe in the accompanying essay. I also made the source code of that program available for download.

