Sunday, October 21, 2007

Perception as construction of stable interpretations

I've been spending a lot of time lately thinking about the nature of perception. As I've said before, I believe AI has gotten stuck at the two coastlines of intelligence: the knee-jerk reactions of the sensory level and the castles-in-the-sky of the conceptual level. We've been missing the huge interior of the perceptual level of intelligence. It's not that programmers are ignoring the problem; they just don't have much of a theoretical framework to work with yet. We don't really know yet how humans perceive, so it's hard to say how a machine could be made to perceive in a way familiar to humans.

Example of a stable interpretation

I've been focused very much on the principle of "stable interpretation" as a fundamental component of perception. To illustrate what I mean by "stable", consider the following short video clip:


Click here to open this WMV file

This is taken from a larger video I've used in other vision experiments. Here, I've already applied a program that "stabilizes" a source video by tracking some central point as it moves from frame to frame and clipping out the margins. You can still see motion, though. The camera is tilting. The foreground is sliding from right to left. And there is a noticeable flicker of pixels because the source video is of a low resolution. On the other hand, you have no trouble at all perceiving each frame as part of a continuous scene. You don't really see frames. You just see a rocky shore and sky in apparent motion as the camera moves along. That's what perception in a machine should be like, too.

The problem is that the interpretation of a static scene in which only the camera moves does not arise directly from the source data. If you were to simply watch a single pixel in this video as the frames progress, you'd see that even it changes, literally, from frame to frame. Also, individual rocks do move relative to the frame and to each other. Yet you easily sense that there's a rigid arrangement of rocks. How?

One way of forming a stable view is one I've dabbled in for a long time: patch matching. Here, I took a source video and put a smaller "patch" in it that's the size of the video frames you see here. With each passing frame, my code compares different places to move the patch frame to in hopes of finding the best-matching candidate patch. As you can see, it works pretty well. But this is a very brittle algorithm. Were I to include subsequent frames, where a man runs through the scene, you would see that the patch "runs away" from the man because his motion breaks up the "sameness" from frame to frame. My interpretation is that the simple patch comparisons I use are insufficient; this cheap trick is, at best, a small component in a larger toolset needed for constructing stable interpretations. A more robust system would be able to stay locked on the stable backdrop as the man runs through the scene, for instance.

What is a stable interpretation?

What makes an interpretation of information "stable"? The video example above is riddled with noise. One fundamental thing perception does is filter out noise. If, for example, I painted a red pixel in one frame of the video, you might notice it, but you would quickly conclude that it is random noise and ignore it. If I painted that same red pixel in several more frames, you would no longer consider it noise, but some artifact with a significant cause. Seeing the same information repeated is the essence of non-randomness.

"Stability", in the context of perception, can be defined as "the coincidental repetition of information that suggests a persistent cause for that information."

My Pattern Sniffer program and blog entry illustrate one algorithm for learning that is based almost entirely on this definition of stability. The program is confronted with a series of patterns. Over time, individual neurons come to learn to strongly recognize the patterns. Even when I introduced random noise distorting the images, it still worked very well at learning "idealized" versions of the distorted patterns that do not reflect the noise. Shown a given image once, a "free" neuron might instantly learn it, but without repetition over time, it would quickly forget the pattern. My sense is that Pattern Sniffer's neuron bank algorithm is very reusable in many contexts of perception, but it's obviously not a complete vision system, per se.
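
To give a flavor of what I mean, here's a rough sketch of a Pattern-Sniffer-style neuron bank, written from the description above rather than lifted from the actual program; the learning and decay rates are arbitrary:

// A bank of neurons, each holding a weight vector. The neuron closest to a
// presented pattern learns it a little more strongly; every other neuron's
// confidence decays, so patterns that are never repeated fade away and
// their neurons become "free" again.
class NeuronBank {
    private readonly double[][] weights;    // one weight vector per neuron
    private readonly double[] confidence;   // how committed each neuron is (0..1)
    private readonly Random rng = new Random();

    public NeuronBank(int neuronCount, int patternSize) {
        weights = new double[neuronCount][];
        confidence = new double[neuronCount];
        for (int i = 0; i < neuronCount; i++) {
            weights[i] = new double[patternSize];
            for (int j = 0; j < patternSize; j++)
                weights[i][j] = rng.NextDouble();   // start as unstructured noise
        }
    }

    // Present one pattern; returns the index of the neuron that claimed it.
    public int Present(double[] pattern) {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int i = 0; i < weights.Length; i++) {
            double dist = 0;
            for (int j = 0; j < pattern.Length; j++)
                dist += Math.Abs(weights[i][j] - pattern[j]);

            // Free (low-confidence) neurons are eager: discount their distance
            // so a novel pattern goes to them instead of disturbing an expert.
            dist *= 0.5 + 0.5 * confidence[i];

            if (dist < bestDist) { bestDist = dist; best = i; }
        }

        // The winner moves toward the pattern; a free neuron learns it almost
        // instantly, while an expert barely changes.
        double rate = 1.0 - confidence[best];
        for (int j = 0; j < pattern.Length; j++)
            weights[best][j] += rate * (pattern[j] - weights[best][j]);
        confidence[best] = Math.Min(1.0, confidence[best] + 0.1);

        // Everyone else forgets a little; without repetition, a neuron
        // eventually becomes free again.
        for (int i = 0; i < weights.Length; i++)
            if (i != best)
                confidence[i] = Math.Max(0.0, confidence[i] - 0.01);

        return best;
    }
}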

What is repetition?

When I speak of repetition, in the context of Pattern Sniffer, it's obvious that I mean showing the neuron bank a given pattern many times. But that's not the only form of repetition that matters to perception. Consider the following pie chart image:

When you look at the "Comedy" (27%) wedge, you see that it is solid orange. You instantly perceive it as a continuous thing, separable from the rest of the image. Why? Because the orange color is repeated across many pixels. Here's a more interesting example image of a wall of bricks:

Your visual perception instantly grasps that the bricks are all the "same". Not literally, if you consider each pixel in each brick, but in a deep sense, you see them as all the same. The brick motif repeats itself in a regularized pattern.

When your two eyes are working properly, they will tend to fixate on the same thing. Your vision is thus recognizing that what your left eye sees is repeated also in your right eye, approximately.

In each of these cases, one can apply the patch comparison approach to searching for repeated patterns. This is just in the realm of vision and only considers 2D patches of source images. But the same principle can be applied to any form of input. A "patch" can be a 1D pattern in linear data, just the same. Or it could encompass a 4D space of taste components (sweet, salty, sour, bitter). The concept is the same, though. A "patch" of localized input elements (e.g., pixels) is compared to another patch in a different part of the input for repetition, whether it's repeated somewhere else in time or in another part of the input space.
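
In code, the comparison itself doesn't care about dimensionality at all. Here's a bare-bones sketch of my own, in which a patch of any shape has simply been flattened into an array of values:

// Sum of absolute differences between two equal-length patches. Lower values
// mean the same pattern has (approximately) repeated, whether the patch came
// from a 1D signal, a 2D image, or a 4D taste space.
static class PatchCompare {
    public static double Difference(double[] patchA, double[] patchB) {
        if (patchA.Length != patchB.Length)
            throw new ArgumentException("Patches must be the same size.");

        double diff = 0;
        for (int i = 0; i < patchA.Length; i++)
            diff += Math.Abs(patchA[i] - patchB[i]);
        return diff;
    }
}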

Repetition as structure

We've seen that we can use coincidental repetitions of patterns as a way to separate "interesting" information from random noise. But we can do more with it. We can use pattern repetition as a way to discover structure in information.

Consider edges. Long ago, vision researchers discovered that our own visual systems can detect sharp contrasts in what we see and thus highlight them as edges. Implementing this in a computer turns out to be quite easy, as the following example illustrates:

It's tempting to think it is easy, then, to trace around these sharply contrasting regions to find whole textures or even whole objects. The problem is that in most natural scenarios, it doesn't work. Edges are interrupted because of low-contrast areas, as with the left-hand player's knee. Other, non-edge textures like the grass have high enough contrast to appear as edges in this sort of algorithm. True, people have made algorithms to reduce noise like this using crafty means, but the bottom line is that this approach is not sufficient for detecting edges in a general case.
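
Here's roughly what I mean by that kind of contrast-based detector. This is a generic sketch over a grayscale array with an arbitrary threshold, not the exact code behind the image above:

// Mark a pixel as an edge wherever the brightness difference to its right
// and lower neighbors exceeds a threshold. The grayscale array (values in
// the range 0 to 1) and the threshold are assumptions for illustration.
static bool[,] DetectEdges(double[,] gray, double threshold) {
    int height = gray.GetLength(0), width = gray.GetLength(1);
    bool[,] edges = new bool[height, width];

    for (int y = 0; y < height - 1; y++) {
        for (int x = 0; x < width - 1; x++) {
            double dx = Math.Abs(gray[y, x + 1] - gray[y, x]);  // horizontal contrast
            double dy = Math.Abs(gray[y + 1, x] - gray[y, x]);  // vertical contrast
            edges[y, x] = (dx + dy) > threshold;                // high contrast = edge
        }
    }

    return edges;
}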

The clincher comes when an edge is demarcated by a very soft, low-contrast transition, or is simply rough and irregular. Consider the following example of a gravel road, with its fuzzy edge:

As you can see, it's hard to find a high-contrast edge to the road using a typical pixel-contrast algorithm. There's higher contrast to be found in the brush beyond the road's edge, in fact. But what if one started with a patch along the edge of the road (as we perceive it) and searched for similar patches? Some of the best matches would likely be along that same edge. As such, this soft and messy edge should be much more easily found. The following mockup illustrates this concept:

In addition to finding "fuzzy" edges like this, patch matching can be used to discover 2D "regions" within an image. The surface of the road above, or of the brush along the side of it, might be found more reliably this way than with the more common color flood-fill technique.

I've explored these ideas a bit in my research, but I want to make clear that I haven't come up with the right kinds of algorithms to make these practical tools of perception as of yet.

Pattern funneling

One problem that plagues my machine vision research is that mechanisms like my Pattern Sniffer's neuron banks work great for learning to recognize things only when those things are perfectly placed within their soda-straw windows on the world. With Pattern Sniffer, the patterns are always lined up properly in a tiny array of pixels. It's not like it goes searching a large image for those known patterns, like a "Where's Waldo" search. For that kind of neuron bank to work well in a more general application, it's important for some other mechanism to "funnel" interesting information to the neuron bank, which then gains expertise in recognizing patterns.

Take textures, for instance. One algorithm could seek out textures by simply looking for localized repetition of a patch. A patch of grass could be a candidate, and other patch matches around that patch would help confirm that the first patch considered is not just a noisy fluke.

That patch, then, could be run through a neuron bank that knows lots of different textures. If it finds a strong match, it would say so. If not, a neuron in the bank that isn't yet an expert in some texture would temporarily learn the pattern. Subsequent repetition would help reinforce it for ever longer terms. This is what I mean by "funneling", in this case: distilling an entire texture down to a single, representative patch that is "standardized" for use by a simpler pattern-learning algorithm.
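
As a rough sketch of what I mean, building on the toy NeuronBank and PatchCompare sketches above, the funneling step might look something like this. ExtractPatch is an assumed helper that copies a square block of grayscale pixels into a flat array, and maxDiff is an arbitrary cutoff:

// Find out whether the patch at (x, y) is locally repeated, and if so,
// funnel that single representative patch to the pattern-learning neuron bank.
int FunnelTexture(NeuronBank bank, int x, int y, int patchSize, double maxDiff) {
    double[] candidate = ExtractPatch(x, y);

    // Check the four neighboring patch positions for repetition.
    int[] dx = { patchSize, -patchSize, 0, 0 };
    int[] dy = { 0, 0, patchSize, -patchSize };

    int repeats = 0;
    for (int i = 0; i < dx.Length; i++) {
        if (PatchCompare.Difference(candidate, ExtractPatch(x + dx[i], y + dy[i])) < maxDiff)
            repeats++;
    }

    // Only a locally repeated patch gets funneled on; a lone patch is
    // treated as a possible noisy fluke and ignored.
    return repeats >= 2 ? bank.Present(candidate) : -1;
}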

Assemblies of patterns

In principle, it should be possible to detect patterns composed of non-random coincidences of known patterns, too. Consider the above example of an image of grass and sky, along with some other stuff. Once it is established, using pattern funneling to a learned neuron bank, that the grass and sky textures were found in the image, these facts can be used as another pattern of input to another processor. Let's say we have a neuron bank that has, as inputs, the various known textures. After processing any single image, we have an indication of whether or not a given known texture is seen in that image, as indicated in the following diagram:

Shown lots of images, including a few different ones of grassy fields with blue skies, this neuron bank should come to recognize this repeated pattern of grass + sky as a pattern of its own. We could term this an "assembly of patterns".
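
In code, that second-level input could be as simple as a presence vector with one element per known texture. This sketch reuses the toy NeuronBank from earlier; the list of textures found is assumed to come from the funneling step:

// One input per known texture: 1.0 if that texture was found somewhere in
// the image, 0.0 if not. Repeated combinations (grass + sky, say) get
// claimed by one neuron in the higher-level bank.
int RecognizeAssembly(NeuronBank assemblyBank, int knownTextureCount,
                      int[] texturesFoundInImage) {
    double[] presence = new double[knownTextureCount];
    foreach (int textureIndex in texturesFoundInImage)
        presence[textureIndex] = 1.0;

    return assemblyBank.Present(presence);
}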

In a similar way, a different neuron bank could be set up with inputs that consider a time sequence of recognized patterns. It could be musical notes, for example, with each musical note being one dimension of input, and the last, say, 10 notes being another dimension of input. As such, this neuron bank could learn to recognize and separate simple melodies from random notes.

The goal: perception

The goal, as stated above, is to make a machine able to perceive objects, events, and attributes in a way that is more sophisticated than the trivial sensory level many robots and AI programs deal with today, and more like the way humans perceive. My sense is that the kinds of abstractions described above take me a little closer to that goal. But there's a lot more ground to cover.

For one thing, I really should try coding algorithms like the ones I've hypothesized about here.

One of the big limitations I can still see in this patch-centric approach to pattern recognition is the age-old problem of pattern invariance. I may make an algorithm that can recognize a pea on a white table at one scale, but as soon as I zoom in the camera a little, the pea looks much bigger and is no longer easily recognizable using a single-patch match against the previously known pea archetype. Perhaps some sort of pattern funneling could be made that deals specifically with scaling images to a standardized size and orientation before recognizing / learning algorithms get involved. Perhaps salience concepts, which seek out points of interest in a busy source image, could be used to help in pattern funneling, too.

Still, I think there's merit in vigorously pursuing this overarching notion of stable interpretations as a primary mechanism of perception.

Sunday, October 14, 2007

Rebuttal of the Chinese Room Argument

While discussing the subject of Artificial Intelligence in another forum, someone brought up the old "Chinese Room" argument against the possibility of AI. My wife suggested I post my response to the point, as it seems a good rebuttal of the argument itself.

If you're unfamiliar with the CR argument, there's a great entry in the Stanford Encyclopedia of Philosophy, which summarizes it as follows:

    The argument centers on a thought experiment in which someone who knows only English sits alone in a room following English instructions for manipulating strings of Chinese characters, such that to those outside the room it appears as if someone in the room understands Chinese. The argument is intended to show that while suitably programmed computers may appear to converse in natural language, they are not capable of understanding language, even in principle. Searle argues that the thought experiment underscores the fact that computers merely use syntactic rules to manipulate symbol strings, but have no understanding of meaning or semantics. Searle's argument is a direct challenge to proponents of Artificial Intelligence, and the argument also has broad implications for functionalist and computational theories of meaning and of mind. As a result, there have been many critical replies to the argument.

To my thinking, this is a basically flawed argument from the start. What if the instructions were given in English by another, Chinese-speaking (yes, I know "Chinese" is not a language) person? Really, the human following the processing rules is just a conduit for those processing rules. He might as well be a mail courier with no inkling of what's in the envelope he's delivering. It doesn't mean the person who sent the mail is not intelligent. The CR argument says absolutely nothing about the nature of the data processing rules. It dismisses out of hand the possibility that those rules could themselves constitute an intelligent program.

I think the CR argument holds some sway with people because they've seen the famous Eliza program from 1966 and tons of other chatbots based on it. Most of them take a sentence you type and respond to it either by reformulating it (e.g., replying to "I like chocolate" with "why do you like chocolate?") using predefined rules or by looking up random responses to certain keywords (e.g., responding to a search on "chocolate" in "I like chocolate" with "Willy Wonka and the Chocolate Factory grossed $475 million in box office receipts.")

Anyone who has interacted with a chatbot like this recognizes that it's easy to be fooled, at first, by this sort of trickery. The problem with the Chinese Room argument is that it posits that this is all a computer can do, without providing any real proof. In fact, the human mind is the product of the human nervous system and, really, the whole body. But that body is a machine. It's constructed of material parts that all obey physical laws. A computer is no different in this sense. What separates a cheap computer trick like Eliza from a human mind is how their systems are structured.

I take it as obvious, these days, that it's possible to make a machine that can reason and act "intelligent" like we do, generally. And I've never seen the CR argument as having any real bearing on the possibility of intelligent machines. It only provides a cautionary note about the difference between faking intelligence and actually being intelligent.

Sunday, October 7, 2007

Video stabilizer

I haven't had much chance to do coding for my AI research of late. My most recent experiment dealt more with patch matching in video streams. Here's a source video, taken from a hot air balloon, with a run of what I'll call a "video stabilizer" applied:


Full video with "follower" frame.
Click here to open this WMV file

Contents of the follower frame.
Click here to open this WMV file

The colored "follower" frame in the left video does its best to lock onto the subject it first sees when it appears. As the follower moves off center, a new frame is created in the center to take over. The right video is of the contents of the colored frame. (If the two videos appear out of sync, try refreshing this page once the videos are totally loaded.)

This algorithm does a surprisingly good job of tracking the ambient movement in this particular video. That was the point, though. I wondered how well a visual system could learn to identify stable patterns in a video if the video was not stable in the first place. I reasoned that an algorithm like this could help a machine vision system to make the world a little more stable for second level processing of source video.

The algorithm for this feat is unbelievably simple. I have a code class representing a single "follower" object. A follower has a center point, relative to the source video, and a width and height. We'll call this a "patch" of the video frame. With each passing frame, it does a pixel-by-pixel comparison of what's inside the current patch against the contents of the next video frame, in search of a good match.

For each patch considered in the next frame, a difference calculation is performed, which is very simple. For each pixel in the two corresponding patches (current-frame and next-frame) under consideration, the absolute differences in the red, green, and blue values are added to a running difference total. The candidate patch that has the lowest total difference is considered the best match and is thus where the follower goes in this next frame. Here's the code for comparing the current patch against a candidate patch in the next frame:


// Compares the follower's current patch against a candidate patch in the
// next frame, offset by (OffsetX, OffsetY). Lower totals mean a closer match.
private int CompareRegions(int OffsetX, int OffsetY) {
    int X, Y, Diff;
    Color A, B;

    // Sample only every Nth pixel as a speed optimization (see below).
    const int ScanSpacing = 10;

    Diff = 0;

    for (Y = CenterY - RadiusY; Y <= CenterY + RadiusY; Y += ScanSpacing) {
        for (X = CenterX - RadiusX; X <= CenterX + RadiusX; X += ScanSpacing) {
            A = GetPixel(CurrentBmp, X, Y);
            B = GetPixel(NextBmp, X + OffsetX, Y + OffsetY);

            // Accumulate the absolute difference of each color channel.
            Diff +=
                Math.Abs(A.R - B.R) +
                Math.Abs(A.G - B.G) +
                Math.Abs(A.B - B.B);
        }
    }

    return Diff;
}

Assuming the above gibberish makes any sense, you may notice "Y += ScanSpacing" and the same for X. That's an optimization. In fact, the program does include a number of performance optimizations that help make the run-time on these processes more bearable. First, a follower doesn't consider all possible patches in the next frame to decide where to move. It only considers patches within a certain radius of the current location. OffsetX, for example, may only be +/- 50 pixels, which means if the subject matter in the video slides horizontally more than 50 pixels between frames, the algorithm won't work right. Still, this can increase frame processing rates 10-fold, with smaller search radii yielding shorter run-times.

As for "Y += ScanSpacing", that was a shot in the dark for me. I was finding frame processing was still taking a very long time. So I figured, why not skip all but every Nth pixel in the patches during the patch comparison operation? I was surprised to find that even with a ScanSpacing of 10 (with a patch at least 60 pixels wide or tall), the follower didn't lose much of its ability to track the subject matter. Not surprisingly, the higher the scan spacing, the lower the fidelity, but the faster it runs. Doubling ScanSpacing means a 4-fold increase in the frame processing rate.
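
For completeness, here's roughly how a follower uses CompareRegions to decide where to move. This is a sketch rather than the exact code, and the SearchRadius constant just mirrors the +/- 50 pixel example above:

// Try every offset within SearchRadius of the current position and keep the
// one whose patch differs least from the current patch.
private void MoveToBestMatch() {
    const int SearchRadius = 50;   // assumed; see the +/- 50 pixel note above

    int bestDiff = int.MaxValue;
    int bestOffsetX = 0, bestOffsetY = 0;

    for (int offsetY = -SearchRadius; offsetY <= SearchRadius; offsetY++) {
        for (int offsetX = -SearchRadius; offsetX <= SearchRadius; offsetX++) {
            int diff = CompareRegions(offsetX, offsetY);
            if (diff < bestDiff) {
                bestDiff = diff;
                bestOffsetX = offsetX;
                bestOffsetY = offsetY;
            }
        }
    }

    // The follower's center shifts to wherever the best match was found.
    CenterX += bestOffsetX;
    CenterY += bestOffsetY;
}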

I am inclined to think the process demonstrated in the above video is analogous to what our own eyes do. In any busy motion scene, I think your eyes engage in a very muscle-intensive process of locking in, moment by moment, on stable points of interest. In this case, the follower's fixation is chosen at random, essentially. Whatever is in the center becomes the fixation point. Still, the result is that our eyes can see the video, frame by frame, as part of a continuous, stable world. By fixating on some point while the view is in motion, whether on a television or looking out a car window, we get that more stable view.

Finally, one thought that kinda drives this research, but is really secondary to it, is that this could be a practical algorithm for video stabilization. In fact, I suspect the makers of video cameras are using it in their digital stabilization. It would be interesting to see someone create a freeware product or plug-in for video editing software because the value seems pretty obvious.