
Tuesday, November 13, 2007

Confirmation bias as a tool of perception

I've been trying to figure out where to go next with my study of perception. One concept I'm exploring is the idea that our expectations enhance our ability to recognize patterns.

I recently found a brilliant illustration of this from researcher Matt Davis, who studies how humans process language. Try out the following audio samples. Listen to the first one several times. It's a "vocoded" version of the plain English recording that follows. Can you tell what's being said?

Vocoded version.

Click here to open this WAV file

Give up? Now listen to the plain English version once and then listen to the vocoded version again.

Clear English version.

Click here to open this WAV file

Davis refers to this a-ha effect as "pop-out":

    Perhaps the clearest case of pop-out occurs if you listen to a vocoded sentence before and immediately after you hear the same sentence in clear form. It is likely that the vocoded sentence will sound a lot clearer when you know the identity of that sentence.

To me, this is a wonderful example of confirmation bias. Once you have an expectation of what to look for in the data, you quickly find it.

How does this relate to perception? I believe that recognizing patterns in real world data involves not only the data causing simple pattern matching to occur (bottom up), but also higher level expectations prompting the lower levels to search for expected patterns (top down). To help illustrate and explain, consider how you might engineer a specific task of perception: detecting a straight line in a picture. If you're familiar with machine vision, you'll know this is an age-old problem that has been fairly well solved using some good algorithms. Still, it's not trivial. Consider the following illustration of a picture of a building and some of the steps leading up to our thought experiment:

The first three steps we'll take are pretty conventional ones. First, we get our source image. Second, we apply a filter that looks at each pixel to see if it strongly contrasts with its neighbors. Our output is represented by a grayscale image, with black pixels representing strong contrasts in the source image. In our third step, we "threshold" our contrast image so each pixel goes either to black or white; no shades of gray.
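For programmers who want something concrete, here is a minimal sketch of steps two and three, assuming a crude brightness-difference measure of contrast rather than any particular edge filter; the class and method names are just placeholders, not code from a real system.

using System;
using System.Drawing;

static class ContrastThreshold
{
    // Brightness of a pixel, 0-255.
    static int Luma(Color c) { return (c.R + c.G + c.B) / 3; }

    // Step 2: contrast image. Each pixel's contrast is the larger brightness
    // difference between it and its right and lower neighbors.
    // Step 3: threshold. Anything above `threshold` counts as a "black"
    // (strong contrast) pixel; everything else counts as white.
    public static bool[,] FindContrastPixels(Bitmap src, int threshold)
    {
        bool[,] black = new bool[src.Width, src.Height];
        for (int y = 0; y < src.Height - 1; y++)
        {
            for (int x = 0; x < src.Width - 1; x++)
            {
                int center = Luma(src.GetPixel(x, y));
                int contrast = Math.Max(
                    Math.Abs(center - Luma(src.GetPixel(x + 1, y))),
                    Math.Abs(center - Luma(src.GetPixel(x, y + 1))));
                black[x, y] = contrast > threshold;
            }
        }
        return black;
    }
}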

Here's where our line detection begins. We start by making a list of all connected groups of neighboring black pixels containing, say, 10 or more pixels. Next, we filter these by seeing which have a large number of pixels roughly fitting a line function. We end up with a bunch of small line segments. Traditionally, we could stop here, but we don't have to. We could pick any of these line segments and extend it out in either direction to see how far it can go and still find black pixels that roughly fit that line function. We might even tolerate a gap of a white pixel or two as we continue extending outward. And we might try variations of the line function that fit even better as the segment gets longer, in order to further refine it. But then uncertainty kicks in, and we conservatively stop stretching the segment once we no longer see black pixels.

Here's where confirmation bias can help. Once we have a bunch of high-certainty line segments to work with, we now have expectations set about where lines form. So maybe we take our line segments back to the grayscale version of the contrast image. To my thinking, those gray pixels that got thresholded to white earlier still contain useful information. In fact, each gray pixel in the hypothesized line provides "evidence" that the line continues onward; that the "hypothesis" is "valid". It doesn't even matter that there may be lots of other gray -- or even black -- pixels just outside the hypothesized line. They don't add to or detract from the hypothesis. Only the "positive confirmation" of gray pixels adds weight to the hypothesis that the line extends further than we could tell from the black pixels in the thresholded version. Naturally, as the line extends out, we may get to a point where most of the pixels are white or light. Then we stop extending our line.
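Here is a minimal sketch of that extension step, assuming the grayscale contrast image is available as a 2D array (0 = white, 255 = strongest contrast) and that a line hypothesis is given by a start point and a unit direction. The evidence weights and stopping threshold are made-up illustrations, not a tuned algorithm.

using System;
using System.Drawing;

static class LineExtender
{
    // Extend a hypothesized line beyond its last confirmed black pixel, treating
    // gray pixels along the way as weak positive evidence that the line continues.
    // contrast[x, y] ranges from 0 (white) to 255 (strong contrast).
    public static Point Extend(int[,] contrast, Point start,
                               double dirX, double dirY, int maxSteps)
    {
        Point end = start;
        double evidence = 1.0;      // running confidence in the hypothesis
        const double Decay = 0.8;   // near-white pixels erode confidence
        const double GiveUp = 0.2;  // stop when confidence falls this low

        for (int step = 1; step <= maxSteps && evidence > GiveUp; step++)
        {
            int x = (int)Math.Round(start.X + dirX * step);
            int y = (int)Math.Round(start.Y + dirY * step);
            if (x < 0 || y < 0 ||
                x >= contrast.GetLength(0) || y >= contrast.GetLength(1))
                break;

            // Darker pixels add weight in proportion to their contrast;
            // nearly-white pixels let the confidence decay instead.
            double weight = contrast[x, y] / 255.0;
            if (weight > 0.1)
                evidence = Math.Min(1.0, evidence + 0.5 * weight);
            else
                evidence *= Decay;

            if (evidence > GiveUp)
                end = new Point(x, y);
        }
        return end;
    }
}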

I love this example. It shows how we can start with the source data "suggesting" certain known patterns (here, lines), and how a higher level model can then set expectations about bigger patterns that are not immediately visible (longer lines) and use otherwise "weak evidence" (light gray pixels) as additional confirmation that such patterns are indeed found. To me, this is a wonderful illustration of inductive reasoning at work. The dark pixels may give strong, deductive proof of the existence of lines in the source data, but the light pixels that fit the extended line functions give weaker inductive evidence of the same.

I don't mean to suggest that perception is now solved. This example works because I've predefined a model of an "object"; here, a line. I could extend the example to search for ellipses, rectangles, and so on. But having to predefine these primitive object types seems to miss the point that we are quite capable of discovering these and much more sophisticated models for ourselves. There's no real learning in my example; only refinement. Still, I like that this illustrates how confirmation bias -- something of a dirty phrase in the worlds of science and politics -- probably plays a central role in the nature of perception.

Tuesday, November 6, 2007

What bar code scanners can tell us about perception

It may not be obvious, but a basic bar code scanner does something that machine vision researchers would love to see their own systems do: find objects amidst noisy backgrounds of visual information. What is an "object" to a bar code scanner? To answer that, let's start by explaining what a bar code is.

What is a bar code?

You've probably seen bar codes everywhere. Typically, they are represented as a series of vertical bars with a number or code underneath. There are many standards for bar codes, but we'll limit ourselves to one narrow class, typified by the following example:

This sort of bar code has a start code and an end code, each of which typically features a very wide bar, sometimes 4x the unit width. One of the main purposes of this wide bar is to serve as a reference for the widths of the bars that follow. The remaining bars, and the gaps between them, will each be some multiple of that unit width (e.g., 1x, 2x, or 3x). Each sequence of bars and gaps maps to a unique number (or letter or other symbol) that is specified in advance by the standard for that kind of bar code.
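As a rough sketch of what a decoder does with this, assuming the wide reference bar really is 4x the unit width, a measured bar or gap can be quantized to its nearest allowed multiple like so (the tolerance here is just an illustrative choice):

using System;

static class BarWidths
{
    // Given the measured width of the wide reference bar (assumed here to be
    // 4x the unit width), classify a measured bar or gap as 1x, 2x, or 3x.
    // Returns 0 if the measurement doesn't fit any allowed multiple.
    public static int ClassifyWidth(double referenceBarWidth, double measuredWidth)
    {
        double unit = referenceBarWidth / 4.0;            // illustrative assumption
        int multiple = (int)Math.Round(measuredWidth / unit);
        double error = Math.Abs(measuredWidth - multiple * unit);

        // Accept only measurements within half a unit of an allowed multiple;
        // anything else is treated as noise rather than part of a bar code.
        if (multiple >= 1 && multiple <= 3 && error < unit / 2.0)
            return multiple;
        return 0;
    }
}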

A bar code scanner, like the handheld version pictured at right, doesn't actually care that the code is 2D, as you see it. To the scanner, the input is a stream of alternating light and dark signals, typically furnished by a laser signal bouncing off white paper or being absorbed by black ink (or reflecting / not reflecting off an aluminum can, etc.). If you're a programmer or Photoshop guru, you could visualize this as starting with a digital snapshot of a bar code, cropping away all but a single-pixel-wide line that cuts across the bar code, and then applying a threshold to convert it into a black-and-white image devoid of color and even shades of gray.

The size of the bar code doesn't much matter, either. Within a certain, wide range, a bar code scanner will take any string of solid black as a potential start of a bar code, whether it's small or large and whether it's off to the left or the right of the center of the scanner's view.

What the scanner is doing with this stream of information is looking for the beginning and ending of a black section and using that first sample as a cue to look for the rest of the start code (or stop code; the bar code could be upside down) following it. If it finds that pattern, it continues looking for the patterns that follow, translating them into the appropriate digits, letters, or symbols, until it reaches the stop code.

Now, bar codes are often damaged. And they often appear against a noisy background of information. In fact, the inventors of bar code standards are very aware that a random pattern on a printed page could be misinterpreted as a bar code. They dealt with this by adding several checks. For instance, one or more of the digits in a bar code are reserved as a "check code", the output of a mathematical function applied to the other data. The scanner applies the same function; if the output doesn't match the check code it read, the candidate bar code scan is rejected as corrupt. Even the digit representations themselves use only a small subset of all possible bar/gap combinations in order to reduce the chance that an errant spot or other invalid information could be misconstrued as a valid bar code. In fact, the odds that a bar code scanner could misread a bar code like the one above are so infinitesimally small that engineers and clerks can place nearly 100% confidence in their bar codes. A bar code either does or does not scan. There's no "kinda".
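To make the check-code idea concrete, here is the familiar UPC-A flavor of that arithmetic (I'm not claiming the bar code pictured above uses this particular scheme):

using System;

static class CheckDigit
{
    // UPC-A style check digit: weight the digits in odd positions (1st, 3rd, ...)
    // by 3, add the even-position digits, and pick the digit that brings the
    // total up to a multiple of 10. The scanner recomputes this and rejects
    // the scan if it doesn't match the check digit it read.
    public static int Compute(int[] dataDigits)   // e.g., the first 11 digits of a UPC-A code
    {
        int sum = 0;
        for (int i = 0; i < dataDigits.Length; i++)
            sum += dataDigits[i] * (i % 2 == 0 ? 3 : 1);   // i == 0 is the 1st digit
        return (10 - sum % 10) % 10;
    }

    public static bool IsValidScan(int[] dataDigits, int checkDigitRead)
    {
        return Compute(dataDigits) == checkDigitRead;
    }
}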

Seeing things

Bar codes have been engineered so well that it's possible to leave a scanner turned on 24/7, scanning out over a wide area, seeing all sorts of noise continuously, and be nearly 100% guaranteed that when it thinks it sees a bar code in the environment, it is correct. Some warehouses feature stationary bar code scanners that scan large boxes as they are moved along by fork lifts, for instance.

What does this have to do with machine vision? Isn't it amazing that a bar code scanner can deal with an incredibly noisy environment and still have nearly 100% accuracy when it finds a bar code? This is very much like how you can pick out a human face in a busy picture with nearly 100% accuracy. There are all sorts of things that may ping your face recognition capacity, but when your focus is brought to bear on them, your skill at filtering out noise and correctly identifying the real faces is incredible, just like the bar code scanner. What's more, it doesn't matter where in your visual field the face is or how near or far it is, within a reasonable range. Just like the scanner.

Vision researchers are still hard pressed to provide an accounting of how we perceive the world visually. Machine vision researchers have been doing all sorts of neat things for decades, but we're still barely scratching the surface, here, for lack of a comprehensive theory of perception. Yet engineers creating bar codes decades ago actually solved this problem in a narrow case.

A good bar code scanner has an elegant solution to the problems of noise, scale invariance (zoom and offset), and bounds detection (via start and stop codes). Its designers even made it so a single bar code can represent one of billions of unique messages, not just serve as a simple there/not-there marker.

The bigger picture

Of course, I don't want to suggest that bar code scanners hold the key to solving the basic problem of perception. You probably have already guessed that the secret to bar codes is that they follow well engineered standards that make it almost easy to pick bar codes out of a noisy environment. Vision researchers have likewise made many systems that are quite capable of picking out human faces, as well as a variety of special classes of clearly definable objects.

It's pretty much accepted wisdom in human brain research now that much of what we see in the world is what we are looking to find. A bar code scanner works because it knows what to look for. Obviously, one key difference between your perceptual faculty and a bar code scanner is that the scanner is "born" with all the knowledge it needs, while you have to learn how faces, chairs, and cars "work" for yourself.

Still, for people wondering how to approach the question of perception, bar coding is not a bad analogy to start with.

Sunday, October 21, 2007

Perception as construction of stable interpretations

I've been spending a lot of time lately thinking about the nature of perception. As I've said before, I believe AI has gotten stuck at the two coastlines of intelligence: the knee-jerk-reaction of the sensory level and the castles-in-the-sky of the conceptual level. We've been missing the huge interior of the perceptual level of intelligence. It's not that programmers are ignoring the problem. They just don't have much in the way of a theoretical framework to work with, yet. People don't really know yet how humans perceive, so it's hard to say how a machine could be made to perceive in a way familiar to humans.

Example of a stable interpretation

I've been focused very much on the principle of "stable interpretation" as a fundamental component of perception. To illustrate what I mean by "stable", consider the following short video clip:


Click here to open this WMV file

This is taken from a larger video I've used in other vision experiments. In this case, I've already applied a program that "stabilizes" the source video by tracking some central point as it moves from frame to frame and clipping out the margins. Even so, you can still see motion. The camera is tilting. The foreground is sliding from right to left. And there is a noticeable flicker of pixels because the source video is of a low resolution. On the other hand, you have no trouble at all perceiving each frame as part of a continuous scene. You don't see frames, really. You just see a rocky shore and sky in apparent motion as the camera moves along. That's what perception in a machine should be like, too.

The problem is that the interpretation of a static scene in which only the camera moves does not arise directly from the source data. If you were to watch a single pixel in this video as the frames progress, you'd see that even it changes, literally from frame to frame. Also, individual rocks do move relative to the frame and to each other. Yet you easily sense that there's a rigid arrangement of rocks. How?

One way of forming a stable view is one I've dabbled in for a long time: patch matching. Here, I took a source video and put a smaller "patch" in it that's the size of the video frames you see here. With each passing frame, my code compares different places to move the patch frame to, in hopes of finding the best-matching candidate patch. As you can see, it works pretty well. But this is a very brittle algorithm. Were I to include subsequent frames, where a man runs through the scene, you would see that the patch "runs away" from the man because his motion breaks up the "sameness" from frame to frame. My interpretation is that the simple patch comparisons I use are insufficient; this cheap trick is, at best, a small component in a larger toolset needed for constructing stable interpretations. A more robust system would be able to stay locked on the stable backdrop as the man runs through the scene, for instance.

What is a stable interpretation?

What makes an interpretation of information "stable"? The video example above is riddled with noise. One fundamental thing perception does is filter out noise. If, for example, I painted a red pixel into one frame of the video, you might notice it, but you would quickly conclude that it is random noise and ignore it. If I painted that same red pixel into several more frames, you would no longer consider it noise, but some artifact with a significant cause. Seeing the same information repeated is the essence of non-randomness.

"Stability", in the context of perception, can be defined as "the coincidental repetition of information that suggests a persistent cause for that information."

My Pattern Sniffer program and blog entry illustrate one algorithm for learning that is based almost entirely on this definition of stability. The program is confronted with a series of patterns. Over time, individual neurons come to learn to strongly recognize the patterns. Even when I introduced random noise distorting the images, it still worked very well at learning "idealized" versions of the distorted patterns that do not reflect the noise. Shown a given image once, a "free" neuron might instantly learn it, but without repetition over time, it would quickly forget the pattern. My sense is that Pattern Sniffer's neuron bank algorithm is very reusable in many contexts of perception, but it's obviously not a complete vision system, per se.
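The Pattern Sniffer post has the real code; the following is only a toy sketch of the general behavior described here -- a bank of neurons in which the best match learns a presented pattern and strengthens with repetition, while every other neuron slowly decays toward being "free" again. The learning rates and decay constants are arbitrary.

using System;
using System.Linq;

class Neuron
{
    public double[] Weights;
    public double Confidence;          // grows with repetition, decays without it

    public Neuron(int size, Random rng)
    {
        Weights = new double[size];
        for (int i = 0; i < size; i++) Weights[i] = rng.NextDouble();
    }

    // Similarity = 1 - mean absolute difference (inputs assumed in 0..1).
    public double Match(double[] pattern)
    {
        double diff = 0;
        for (int i = 0; i < pattern.Length; i++)
            diff += Math.Abs(pattern[i] - Weights[i]);
        return 1.0 - diff / pattern.Length;
    }
}

class NeuronBank
{
    readonly Neuron[] neurons;

    public NeuronBank(int neuronCount, int patternSize)
    {
        Random rng = new Random();
        neurons = new Neuron[neuronCount];
        for (int i = 0; i < neuronCount; i++) neurons[i] = new Neuron(patternSize, rng);
    }

    public void Present(double[] pattern)
    {
        // The best-matching neuron learns the pattern a little more strongly;
        // every other neuron's confidence decays a little, so a pattern that is
        // never repeated is eventually forgotten (its neuron becomes "free" and
        // learns quickly the next time something new comes along).
        Neuron winner = neurons.OrderByDescending(n => n.Match(pattern)).First();
        foreach (Neuron n in neurons)
        {
            if (n == winner)
            {
                double rate = 1.0 - n.Confidence;   // free neurons learn fast
                for (int i = 0; i < pattern.Length; i++)
                    n.Weights[i] += rate * (pattern[i] - n.Weights[i]);
                n.Confidence = Math.Min(1.0, n.Confidence + 0.1);
            }
            else
            {
                n.Confidence = Math.Max(0.0, n.Confidence - 0.01);
            }
        }
    }
}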

What is repetition?

When I speak of repetition, in the context of Pattern Sniffer, it's obvious that I mean showing the neuron bank a given pattern many times. But that's not the only form of repetition that matters to perception. Consider the following pie chart image:

When you look at the "Comedy" (27%) wedge, you see that it is solid orange. You instantly perceive it as a continuous thing, separable from the rest of the image. Why? Because the orange color is repeated across many pixels. Here's a more interesting example image of a wall of bricks:

Your visual perception instantly grasps that the bricks are all the "same". Not literally, if you consider each pixel in each brick, but in a deep sense, you see them as all the same. The brick motif repeats itself in a regularized pattern.

When your two eyes are working properly, they will tend to fixate on the same thing. Your vision is thus recognizing that what your left eye sees is repeated also in your right eye, approximately.

In each of these cases, one can apply the patch comparison approach to searching for repeated patterns. This is just in the realm of vision and only considers 2D patches of source images. But the same principle can be applied to any form of input. A "patch" can be a 1D pattern in linear data, just the same. Or it could encompass a 4D space of taste components (sweet, salty, sour, bitter). The concept is the same, though. A "patch" of localized input elements (e.g., pixels) is compared to another patch in a different part of the input for repetition, whether it's repeated somewhere else in time or in another part of the input space.

Repetition as structure

We've seen that we can use coincidental repetitions of patterns as a way to separate "interesting" information from random noise. But we can do more with it. We can use pattern repetition as a way to discover structure in information.

Consider edges. Long ago, vision researchers discovered that our own visual systems can detect sharp contrasts in what we see and thus highlight them as edges. Implementing this in a computer turns out to be quite easy, as the following example illustrates:

It's tempting to think it is easy, then, to trace around these sharply contrasting regions to find whole textures or even whole objects. The problem is that in most natural scenarios, it doesn't work. Edges are interrupted because of low-contrast areas, as with the left-hand player's knee. Other non-edge textures like the grass are high enough contrast to appear as edges in this sort of algorithm. True, people have made algorithms to reduce noise like this using crafty means, but the bottom line is that this approach is not sufficient for detecting edges in a general case.

The clincher comes when an edge is marked by a very soft, low-contrast transition, or is simply rough. Consider the following example of a gravel road, with its fuzzy edge:

As you can see, it's hard to find a high contrast edge to the road using a typical, pixel contrast algorithm. There's higher contrast to be found in the brush beyond the road's edge, in fact. But what if one started with a patch along the edge of the road (as we perceive it) and searched for similar patches? Some of the best matches would likely be along that same edge. As such, this soft and messy edge should be much more easily found. The following mockup illustrates this concept:

In addition to discovering "fuzzy" edges like this more readily, patch matching can be used to discover 2D "regions" within an image. The surface of the road above, or of the brush along its side, might be found more reliably this way than with the more common color flood-fill technique.
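A minimal sketch of what "searching for similar patches along the edge" might look like, assuming some patch-similarity function is supplied (for instance, one based on summed pixel differences, normalized to 0..1); every name here is hypothetical.

using System;
using System.Collections.Generic;
using System.Drawing;

static class FuzzyEdgeTracer
{
    // Walk outward from a seed patch along a candidate edge direction, keeping
    // only steps whose patch looks like the seed. A soft, low-contrast edge
    // still "repeats" the seed patch's look, so similar candidates are accepted
    // as continuations of the edge even where no sharp contrast exists.
    public static List<Point> Trace(
        Point seed, double dirX, double dirY, int maxSteps,
        Func<Point, Point, double> similarity, double minSimilarity)
    {
        List<Point> edge = new List<Point> { seed };
        for (int step = 1; step <= maxSteps; step++)
        {
            Point candidate = new Point(
                (int)Math.Round(seed.X + dirX * step),
                (int)Math.Round(seed.Y + dirY * step));

            if (similarity(seed, candidate) < minSimilarity)
                break;
            edge.Add(candidate);
        }
        return edge;
    }
}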

I've explored these ideas a bit in my research, but I want to make clear that I haven't come up with the right kinds of algorithms to make these practical tools of perception as of yet.

Pattern funneling

One problem that plagues me with machine vision research is that mechanisms like my Pattern Sniffer's neuron banks work great for learning to recognize things only when those things are perfectly placed within their soda-straw windows on the world. With Pattern Sniffer, the patterns are always lined up properly in a tiny array of pixels. It's not like it goes searching a large image for those known patterns, like a "Where's Waldo" search. For that kind of neuron bank to work well in a more general application, it's important for some other mechanism to "funnel" interesting information to the neuron bank that gains expertise in recognizing patterns.

Take textures, for instance. One algorithm could seek out textures by simply looking for localized repetition of a patch. A patch of grass could be a candidate, and other patch matches around that patch would help confirm that the first patch considered is not just a noisy fluke.

That patch, then, could be run through a neuron bank that knows lots of different textures. If it finds a strong match, it would say so. If not, a neuron in the bank that isn't yet an expert in some texture would temporarily learn the pattern. Subsequent repetition would help reinforce it for ever longer terms. This is what I mean by "funneling", in this case: distilling an entire texture down to a single, representative patch that is "standardized" for use by a simpler pattern-learning algorithm.
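Here's a toy sketch of the first half of that funneling idea -- flagging a patch as a candidate texture because its neighbors repeat it -- under the assumption that a patch-similarity function like the ones above is available. The candidate would then be handed to a neuron bank (like the toy one sketched earlier) to be recognized or temporarily learned. The thresholds are arbitrary.

using System;
using System.Drawing;

static class TextureFunnel
{
    // A patch is a candidate texture if several of its immediate neighbors
    // repeat it; a lone match is more likely to be a noisy fluke.
    public static bool LooksLikeTexture(
        Point patch, int patchSize, Func<Point, Point, double> similarity)
    {
        int repeats = 0;
        int[] offsets = { -patchSize, 0, patchSize };
        foreach (int dy in offsets)
            foreach (int dx in offsets)
            {
                if (dx == 0 && dy == 0) continue;
                Point neighbor = new Point(patch.X + dx, patch.Y + dy);
                if (similarity(patch, neighbor) > 0.8)
                    repeats++;
            }
        return repeats >= 4;
    }
}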

Assemblies of patterns

In principle, it should be possible to detect patterns composed of non-random coincidences of known patterns, too. Consider the above example of an image of grass and sky, along with some other stuff. Once it is established, using pattern funneling to a learned neuron bank, that the grass and sky textures were found in the image, these facts can be used as another pattern of input to another processor. Let's say we have a neuron bank that has, as inputs, the various known textures. After processing any single image, we have an indication of whether or not a given known texture is seen in that image, as indicated in the following diagram:

Shown lots of images, including a few different ones of grassy fields with blue skies, this neuron bank should come to recognize this repeated pattern of grass + sky as a pattern of its own. We could term this an "assembly of patterns".

In a similar way, a different neuron bank could be set up with inputs that consider a time sequence of recognized patterns. It could be musical notes, for example, with each musical note being one dimension of input, and the last, say, 10 notes being another dimension of input. As such, this neuron bank could learn to recognize and separate simple melodies from random notes.

The goal: perception

The goal, as stated above, is to make a machine able to perceive objects, events, and attributes in a way that is more sophisticated than the trivial sensory level many robots and AI programs deal with today, something closer to what humans do. My sense is that the kinds of abstractions described above take me a little closer to that goal. But there's a lot more ground to cover.

For one thing, I really should try coding algorithms like the ones I've hypothesized about here.

One of the big limitations I can still see in this patch-centric approach to pattern recognition is the age-old problem of pattern invariance. I may make an algorithm that can recognize a pea on a white table at one scale, but as soon as I zoom in the camera a little, the pea looks much bigger, and no longer is easily recognizable using a single-patch match against the previously known pea archetype. Perhaps some sort of pattern funneling could be made that deals specifically with scaling images to a standardized size and orientation before recognizing / learning algorithms get involved. Perhaps salience concepts, which seek out points of interest in a busy source image, could be used to help in pattern funneling, too.

Still, I think there's merit in vigorously pursuing this overarching notion of stable interpretations as a primary mechanism of perception.

Sunday, October 14, 2007

Rebuttal of the Chinese Room Argument

While discussing the subject of Artificial Intelligence in another forum, someone brought up the old "Chinese Room" argument against the possibility of AI. My wife suggested I post my response to the point, as it seems a good rebuttal of the argument itself.

If you're unfamiliar with the CR argument, there's a great entry in the Stanford Encyclopedia of Philosophy. It summarizes as follows:

    The argument centers on a thought experiment in which someone who knows only English sits alone in a room following English instructions for manipulating strings of Chinese characters, such that to those outside the room it appears as if someone in the room understands Chinese. The argument is intended to show that while suitably programmed computers may appear to converse in natural language, they are not capable of understanding language, even in principle. Searle argues that the thought experiment underscores the fact that computers merely use syntactic rules to manipulate symbol strings, but have no understanding of meaning or semantics. Searle's argument is a direct challenge to proponents of Artificial Intelligence, and the argument also has broad implications for functionalist and computational theories of meaning and of mind. As a result, there have been many critical replies to the argument.

To my thinking, this is a basically flawed argument from the start. What if the instructions were given in English by another, Chinese-speaking (yes, I know "Chinese" is not a language) person? Really, the human following the processing rules is just a conduit for those processing rules. He might as well be a mail courier with no inkling what's in the envelope he's delivering. It doesn't mean the person who sent the mail is not intelligent. The CR argument says absolutely nothing about the nature of the data processing rules. It dismisses the possibility that those rules could constitute an intelligent program without consideration.

I think the CR argument holds some sway with people because they've seen the famous Eliza program from 1966 and tons of other chatbots based on it. Most of them take a sentence you type and respond to it either by reformulating it (e.g., replying to "I like chocolate" with "why do you like chocolate?") using predefined rules or by looking up random responses to certain keywords (e.g., responding to a search on "chocolate" in "I like chocolate" with "Willy Wonka and the Chocolate Factory grossed $475 million in box office receipts.")
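For anyone who hasn't poked at one of these, the whole trick fits in a few lines. This is a caricature of the Eliza approach, not its actual rule set:

using System;
using System.Text.RegularExpressions;

static class TinyEliza
{
    public static string Respond(string input)
    {
        // Reformulation rule: "I like X" -> "Why do you like X?"
        Match m = Regex.Match(input, @"^I like (.+)$", RegexOptions.IgnoreCase);
        if (m.Success)
            return "Why do you like " + m.Groups[1].Value + "?";

        // Keyword rule: a canned response triggered by a keyword, chosen with
        // no understanding of what was actually said.
        if (input.IndexOf("chocolate", StringComparison.OrdinalIgnoreCase) >= 0)
            return "Willy Wonka and the Chocolate Factory is a film about chocolate.";

        return "Tell me more.";
    }
}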

Anyone who has interacted with a chatbot like this recognizes that it's easy to be fooled, at first, by this sort of trickery. The problem with the Chinese Room argument is that it posits that this is all a computer can do, without providing any real proof. In fact, the human mind is the product of the human nervous system and, really, the whole body. But that body is a machine. It's constructed of material parts that all obey physical laws. A computer is no different in this sense. What separates a cheap computer trick like Eliza from a human mind is how their systems are structured.

I take it as obvious, these days, that it's possible to make a machine that can reason and act "intelligent" like we do, generally. And I've never seen the CR argument as having any real bearing on the possibility of intelligent machines. It only provides a cautionary note about the difference between faking intelligence and actually being intelligent.

Sunday, October 7, 2007

Video stabilizer

I haven't had much chance to do coding for my AI research of late. My most recent experiment dealt more with patch matching in video streams. Here's a source video, taken from a hot air balloon, with a run of what I'll call a "video stabilizer" applied:


Full video with "follower" frame.
Click here to open this WMV file

Contents of the follower frame.
Click here to open this WMV file

The colored "follower" frame in the left video does its best to lock onto the subject it first sees when it appears. As the follower moves off center, a new frame is created in the center to take over. The right video is of the contents of the colored frame. (If the two videos appear out of sync, try refreshing this page once the videos are totally loaded.)

This algorithm does a surprisingly good job of tracking the ambient movement in this particular video. That was the point, though. I wondered how well a visual system could learn to identify stable patterns in a video if the video was not stable in the first place. I reasoned that an algorithm like this could help a machine vision system to make the world a little more stable for second level processing of source video.

The algorithm for this feat is unbelievably simple. I have a code class representing a single "follower" object. A follower has a center point, relative to the source video, and a width and height. We'll call this a "patch" of the video frame. With each passing frame, it does a pixel-by-pixel comparison of what's inside the current patch against the contents of the next video frame, in search of a good match.

For each patch considered in the next frame, a very simple difference calculation is performed. For each pixel in the two corresponding patches (current-frame and next-frame) under consideration, the differences in the red, green, and blue values are added to a running difference total. The candidate patch with the lowest total difference is considered the best match and is thus where the follower goes in the next frame. Here's the code for comparing the current patch against a candidate patch in the next frame:


// Compares the follower's current patch against a candidate patch in the next
// frame, offset by (OffsetX, OffsetY). CenterX/CenterY and RadiusX/RadiusY are
// fields of the follower class; CurrentBmp and NextBmp hold the two frames.
private int CompareRegions(int OffsetX, int OffsetY) {
    int X, Y, Diff;
    Color A, B;

    // Sample every 10th pixel in each direction (see the optimization notes below).
    const int ScanSpacing = 10;

    Diff = 0;

    for (Y = CenterY - RadiusY; Y <= CenterY + RadiusY; Y += ScanSpacing) {
        for (X = CenterX - RadiusX; X <= CenterX + RadiusX; X += ScanSpacing) {
            A = GetPixel(CurrentBmp, X, Y);
            B = GetPixel(NextBmp, X + OffsetX, Y + OffsetY);

            // Accumulate the absolute differences of the red, green, and blue
            // channels; a lower total means a better match.
            Diff +=
                Math.Abs(A.R - B.R) +
                Math.Abs(A.G - B.G) +
                Math.Abs(A.B - B.B);
        }
    }

    return Diff;
}

Assuming the above gibberish makes any sense, you may notice "Y += ScanSpacing" and the same for X. That's an optimization. In fact, the program does include a number of performance optimizations that help make the run-time on these processes more bearable. First, a follower doesn't consider all possible patches in the next frame to decide where to move. It only considers patches within a certain radius of the current location. OffsetX, for example, may only be +/- 50 pixels, which means if the subject matter in the video slides horizontally more than 50 pixels between frames, the algorithm won't work right. Still, this can increase frame processing rates 10-fold, with smaller search radii yielding shorter run-times.

As for "Y += ScanSpacing", that was a shot in the dark for me. I was finding frame processing was taking a very long time, still. So I figured, why not skip every Nth pixel in the patches during the patch comparison operation? I was surprised to find that even with ScanSpacing of 10 (with a patch of at least 60 pixels wide or tall), the follower didn't lose much of its ability to track the subject matter. Not surprisingly, the higher the scan spacing, the lower the fidelity, but the faster. Doubling ScanSpacing means a 4-fold increase in the frame processing rate.

I am inclined to think the process demonstrated in the above video is analogous to what our own eyes do. In any busy motion scene, I think your eyes engage in a very muscle-intensive process of locking in, moment by moment, on stable points of interest. In this case, the follower's fixation is chosen at random, essentially. Whatever is in the center becomes the fixation point. Still, the result is that our eyes can see the video, frame by frame, as part of a continuous, stable world. By fixating on some point while the view is in motion, whether on a television or looking out a car window, we get that more stable view.

Finally, one thought that kinda drives this research, but is really secondary to it, is that this could be a practical algorithm for video stabilization. In fact, I suspect the makers of video cameras are using it in their digital stabilization. It would be interesting to see someone create a freeware product or plug-in for video editing software because the value seems pretty obvious.

Thursday, September 27, 2007

"Conscious Realism" and "Multimodal User Interface" theories

I recently sent an email to Donald Hoffman, professor at the University of California, Irvine, with kudos for his book, Visual Intelligence, which has had a profound impact on my thinking about perception. Understandably, he's very busy kicking off the new school year, so I was grateful that he sent at least a brief response and a reference to his latest published paper, titled Conscious Realism and the Mind-Body Problem. Naturally, I was eager to read it.

Much of the study of how human consciousness arises stems from the assumption that consciousness is a product of physical processes in the brain. This paper starts from the opposite assumption: that "consciousness creates brain activity, and indeed creates all objects and properties of the physical world." When I read this in the abstract, I must have largely ignored its significance. Having read Visual Intelligence, I'm familiar with Hoffman's focus on how our minds construct the things we perceive, so I took this summary as shorthand for that concept of constructing the contents of consciousness. It becomes apparent, though, that the claim is far more literal than I had assumed.

Hoffman begins by explaining the ubiquity of a central assumption as follows. "A goal of perception is to match or approximate true properties of an objective physical environment. We call this the hypothesis of faithful depiction (HFD)." After giving lots of examples of this assumption and reasons why it's taken for granted, Hoffman declares his rejection of it:

    I now think HFD is false. Our perceptual systems do not try to approximate properties of an objective physical world. Moreover evolutionary considerations, properly understood, do not support HFD, but require its rejection.

Now, I'll state here that most of Hoffman's claims in this paper appear logically valid and, on the face of it, uncontested. But I would have to say that this one probably isn't logically supported: that evolutionary considerations require the rejection of HFD. By and large, however, this paper claims that it is not necessary to assume there is an objective physical world in order to study and understand consciousness, which seems acceptable.

The term "objective physical world" deserves some explanation. It identifies the view that there is a single reality that exists without regard to observers. If there is an apple on the table before two people, the apple really is there, whether either of them perceives it. Naturally, one would imagine that if one of them can see the apple, the other one probably can (barring obstructions), because both of them have access to information (e.g., light) reflected off the apple and into both their eyes. They may see different sides of the apple, but the apple is definitively there.

To be sure, one should not dismiss Hoffman as a fringe nut who claims there is no reality, per se; only people and their subjective consciousnesses. He doesn't claim that in this paper. In fact, he does appear to accept the assumption that there really is an objective reality, but holds that we don't have direct "access" to it. A classic example of this distinction is a detailed treatment of the table not as a solid object with straight-edged surfaces, but as a collection of atoms and, mostly, empty space, with rough, continuously changing surfaces. In this sense, there really isn't a table; that's just a percept (or concept) we use to refer to the collection of atoms.

To help illustrate the distinction between what one perceives and the subject matter of perception, Hoffman introduces the analogy of deleting a computer file by "dragging" a file icon and "dropping" it onto a trash can icon. This action is intuitive and designed specifically as an analogy of the actual file-delete operation, but it bears no resemblance to what actually goes on under the surface. In fact, even the icon is not equivalent to the file; it's merely a percept specifically designed to represent the file to the end user. By analogy, Hoffman refers to the table or the apple as merely "icons" we create in our minds to represent what most people would reflexively call "real objects". In fact, to the person who says, "no, the apple is just a bunch of atoms," Hoffman would in turn say, "the atoms are themselves icons we create."

Hoffman introduces the term "multimodal user interface", or "MUI", to summarize what consciousness is. In contrast to the view that perception is all about constructing a mental model that closely resembles reality, Hoffman claims perception is about constructing practical models that "get the job done". And just as computer designers might construct icon-based interfaces to help make it easier for humans to understand and practically manage information, our own minds set out to construct "practical" percepts in order to help us simplify what we do. But the mental models, Hoffman claims, need not bear any resemblance to what is being modeled.

To be sure, Hoffman may say the percepts -- mental models -- a conscious entity holds bear no resemblance to their referents, but he doesn't claim that there is no correlation with them. Hoffman says that user interfaces, including our own consciousnesses, by design have the following characteristics:

  • Friendly formatting
  • Concealed causality
  • Clued conduct
  • Ostensible objectivity

That is, a user interface's "purpose" is to distill immensely complex behaviors down to practical "icons" of objects and behaviors that stand for that underlying complexity, but don't literally mirror it. Take the file-delete example. The icon on the desktop is a sufficient stand-in for a file, even though the file, a pattern of magnetic fields on a metal platter, bears no resemblance to the icon. It's a "friendly format", in this sense. Further, the action of dragging and dropping it onto a trash can icon to "delete it" has its own causal chain, which conceals the true, deeply complex causal chain that actually happens to effect the file delete operation. Yet the drag-n-drop operation and the trash can icon give an intuitive clue of what will happen if something is dropped onto it. Finally, this drag-n-drop-to-delete operation is designed to consistently do the same thing every time, thereby engendering in the user an ostensible sense that there is an objective operation going on that will always happen, even though a moment's reflection tells us that a failure in the underlying software or hardware could cause something else to happen when one drops a file icon on the trash icon.

So far, I can see that there's a practical use for this notion to people trying to understand human perception or to engender consciousness in machines. For one, the claim is that percepts do not have to bear much resemblance to their referents in the "real world". They just have to have practical utility. An icon in a user interface just needs to be useful enough for the user to be aware of a file's existence and to do some basic stuff with it. Similarly, the mental percept an antelope has of a lion in the distance only needs to be useful enough to keep the antelope alive. It doesn't need to be a highly detailed representation of the lion beyond that basic utility. It also alludes to the view that a high fidelity representation in a computer of the "real world" doesn't make the machine that has it any more aware of what is represented. For instance, just because a self-driving car has a 3D map of the terrain out in front doesn't mean it can "see" where the road is. It's still necessary to create a practical model of how the world works that uses this 3D representation as source data, like an algorithm that seeks basically level ground, defined by a threshold of variation that separates level from non-level ground. If this were the message of the paper, I would say it adds genuine value: a set of concepts and terms to use to help steer people away from fallacious assumptions about how consciousness works and to suggest paths for further study.

But this isn't where the paper ends. It's more where it starts. In fact, this paper is less about explaining how consciousness works than about how reality works; it's metaphysics instead of epistemology. As stated earlier, it starts with the assumption that consciousness exists and that the subject of consciousness is optional. To avoid sounding like a total subjectivist, Hoffman states that:

    If your MUI functions properly, you should take its icons seriously, but not literally. The point of the icons is to inform your behavior in your niche. Creatures that don't take their well-adapted icons seriously have a pathetic habit of going extinct.

If Hoffman accepts the idea that there is a physical, objective reality, what is it composed of? "Conscious Realism asserts the following: The objective world, i.e., the world whose existence does not depend on the perceptions of a particular observer, consists entirely of conscious agents." Honestly, I would love to say that this claim is explained, but it really isn't. Hoffman claims that humans are not the only conscious agents, but doesn't say that tables, apples, and such are conscious, per se. "According to conscious realism, when I see a table, I interact with a system, or systems, of conscious agents," which really does seem to suggest that the table is conscious, but not clearly.

    Conscious realism is not panpsychism nor entails panpsychism. Panpsychism claims that all objects, from tables and chairs to the sun and moon, are themselves conscious (Hartshorne, 1937/1968; Whitehead, 1929/1979), or that many objects, such as trees and atoms, but perhaps not tables and chairs, are conscious (Griffin, 1998). Conscious realism, together with MUI theory, claims that tables and chairs are icons in the MUIs of conscious agents, and thus that they are conscious experiences of those agents. It does not claim, nor entail, that tables and chairs are conscious or conscious agents.

This is one of the problems I have with this paper, though. Although Hoffman rejects the notion of inanimate objects as conscious in a trippy, Disney cartoon sense, he doesn't really elaborate on what he does mean. Moreover, if a table is labeled as conscious in order to stick a placeholder for a physical object in the objective world, what value does this add over the simpler, more intuitive conception of the table as being a physical object? It almost seems as though, in order to come up with a rigorous, clean-cut, math-friendly theory of how consciousness constructs perceptions of the world, Hoffman throws the baby out with the bathwater by claiming that even though there is an objective world, it is not composed of actual objects.

I think if Hoffman were inclined to speak of "conscious realism" and "multimodal user interfaces" as tools and techniques for studying consciousness and guides to creating it, this could be a practical concept. He could say that our perceptions of reality really do reflect, if simplistically, abstractly, and practically, an actual, objective reality. By taking pains to say there isn't really one -- or that it is entirely disconnected from our ability to perceive it -- this paper seems to do something of a disservice to science:

    We want the same [approach] for all branches of science. For instance we want, where possible, to exhibit current laws of physics as projections of more general laws or dynamics of conscious agents. Some current laws of physics, or of other sciences, might be superseded or discarded as the science of conscious realism advances, but those that survive should be exhibited as limiting cases or projections of the more complete laws governing conscious agents and their MUIs.

While I can see that it is possible, perhaps, to express other branches of science in the terminology of MUIs, I don't see how it would advance our understanding of their subject matter. Gravity was well understood by Newton, yet expressing it in terms of the theory of General Relativity makes it possible to do more with the subject matter than was possible in the purely Newtonian framework. What new insights will the physicist have as a result of expressing gravity in terms of multimodal user interfaces and with reference to heavenly bodies as conscious entities? If anything, it sounds more like this extra layer would only add to the confusion people have in trying to understand already complex concepts and could even potentially take away certain practical conceptual tools. So I don't see the point.

All that said, the MUI concept does seem to add value to my own way of thinking of perception. The four functions of a good user interface listed above (friendly formatting, concealed causality, clued conduct, ostensible objectivity) seem to shout out how scientists trying to engender perception in machines should frame their goals and concepts. But the rest of Hoffman's paper, which dabbles in the philosophy of what reality is, seems to have little use for AI research.

Wednesday, July 4, 2007

Plan for video patch analysis study

I've done a lot of thinking about this idea of making a program that can characterize the motions of all parts of a video scene. Not surprisingly, I've concluded it's going to be a hard problem. But unlike other cases where I've smacked up against a brick wall, I can see what seems a clear path from here to there. It's just going to take a long time and a lot of steps. Here's an overview of my plan.

First, the goal. The most basic purpose is to, as I said above, make a program that can characterize the motions of all parts of a video scene. The program should be able to fill an entire scene with "patches". Each patch will lock onto the content found in that frame and follow it throughout the video or until it can no longer be tracked. So if one patch is planted over the eye of a person walking through the scene, the patch should be able to follow that eye for at least as long as it's visible. Achieving this goal will be valuable because it will provide a sort of representation of the contents of the scene as fluidly moving but persistent objects. This seems a cornerstone of generalized visual perception, which has been entirely lacking in the history of AI research.

One key principle for all of this research will be the goal of constructing stable, generic views, elaborated by Donald D. Hoffman in Visual Intelligence. The dynamics of individual patches will be very ambiguous. Favoring stable interpretations of the world will help patches to make smarter guesses, especially when some lines of evidence strongly suggest non-stable ones.

One obvious challenge is when a patch falls on a linear edge, like the side of a house, instead of a sharp point, like a roof peak. Even more challenging will be patches that fall on homogeneous textures, like grass, where independent tracking will be very difficult. It seems clear that an important key to the success of any single patch tracking its subject matter will be cooperating with its neighboring patches to get clues about what its own motion should be. Patches that follow sharp corners will have a high degree of confidence in their ability to follow their target content. Patches that follow edges will be less certain and will rely on higher confidence patches nearby to help them make good guesses. Patches that follow homogeneous textures will have very low confidence and will rely almost exclusively on higher confidence patches nearby to make reasonable guesses about how to follow their target content.

The algorithms for getting patches to cooperate will be a big challenge as it is. If the patches themselves aren't any good at following even strong points of interest, working on fabrics of patches will be a waste of time. Before any significant amount of time is spent on patch fabrics, I intend to focus attention on individual patches. A patch should be able to at least follow sharp points of interest. It should also be able to follow smooth edges laterally along the edge, like a buoy bobbing on water. Even this is a difficult challenge, though. Video of 3D scenes will include objects that move toward and away from the camera, so individual patches' target contents will sometimes shrink or expand. Nearby points of interest that look similar can confuse a patch if the target content is moving a lot. Changes in lighting and shadow from overcast trees, rotation, and so on will pose a huge challenge. Some of the strongest points of interest lie on outer edges of 3D objects. As such an object moves against its background, part of the patch's pattern will naturally change. The patch needs to be able to detect its content as an object edge and learn quickly to ignore the background movements.

It's apparent that solving each of these problems will require a lot of thought, coding, and testing. It's also apparent that these components may well work against each other. It's going to be important for the patch to be able to arbitrate differing opinions among the components about where to go at each moment. How best to arbitrate is a mystery to me at present. It seems logical, then, to begin my study by creating and testing the various analysis components of a single patch.

Once I have a better definition of the analysis tools a patch will have at its disposal for independent behavior, I should then have a toolkit of black boxes that an arbitration (and probably learning) algorithm can work with. Once I have a patch component that can do many analyses and come up with good guesses about the dynamics of its target content, I can move on to constructing "fabrics" of patches so the patches can rely on their neighbors for additional evidence. The individual patches, if they have a generic arbitration mechanism, can use additional information from neighbors as just more evidence to arbitrate with.

I have made a conscious choice this time not to worry about performance. If it takes a day to analyze a single frame of a video, that's fine. *shudder* Well, I probably will try to at least make my research tolerable, but the result of this will almost certainly not be practical for real-time processing of video using the equipment I have on hand. However, I believe that if I am successful at least in proving the concept I'm striving for and thus advancing research into visual perception in machines, other programmers will pick apart the algorithms and reproduce them in more efficient ways. Further, it is very clear to me that individual patches are so wonderfully self-contained that it will be possible to divvy out all the patches in a scene to as many processors as we can throw at the problem. This means that if one can make a patch fabric engine that processes one frame per second using a single processor, it should be fairly easy to make it process 30 frames per second with 30 processors.

I am also dispensing somewhat with the goal of mimicking human vision with this project. I do believe a lot of what I'm trying to do does go on in our visual systems. I don't have strong reason to believe, though, that we have little parts of our brains devoted to following patches wherever they will go as time passes. That doesn't seem to fit the fixed wiring of our brains very well. It may well be that we do patch following of a sort that lets the patch slide from neural patch to neural patch, which may imply some means of passing state information along those internal paths. I can hypothesize about that, but really, I don't know enough yet to say that this is literally what happens in the human visual system. I think it's enough to say that it could.

So that's my current plan of research for a while. I have to do this in such small bites that it's going to be a challenge keeping momentum. I just hope that I've broken the project up into small enough bites to make significant progress over the longer term.

Sunday, July 1, 2007

Patch mapping in video

Over the weekend, I had one of them epiphany thingies. Sometime last week, I had started up a new vision project involving patch matching. In the past, I've explored this idea with stereo vision and discovering textures. Also, I opined a bit on motion-based segmentation here a couple of years ago.

My goal in this new experiment was fairly modest: plant a point of interest (POI) on a video scene and see how well the program can track that POI from frame to frame. I took a snippet of a music video and captured 55 frames into separate JPEG files. Then I made a simple engine with a Sequence class to cache the video frames in memory and a PointOfInterest class, of which the Sequence object keeps a list, all busy following POIs. The algorithm for finding the same patch in the next frame is really simple: it just sums up the red, green, and blue pixel value differences in candidate patches and accepts the candidate with the lowest difference total; trivial, really. When I ran the algorithm with a carefully picked POI, I was stunned at how well it worked on the first try. I experimented with various POIs and different parameters and got a good sense of its limits and potential. It got me thinking a lot about how far this idea can be taken, though. Following is a sample video that illustrates what I experimented with. I explain more below. You may want to stop the video here and open it in a separate media player while you read on in the text.


Click here to open this WMV file

I specifically wanted to show both the bad and the good of my algorithm with the above video. After I played a lot with hand-selected POIs, I let the program pick POIs based on how "sharp" regions in the image are. I was impressed at how well my simple algorithm for that worked, too. As you can see, in the first frame, 20 POIs (green squares) are found at some fairly high-contrast parts of the image, like the runner's neck and the boulders near the horizon. As you watch the video loop, start by noticing how brilliantly the POIs on the right follow the video. The ones that start on the runner quickly go all over the place and "die" because they can no longer find their intended targets. Note the POIs in the rocks that get obscured by the runner's arm, though. They flash red as the arm goes by, but they pick up again as the arm uncovers them. Once a POI loses its target, it gets 3 more frames to try, during which it continues forward at the same velocity as before, and then it dies if it doesn't reacquire the target. Once the man's leg covers these POIs, you can see them fly off in a vain search for where their targets might have gone before they die.
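Here's a toy sketch of that "coast for a few frames, then die" behavior; the thresholds and names are invented for illustration and aren't lifted from the actual PointOfInterest class.

using System.Drawing;

class PointOfInterestSketch
{
    public PointF Position;          // where the POI currently is
    public PointF Velocity;          // its most recent frame-to-frame motion
    public bool Alive = true;
    int framesLost = 0;

    // Called once per frame with the best match found, or null if the match
    // was too poor (e.g., the target is hidden behind the runner's arm).
    public void Update(PointF? matchedPosition)
    {
        if (matchedPosition.HasValue)
        {
            Velocity = new PointF(matchedPosition.Value.X - Position.X,
                                  matchedPosition.Value.Y - Position.Y);
            Position = matchedPosition.Value;
            framesLost = 0;                     // back to normal (green)
        }
        else
        {
            // Coast along at the previous velocity (shown in red in the video)
            // for up to 3 frames, hoping the target reappears; then give up.
            Position = new PointF(Position.X + Velocity.X, Position.Y + Velocity.Y);
            framesLost++;
            if (framesLost > 3)
                Alive = false;
        }
    }
}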

I don't want to go into all the details of this particular program because I intend to take this to the next logical level and will make code available for that. I thought it useful just to show a cute video and perhaps mark this as a starting point with much bigger potential.

Although I thought of a bunch of ways in which I could use this, I want to indicate one in particular. My general goal in AI these days is to engender what I refer to as "perceptual level intelligence". I want to make it so machines can meaningfully and generally perceive the world. In this case, I'd like to build up software that can construct a 2D-ish perception of the contents of a video stream. My view is that typical real video contains enough information to discern foreground from background, and whole objects and their parts, as though they were layers drawn separately and layered together, as with old-fashioned cel-type animation. In fact, I think it's possible to do this without meaningfully recognizing the objects as people, rocks, etc.

I propose filling the first frame of a video with POI trackers like the ones in this video. The ones that have clearly distinguished targets would act like anchor points. Other neighbors that would be in more ambiguous areas -- like the sky or gravel in this example -- would rely more on those anchors, but would also "talk" to their neighbors to help correct themselves when errors creep in. In fact, it should be possible for POIs that become obscured by foreground objects to continue to be projected forward. In the example above, it should then be possible to take the resulting patches that are tagged as belonging to the background and reproduce a new video that does not include the runner! And then another video that, by subtracting out the established background, contains only the runner. This would be a good demonstration of segmenting background and foreground.

It should also be possible for these POIs to get better and better at predicting where they will go by introducing certain learning algorithms. In fact, it's possible the POI algorithm could actually start off naive and come to learn how to properly behave on its own.

The key to both this latter dramatic feat and the earlier goals is an idea I gleaned from Donald D. Hoffman's Visual Intelligence. One idea he promotes repeatedly in this book is the importance of "stable" interpretations of visual scenes. His book deals primarily with static images, but the idea is powerful. Here's an example of what I mean. Watch the gravel in the video above. Naturally, gravel that is lower in the video is closer to you and thus slides by faster than gravel higher up and thus farther away. Ideally, POI patches following this gravel would move smoothly, with the higher patches sliding slowly and the lower ones sliding more quickly. (To be sure, this video would have to be normalized to correct for the camera being so jumpy.) If one patch in this "stream" of flow were to decide it should suddenly jut up several pixels while its neighbors are all slowly drifting to the left, that would not fit a "stable" interpretation of the patch being part of a larger whole or of it following a smooth path at a fairly consistent pace. We assume the world rarely has sudden changes and thus prefer these smooth continuations.
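As a concrete illustration of what a "stability" measure might look like, here is a small Python sketch that scores how far one POI's motion departs from the average motion of its neighbors. The neighbor radius and the idea of simple averaging are my own simplifications, and the POI objects are assumed to carry positions and velocities like the sketch earlier in this post.

    import math

    def neighbors_of(poi, all_pois, radius=40):
        """POIs within a fixed pixel radius, excluding the POI itself."""
        return [p for p in all_pois
                if p is not poi and math.hypot(p.x - poi.x, p.y - poi.y) <= radius]

    def instability(poi, all_pois):
        """How far this patch's velocity departs from the local average.
        A high value suggests the patch is not moving as part of a smooth,
        larger whole -- the opposite of a "stable" interpretation."""
        nearby = neighbors_of(poi, all_pois)
        if not nearby:
            return 0.0
        avg_vx = sum(p.vx for p in nearby) / len(nearby)
        avg_vy = sum(p.vy for p in nearby) / len(nearby)
        return math.hypot(poi.vx - avg_vx, poi.vy - avg_vy)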

In chapter 6 of Visual Intelligence, Hoffman addresses motion specifically and, while he doesn't talk about patch processing like this, does introduce a number of interesting rules for perception. Here are some that apply here:

  • Rule 29. Create the simplest possible motions.
  • Rule 30. When making motion, construct as few objects as possible, and conserve them as much as possible.
  • Rule 31. Construct motion to be as uniform over space as possible.
  • Rule 32. Construct the smoothest velocity field.

The idea of stable interpretations can also come into play with POIs that follow the boundaries of foreground objects, like the runner in this example. My POIs failed to follow in part because, while the "inside" part of the patch was associated with the man's head, for example, the "outside" was associated with the background, which keeps changing as the head moves forward in space. In fact, the "outside" (background) part of such a POI should generally be "unstable", while the "inside" (foreground) stays stable. That assumption of instability of the background, as it is constantly obscured or uncovered by the foreground, is a rule that should be helpful both in getting POIs to track these edges and in detecting those edges in the first place, and thus in segmenting foreground objects from background ones.

As for patches learning to make predictions autonomously, here's where the concept of stable interpretations really shines. The goal of the learning process should be to produce a POI algorithm that forms the most stable interpretations of the world. So when comparing two possible algorithmic changes -- perhaps using a genetic algorithm -- the fitness function would be stability itself. That is, the fitness function would measure the fidelity of the matches, how well each POI sticks with its neighbors, how well it finds foreground / background interfaces (against human-defined standards, perhaps), and so on.
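To make that less abstract, here is a hedged sketch of such a fitness function, reusing the POI and instability sketches from earlier in this post. The particular components and the 0.5 weighting are my own guesses at what "stability" might include, not a worked-out proposal.

    def stability_fitness(pois_per_frame):
        """pois_per_frame: a list, one entry per frame, of the POI objects a
        candidate tracking algorithm produced for that frame. Higher scores
        mean a more stable interpretation of the clip."""
        score = 0.0
        for pois in pois_per_frame:
            alive = [p for p in pois if p.alive]
            # Reward targets that are still being tracked...
            score += len(alive)
            # ...and penalize patches that jitter relative to their neighbors.
            score -= 0.5 * sum(instability(p, alive) for p in alive)
        return score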

There's so much more that could be said on this topic, but my blogging hand needs a break.

Wednesday, June 27, 2007

Emotional and moral tagging of percepts and concepts

Back in April, I suffered head trauma that almost killed me and landed me in the hospital for, thankfully, only a day. My wife, the sweet prankster that she is, went to a newsstand and got me copies of Scientific American Mind and Discover Presents: The Brain, an Owner's Manual (a one-off, not a periodical). The former had a picture of a woman with the upper portion of her head as a hamburger and the latter a picture of a head with its skullcap removed revealing the brain. So I got a good laugh and some interesting reading.

I'm reading an article now in The Brain titled "Conflict". The basic position author Carl Zimmer offers is encapsulated in the subtitle: morality may be hardwired into our brains by evolution. In my opinion, there is some merit to this idea, but I don't subscribe wholeheartedly to all of what the article promotes. Zimmer argues that the parts of our brains that respond emotionally to moral dilemmas are different from the parts that respond rationally and that, in fact, the emotional responses often happen faster than the intellectual ones. He further contends that our moral judgments come out of these more primitive, instant emotional responses. I have thought this as well, but not for the reason Zimmer proffers: that moral reasoning is automatic and built in.

I'd agree that, yes, we are reacting automatically and almost instantly, emotionally and moralistically, before we start seriously analyzing a moral question. But I would argue that it's because one's "moral compass" is programmable, but largely knee-jerk. Most humans may be born with some basic moral elements, like empathy and a desire to not see or let other people suffer. But we can readily reprogram this mechanism to respond instantly to things evolution obviously didn't plan for. For example, most Americans recognize the danger smoking poses to health. So smoking around other people comes with an understanding that it's a danger to their health, and often without their consenting to the risks. That knowledge quickly becomes associated with the "second-hand smoke" concept. I would argue that people with this knowledge instantly respond emotionally and moralistically when the subject of second-hand smoking comes up, regardless of the content of the conversation in which it's referenced. Even before the sentence is completely uttered, the moral judgments and emotional indignation are kicking in in the listener's mind. Why is this?

The article just prior to this one, by Steven Johnson and titled "Fear", points out that the amygdala is activated when the brain is responding to "fear conditioning", as when a rat is trained to associate a sound tone with electric shock.

Johnson cites a fascinating case of a woman who suffered a tragic loss of short-term memory. Her doctor could leave for 15 minutes and return, and the woman would not recognize him or recall having any history or relation to him. Each time they met, he would shake her hand as part of the greeting ritual. One day, he concealed a tack in his hand when he went to shake hers. After that, while she still did not recognize the doctor in any conscious way, she no longer wished to shake his hand. In experiments with rats, researchers found that removing the part of the neocortex that remembers events did not stop the rats from continuing to respond to fear conditioning. On the other hand, removing the amygdala did seem to take away the automatic fear reaction they had learned, even though they could still remember the events associated with their fear conditioning.

Johnson leaves open the question of whether the amygdala is actually storing memories of events for later responses versus simply being a way of "tagging" memories stored in other parts of the brain. My opinion is that tagging makes more sense. Imagine some part of your cortex stores the salient facts associated with some historical event that was traumatic. If the amygdala has connections to that portion of the cortex, they could be strengthened in such a way that anything that triggers memories of that event would also activate the amygdala via that strong link. If the amygdala is really just a part of the brain that kicks off the emotional responses the body and mind undergo, this seems a really simple mechanism for connecting thoughts with emotions.

In the hypothetical example I gave earlier, there could be a strong link between the "second-hand smoke" concept and the amygdala (or some other part of the brain associated with anger). So anything that activates those neurons would also trigger an instant emotional response that would become part of the context of the conversation or event.

I would propose the inclusion of this sort of "tagging" of the contents of consciousness (or even subconsciousness) for just about any broad AI research project. Strong emotions tend to be important in mediating learning. We remember things that evoke strong emotions, after all, and more easily forget things that don't. That has implications for learning algorithms. But conversely, memories of just about any sort in an intelligent machine could come with emotional tags that help to set the machine's "emotional state", even when that low-level response seems incongruous with the larger context. For example, a statement like "we are eliminating second-hand smoke here by banning smoking in this office" might be intended to make a non-smoker happy, but the "second-hand smoke" concept, by simply being invoked, might instantly add a small anger component to the emotional soup of the listener. That way, when the mind recognizes that the statement is about a remedy, the value of the remedy is recognized as proportional to the anger engendered by the problem.
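Here is a minimal sketch of what that kind of tagging might look like in code. The concepts, emotions, tag strengths, and decay rate are all invented for illustration; the point is only that invoking a concept mixes its tags into a running emotional state, regardless of the surrounding context.

    from collections import defaultdict

    # Hypothetical tag strengths attached to learned concepts.
    concept_tags = {
        "second-hand smoke": {"anger": 0.3, "fear": 0.2},
        "smoking ban":       {"relief": 0.4},
    }

    emotional_state = defaultdict(float)

    def activate(concept, decay=0.9):
        """Invoke a concept: let the current mood fade slightly, then stir in
        the concept's emotional tags."""
        for emotion in list(emotional_state):
            emotional_state[emotion] *= decay
        for emotion, strength in concept_tags.get(concept, {}).items():
            emotional_state[emotion] += strength

    # "We are eliminating second-hand smoke here by banning smoking in this office."
    for concept in ["second-hand smoke", "smoking ban"]:
        activate(concept)
    print(dict(emotional_state))   # a little anger and fear mixed in with the relief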

Although I haven't talked much about moralistic tagging, per se, I guess I'm assuming that there is a strong relationship between how we respond emotionally to things and how we view their moral content. To be sure, I'm not suggesting that one's ethical judgments always (or should always) jibe with one's knee-jerk emotional reactions to things. Still, it seems this is somewhat a default for us, and not a bad starting point for thinking about how to relate moral thinking to rational thinking in machines.

Being able to tag any particular percepts or concepts learned (or even given a priori) may sound circular, mainly because it is. Emotions beget emotions, as it were. But there are obvious bootstraps. If a robot is given "pain sensors" to, say, detect damage or potential damage, that could be a source of emotional fear and / or anger.

These emotions, in addition to affecting short-term planning, could also be saved with the memory of a damage event and even any other perceptual input (e.g., location in the world or smells) available during that event. Later, recalling the event or detecting or thinking about any of those related percepts could trigger the very same emotions, thus affecting whatever else is the subject of consideration, including affecting its emotional tagging. In this way, the emotions associated with a bad event could propagate through many different facets of the machine's knowledge and "life". This may sound like random chaos -- like tracking mud into a room and having other feet track that mud into other rooms -- but I would expect there to be natural connections from state to state, provided the machine is not prone to random thinking without reason. I think putting "tracers" in such a process and seeing what thoughts become "infected" would be fascinating fodder for study.

Friday, June 22, 2007

A hypothetical blob-based vision system

As often happens, I was talking with my wife earlier this evening about AI. For a non-programmer, she's an incredible sport about it and really sharp in her grasp of these often arcane ideas.

Because of some questions she was asking, I thought it worthwhile to explain the basics of classifier systems. Without going into detail here, one way of summarizing them is to imagine representing knowledge of different kinds of things in terms of comparable features. She's a "foodie", so I gave the example of classifying cookies. As an engineer, you might come up with a long list of the things that define cookies, especially ones that can be compared among lots of cookies, like "includes eggs" or a degree of homogeneity from 0 - 100%. Then you describe each kind of cookie in terms of all these characteristics and measures. Some cookie types will have a "not applicable" or "don't care" value for some of these characteristics. So when confronted with an object that has a particular set of characteristics, it's pretty easy to figure out which candidate object types best fit this new object and thus come up with a best guess. One could even add learning algorithms and such to deal with genuinely novel kinds of objects.
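For readers who want something concrete, here is a tiny Python sketch of that kind of classifier. The cookie types, features, and values are made up; the one structural point worth noticing is how a "don't care" value simply drops a feature out of the comparison.

    cookie_types = {
        "chocolate chip": {"includes_eggs": 1.0, "homogeneity": 0.3, "diameter_cm": 7.0},
        "macaron":        {"includes_eggs": 1.0, "homogeneity": 0.9, "diameter_cm": 4.0},
        "shortbread":     {"includes_eggs": 0.0, "homogeneity": 0.8, "diameter_cm": None},  # don't care
    }

    def classify(observed):
        """Return the cookie type whose known features best match the observation."""
        best_type, best_score = None, float("-inf")
        for name, prototype in cookie_types.items():
            score = 0.0
            for feature, expected in prototype.items():
                if expected is None or feature not in observed:
                    continue    # "don't care" features neither help nor hurt
                score -= abs(observed[feature] - expected)
            if score > best_score:
                best_type, best_score = name, score
        return best_type

    print(classify({"includes_eggs": 1.0, "homogeneity": 0.85, "diameter_cm": 4.2}))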

I explained classifier systems to my wife in part to show that they are incomplete. Where does the list of characteristics of the cookie in question come from? It's not that a classifier isn't useful, but that it lacks the thing almost every AI system made to date lacks: a decent perceptual faculty. Such a system could have cameras, chemical analyzers, crush sensors, and all sorts of things to generate raw data, and that might give us enough characteristics to classify cookies. But what happens when the cookie is on a table full of food? How do we even find it? AI researchers have been taking the cookie off the table and putting it on the lab bench for their machines to study for decades, and it's a cheap half-solution.

Ronda naturally asked whether it would be possible to have the machine come up with the fields in the "vectors" -- I prefer to think in terms of matrices or database tables -- on its own, instead of having an engineer hand-craft those fields. Clever. Of course, I've thought about that, and other AI researchers have gone there before. We took face recognition as a new example. I explained how engineers define key points on faces, craft algorithms to find them, and then build a vector of numbers that represents the relationships among those points as found in pictures of faces. The vector can then be used in a classifier system. OK, that's the same as before. So I imagined the engineer instead coming up with an algorithm to look for potential key points in a set of pictures of 100 people's faces. It could then see which points appear in many or most faces and throw away all the others. The end result could be a map of key points that are comparable. Those are the fields in the table. OK. So a program can define both the comparable features of faces and then classify all the faces it has pictures of. Pretty cool.
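A rough Python sketch of that discovery step might look like the following. The candidate-point detector is left as a parameter because nothing here specifies one, and the grid resolution and the "appears in 60% of faces" cutoff are my own placeholders.

    from collections import Counter

    def learn_keypoint_fields(face_images, detect_candidate_points,
                              grid=20, min_fraction=0.6):
        """Bucket candidate key points onto a coarse grid (coordinates are
        assumed normalized to 0..1) and keep only the grid cells hit in at
        least min_fraction of the faces. The surviving cells become the
        comparable "fields" of the feature vector."""
        counts = Counter()
        for image in face_images:
            cells = {(int(x * grid), int(y * grid))
                     for x, y in detect_candidate_points(image)}
            counts.update(cells)
        return sorted(cell for cell, c in counts.items()
                      if c >= min_fraction * len(face_images))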

But then, there's that magic step, again. We had 100 people sit in a well-lit studio and had them all face forward, take off their hats and shades, and so on. We spoon fed our program the data and it works great. Yay. But what about the real world? What about when I want to find and classify faces in photographs taken at Disneyland? That's a new problem and starts to bring up the perception question all over again.

At some point, as we were talking over all this, I put the question: let's say your practical goal for a system is to be able to pick out certain known objects in a visual scene and keep track of them as they move around. How can you do this? I was reminded of the brilliant observations Donald D. Hoffman laid out in his Visual Intelligence book, which I reviewed on 5/11/2005. Among other things, Hoffman observed that, given a simple drawing representing an outline of an object, it seems we look for "saddle points" and draw imaginary lines to connect them and end up with lots of simpler "blob" shapes. I went further to suggest that this could be a way to segment a complex shape in such a way that it can be represented by a set of ellipses. The figure below shows a simple example:



I drew a similar outline in a sandbox at a playground we were walking by and asked her to segment it using these fairly simple rules. Naturally, she got the concept easily. From there, we asked how you could get to the clean line drawings to do the segmenting. After all, vision researchers have been banging their heads against the wall trying to come up with clean segmentation algorithms like this for decades.

I described the most common trick in vision researchers' arsenal: searching static images for sharp contrasts and approximating lines and curves along them. Not surprisingly, these don't often yield closed loops. That's why I had experimented with growing "bubbles" (see my blog entry and project site) to ensure that there were always closed loops, on the assumption that they would be easier to analyze later than disconnected lines. Following is an illustration:



I found that somewhat unsatisfying because it relies very much on smooth textures, whereas life is full of more complicated textures that we naturally perceive as continuous surfaces. So we batted around a similar idea in which we could imagine "planting" small circles on the image and growing them so long as the image included within the circle is reasonably homogeneous, from a texture perspective. Scientists are still struggling to understand how it is we perceive textures and how to pick them out. I like the idea of simply averaging out pixel colors in a sample patch to compare that to other such patches and, when the colors are sufficiently similar, assume they have the same texture. Not a bad starting point. So imagine segmenting a source image into a bunch of ellipses, where each ellipse contains as large a patch of one single texture as reasonably possible. Why bother?
These ellipses -- we'll call them "blobs" for now -- carry usable information. We switched gears and used hand tools as our example. Let's say we want to learn to recognize hammers and wrenches and such and be able to tell one from another, even when there are variations in their designs. Can we get geometric information to jibe with the very one-dimensional nature of databases and algebraic scoring functions? Yes. Our blobs have metrics. Each blob has X / Y coordinates and a surface area, which we'll call its "weight". So maybe in our early experiments, we write algorithms to learn how to describe objects' shapes in terms of blobs, like so:



Step 3 is interesting, in that it involves a somewhat computation-heavy analysis of the blobs to see how we can group bunches of small blobs into "parts", so we can describe our tools in terms of parts; especially if those parts can be found on other tools. In step 4, we use some algorithm to rotate the image (and blobs and parts) into some well-defined "upright" orientation and stretch it all out to fit a fixed-size box, which makes it easier to compare against other objects, regardless of their sizes and orientations. In step 5, we look for connections among blobs to help show how they are related. Now, all of these steps are somewhat fictional. They're easy to draw on paper and hard to code. Still, let's imagine we come up with something that basically works for each.

Now, when we see other tools laid out on our bench, we can do the same sorts of analyses and ultimately store the abstract representations we come up with. Perhaps for each object, we store a representation of its parts. One would be picked -- perhaps the center-most -- as the "root" and all the other parts would be available via links to their information in memory. Walking through an object definition would be like following links on web pages. Each part could be described in terms of its smaller parts, and, ultimately, blobs. Information like the number, weights, and relative positions or orientations of blobs and parts to one another can be stored and later compared with those of other candidate objects.
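To make the storage idea a little more concrete, here is a minimal Python sketch of the sort of representation described: blobs with a position and weight, grouped into parts, with one part chosen as the root and the others reachable by links. The class names and the toy wrench are mine, not a spec.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Blob:
        x: float          # center position, normalized to the object's bounding box
        y: float
        weight: float     # surface area of the ellipse

    @dataclass
    class Part:
        name: str
        blobs: List[Blob]
        links: List["Part"] = field(default_factory=list)   # related / neighboring parts

    @dataclass
    class ObjectModel:
        name: str
        root: Part        # e.g., the center-most part; others reachable via links

    # A toy wrench: a head part linked to a handle part.
    head   = Part("head",   [Blob(0.5, 0.1, 0.08), Blob(0.55, 0.2, 0.05)])
    handle = Part("handle", [Blob(0.5, 0.6, 0.30)])
    head.links.append(handle)
    wrench = ObjectModel("wrench", root=head)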

Now here's where things can get interesting. The next step could be to take our now-learned software out into a "real world" environment. Maybe we give it a photograph of the wrench in a busy scene. We segment the entire scene into blobs, as before. But this time, we do an exhaustive search of all combinations of blobs against all known objects' descriptions.

At this point, the veteran programmer has the shakes over the computation time required for all this. Get over it and pretend other engineers work on optimizing it all later. And besides, we have an infinitely fast computer in our thought experiment; something every AI researcher could use.

It starts seeming like we can actually do this; like we can have a system that is capable of actually perceiving hand tools in a busy scene. Maybe our next step is to feed video to the program, where a camera pans across the busy scene. This time, instead of our program treating each individual frame as a whole new scene, we start with the assumption of object persistence. In frame 1, we found the wrench. In frame 2, we search for the wrench immediately at the same place. When we found the wrench in frame 1, we worked back down to the source image and picked out the part of the bitmap strongly associated with the wrench, so now we try a literal bitmap match in frame 2 around the area where it was in frame 1. Sure enough, we find it, perhaps just a little to the right. We assume it's the same wrench. So now, we've saved a lot of computation by doing more of a "patch match" algorithm.
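A hedged Python sketch of that shortcut: try a cheap literal bitmap match near last frame's position first, and only fall back to the expensive whole-scene blob search if the match is poor. The frame format, thresholds, search radius, and the fallback function are all placeholders, not anything from a real implementation, and bounds checking is omitted.

    def patch_diff(a, b, ax, ay, bx, by, w, h):
        """Summed absolute RGB difference between two w x h patches; frames
        are grids of (r, g, b) tuples."""
        return sum(abs(ca - cb)
                   for dy in range(h) for dx in range(w)
                   for ca, cb in zip(a[ay + dy][ax + dx], b[by + dy][bx + dx]))

    def track_by_persistence(frame1, frame2, box, full_blob_search,
                             threshold=5000, search=12):
        """box = (x, y, w, h) of the object's bitmap in frame1."""
        x, y, w, h = box
        best_pos, best_diff = None, float("inf")
        for cy in range(y - search, y + search + 1):
            for cx in range(x - search, x + search + 1):
                d = patch_diff(frame1, frame2, x, y, cx, cy, w, h)
                if d < best_diff:
                    best_pos, best_diff = (cx, cy), d
        if best_diff <= threshold:
            return best_pos                 # object persisted; cheap path
        return full_blob_search(frame2)     # fall back to full re-recognition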

Now we not only have our object isolated, but we also now have information about its movement in time and can make a prediction about where it might be in frame 3. Maybe in frame 1, we found 2 wrenches and 1 hammer. Maybe as we track each one's movement from frame to frame, we look to see if it's all consistent in such a way that suggests maybe the camera is moving or that they are all on the same table or otherwise meaningfully related to one another in their dynamics. New objects might be discovered, as well, using "learning while performing" algorithms like I described in a recent blog entry. So much potential is opened up.

I don't mean to suggest this is exactly how a visual perception algorithm should work. I just loved the thought experiment and how it showed how engineers could genuinely craft a system that can truly perceive things. And it illustrates a lot of features I consider highly valuable, like learning, pattern invariance, geometric knowledge, hierarchic segmentation of objects into "parts", bottom-up and top-down processes to refine percepts, object permanence, and so on.

Now, about the code. I'll have to get back to you on that.

Saturday, April 21, 2007

Abstraction in neuron banks

[Audio Version]

On an exhilarating walk with my wife, we discussed how to build on the lessons I learned from my Pattern Sniffer project and its "neuron bank", documented in my previous blog entry. There are loads of things to do, and it was not obvious how to squeeze more value out of what little I've done so far. But it finally became apparent.

One thing that I was not happy about with Pattern Sniffer is that the world it perceives is "pure". There is just one pattern to perceive at a time. The world we perceive is rarely like this. As I walk along, I hear a bird singing, a car, and a lawn mower at the same time and am aware of each, separately. Clearly, there is lots of raw information overlap, yet I'm able to filter these things out and be aware of all three at once. Pattern Sniffer could see two things going on in its tiny 5 x 5 pixel visual field, but it would see them as a single pattern. This is the kind of sterile world so many AI systems live in because the experimenters don't know how to rise above this problem. Yet rising above is a requirement if we want to be able to get machines that can exist at the "perceptual level", and not just the "sensory level" of intelligence.

I said in my previous blog entry that my neurons' dendrites had a "care" property, but that I hadn't made use of it yet. My vision was that this would play an important role in recognizing patterns in a more abstract way, but I didn't know how yet. I still need to do the work and document my results, but I wanted to record some of the thoughts we came up with that I can now practically explore.

As we walked, I pointed at a car and explained that somehow, I'm able to "mask out" all the not-car parts of the scene and focus only on the car part. It's very hard to explain what that means, but I tried to relate it in terms of my neuron banks. Consider the "left bar" pattern:


"Left Bar" pattern.


What if we had a neuron in a bank that could recognize this pattern? Now let's say I have another neuron that's a copy of it, save for one thing: each dendrite that expects white pixels doesn't actually care what's in the white area. We'll represent "don't care" pixels (dendrites) with blue diagonal stripes, like so:


"Left Bar" pattern with white pixels replaced by "don't care" pixels.


In this case, I'm assuming the "care" property is a numeric value from 0 (don't care) to 1 (care very much) that gets multiplied into the strength of the match on that dendrite, which in turn contributes to the total match score for the neuron. Now let's say the neuron bank is confronted by a perfect left bar pattern. Clearly, the neuron with the "solid" left bar pattern, with all dendrites having care = 1, will get a stronger match than the neuron with the "masked" version of the left bar pattern, because the don't-care dendrites will not contribute positively to the match score. So if only one neuron gets to "win" this matching game, the neuron with the solid left bar pattern will always win.


An exact match trumps a masked match.
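Here is a minimal Python sketch of that scoring idea (the actual Pattern Sniffer code is VB.NET and handles matching a bit differently, as described in the entry below). Inputs and expectations run from -1 to +1, and a care of 0 makes a dendrite contribute nothing either way.

    def dendrite_match(inp, expectation, care):
        """Per-dendrite match in roughly [-1, 1], scaled by care in [0, 1]."""
        return care * (1.0 - abs(inp - expectation))

    def neuron_match(inputs, expectations, cares):
        """Average the care-weighted dendrite matches into one neuron score."""
        scores = [dendrite_match(i, e, c)
                  for i, e, c in zip(inputs, expectations, cares)]
        return sum(scores) / len(scores)

With this scoring, a perfect left bar favors the solid-pattern neuron, while an "L" input penalizes the solid left bar's mismatched dendrites and leaves the masked pattern's don't-care dendrites untouched, which is exactly the behavior illustrated in the figures of this entry.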


But now let's say we showed our neuron bank an "L" shaped pattern. The "masked" left bar pattern is going to fare better than the "solid" left bar, like so:


The "don't care" pixels don't get penalized by the "lower bar" part.


Now let's say we also had "bottom bar" neurons matching both the solid and masked versions of that pattern. Things get interesting with the "L" pattern. Let's say we even have a neuron that has learned the solid "L" pattern. The following illustrates these variations:


The "L" neuron has the best match, followed by the masked left and bottom bar.


OK, so if we have a neuron that already has a strong match of the "L" pattern, what good are the masked left and bottom bar? Here's where having a neuron hierarchy comes in handy. If we are regularly seeing left bars, bottom bars, and L patterns, a higher level neuron bank could potentially see that the masked-pattern neurons match more things than the solid-pattern neurons do and thus find them to be more generally useful than the specific-pattern neurons. It could then reward them by encouraging them to gain confidence, even though they are not the best matches.

One thing my current neuron banks assume is that there is only one single best match and that only that one neuron gets rewarded for matching a pattern, while all the others may in fact be penalized. Yet this doesn't seem to fit how our brains work, at some level. Remember: I said I can hear and be aware of a bird singing, a car, and a lawn mower at the same time. That's what I want my software to do, too. See, if we're regularly seeing left bars and bottom bars, it may be that when we see an "L" in the input, it's actually just a left bar and a bottom bar, seen together. That's another interpretation.

Being able to explain the total input in terms of multiple perceived stimuli must be more "satisfying" to certain parts of our brains than alternative explanations that see the input as all part of a single cause that is not currently known. Being able to engender this could bring a machine a lot closer to the perceptual level of intelligence.

So that's what I'm probably going to study next. One challenge will be figuring out how to deal with allowing multiple neurons to be rewarded for doing the right thing in a given moment without encouraging neurons to learn redundant information. We'll see.

Thursday, April 12, 2007

Pattern Sniffer: a demonstration of neural learning

[Audio Version]


Introduction

For over a year, I've been nursing what I believe is a somewhat novel concept in AI that superficially resembles a neural network and was inspired by my read of Jeff Hawkins' On Intelligence. Recently, I finally got around to writing code to explore it. I was so surprised by how well it already works that I thought it worthwhile to write a blog entry introducing the concept and to make my source code and test program public for independent review. For lack of putting any real thought into it, I just named the project / program "Pattern Sniffer".

My regular readers will recognize my frequent disdain for traditional artificial neural networks (ANNs), not only because they do not strike me as being anything like the ones in "real" brains, but also because they seem to fail miserably at displaying anything like "intelligent" behavior. So it's with reluctance that I call this a neural network. The test program I made has only one "layer" of neurons, which I call a "neuron bank"; I did not wish, yet, to demonstrate a hierarchy and multi-level abstraction. My main goal was to focus on a very narrow but almost completely overlooked topic in artificial intelligence: unguided learning.

Unguided learning

All artificial neural networks I have ever seen or read about rely on a so-called "training phase", where they are exposed to examples of the patterns they are supposed to recognize before they are ever put out into the "real world". I was disappointed when I finally read how Numenta's Hierarchical Temporal Memories (HTMs) undergo the same sort of learning process before they can begin recognizing things in the world. This flies in the face of how humans and other mammals and, indeed, all creatures on Earth that are capable of learning actually work.

Does intelligence require that an intelligent being continue to learn once it enters a productive life? I think the answer is obviously "yes". It's tempting to think humans do most of their learning early on, as in their school years, and spend most of their lives in a basic "production" mode. Yet I would argue that every moment we are awake, we are learning things. Most of it is quickly forgotten. We use the terms "short term memory" and "working memory" to describe this, which seems to suggest we have something like computer RAM, while the real long-term memory is packed away onto a hard drive.

I'm no expert in neurobiology, so I may be missing some important information. But the idea of information being transferred in packages of data from one part of the brain to another for long term storage doesn't seem to jibe with my limited understanding of how our brains work. Why, for example, should learning a phone number long enough to dial it occur in one part of the brain while learning it for long term use, like with our own home numbers? And how would it be transferred?

What if it's the same part of the brain learning that phone number, whether for short or long term usage? Perhaps the part of my brain that is most directly associated with remembering phone numbers has some neurons that have learned some important phone numbers and will remember them for life, while it contains other neurons that have not learned any phone numbers and are just eagerly awaiting exposure to new ones that may be learned for a few seconds, a few minutes, or a few years.

Finite resources

We are constantly learning. Yet we have a finite amount of brain matter. Somehow we must have some mechanism for deciding which of the information we are exposed to is important enough to retain long term and which is only worth retaining for a moment.

When I studied how Numenta's HTMs learn, I was a bit disappointed to see that, while there is a finite and predetermined number of nodes in an HTM, the amount of memory required for one is variable. This is like many other kinds of classifier systems and other learning algorithms. This does make some sense from an engineering perspective, but it does not seem to fit what I understand of how our brains work. Our neurons may change the number and arrangement of dendritic connections, but it's a far cry from keeping a long list of learned things inside. So far, it seems ANNs are one of the only classes of learning systems out there that do use a finite and predefined amount of memory in learning and functioning.

I believe that, for some functional chunk of cortical tissue, there is a fixed number and basic arrangement of neurons and they all are doing basically the same thing, like learning, recognizing, and reciting phone numbers. It seems intuitive to believe that that chunk has its own way of deciding how to allocate its neurons to various numbers, with some being locked down, long term, and others open to learning new ones immediately for short term use. Any one of these may also eventually become locked down for the long term, too.

I also believe it's possible, though not certain, that some neurons that have learned information for the long term may occasionally have that information decay and be freed up to learn new things.

Competing to be useful

When I started thinking about banks of neurons working in this way, I naturally asked the question: how does the brain decide what is important to learn and how long to retain it? It then occurred to me that there may be some kind of competition going on. What if most of the neurons in the cortex "want" more than anything to be useful? What if they are all competing to be the most useful neuron in the entire brain?

Let's start with the assumption that all neurons in a neuron bank all have access to the same input data. And let's say each neuron wishes to be most useful by learning some important piece of information. You would think that the first problem to arise would be that they would all learn the exact same piece of information and thus be redundant. But what if, when one neuron learns a piece of information, the others could be steered away from learning the same information? What if every neuron was hungry to learn, but also eager to be unique among its peers in what it knows?

But how could one neuron know what its peers know? Would that require an outside arbiter? An executive function, perhaps? Not necessarily. It's possible that each neuron, when it considers the current state of the input, decides how closely that input matches the pattern it has learned and "shouts out" how strongly it considers the input to match its expectation. The other neurons in the bank could each be watching to see which neuron shouts the loudest and assume that neuron is the most likely match. Actually, it could be enough to know the loudest shout and not which neuron did the shouting.


Confidence

The idea that every neuron in a bank reports to the group how well it thinks it matches the input is powerful. It follows, then, that the neuron that shouts the loudest would pat itself on the back by becoming more "confident" in its knowledge and thus reinforce what it knows. Conversely, all the other neurons would become no more confident and perhaps even less so with each passing moment that they go unused.

Confidence breeds stasis. In this case, that's ideal. What if some neurons in a bank were highly confident in what they know and others were very unconfident? Those that have low confidence should be busy looking for patterns to learn. In a rich environment, there will be a nearly limitless variety of new patterns that such neurons could learn. There are several ways a brain could decide that some piece of information is important. One is simple repetition. When you want to remember someone's name, you probably repeat it in your mind several times to help reinforce it. And in school, repetition is key to learning. So it could be that individual neurons of low confidence gain confidence when they latch onto some new pattern and see it repeated. Repetition suggests non-randomness and hence a natural sort of significance.

What if, as a neuron becomes more confident, it becomes less likely to change its expectation of what pattern it will match? What if confidence is itself a moderator of a neuron's flexibility in learning new patterns?


The simulation

Armed with this hypothesis, I set out to make a program called "Pattern Sniffer" to simulate a bank of neurons operating in this way and to test its viability. My goal, to be sure, is not to replicate human neocortical tissue. I suspect our brains do some of what my hypothesis entails, but my main goal is to see if learning can happen like this. Here's a screen shot from the program:


Screen shot from Pattern Sniffer program


You can download the Pattern Sniffer program and its source code. This is a VB.NET 2005 application. Once you unzip it, you will find the executable program at PatternSniffer\Ver_01\PatternSniffer\bin\Debug\PatternSniffer.exe. There is a PatternSniffer.exe.config file alongside it, which you can edit with a text editor to change certain settings, such as the number of neurons in the bank. There is a "Snapshots" subfolder, in case you wish to use the "Snapshot" button, not shown here.

The program's user interface is very simple, as seen above. The main feature is a set of gray boxes representing individual neurons in a single bank. The grid of varying shades of gray within each box represents that neuron's "dendrites". Input values in this program run from -1 to +1; in this UI, -1 is represented as white and +1 as black. Each dendrite has an "expectation" of what its input value should be for it to consider itself to match. In this example, there are 25 input values, hence 25 dendrites per neuron. The top left corner of the program features an input grid, also with 25 values. The user can click on this to toggle each pixel between black and white. You probably won't want to use that, though, as the program comes with a SourcePatterns.bmp file that has 25 5x5 gray-scale images on it, which you can edit. Following is a magnified version of SourcePatterns.bmp:


SourcePatterns.bmp, magnified 10 times


When you start the program, the neurons start out in a "naive" state. They know nothing and hence have nearly zero confidence (shown as a white box in each neuron display above). As you click the "Random Patch" button, the program picks one of the patterns in SourcePatterns.bmp, displays a representation of it in the input grid, presents it to the neuron bank for a moment of consideration, and updates the display to reflect changes in the neuron bank's state. Check the "Keep going" check box to make pushing this button happen automatically.

To be clear, while the program displays a two-dimensional grid of image data, the neurons have no awareness of a grid or of the input being graphical data. They only know they take a set of linear values as input. The inputs could be randomly reshuffled at the start with no impact on behavior. The grid and the choice of image data are simply there to help us visualize what is going on inside the bank.

You can control how many of the patterns in the source set are used by changing the "Use first" number. If you choose 3, for example, each click of the "Random Patch" button will select randomly from patterns 1, 2, and 3. At any time, you can change the "Pattern" number to select a specific pattern to work with. Clicking "Linger" causes the bank to go through a single moment of "pondering" the input, just as when the user clicks "Random Patch". With each moment of pondering, the bank becomes more "set" in what it knows. Clicking "Brainwash" brings the entire neuron bank back to its naive state.

The "Noise" setting is a value from 0 to 100% and controls how degraded the input pattern is when presented to the neuron bank. At 100%, one pattern is nearly indistinguishable from any other.


Learning in linear time

Let's start with a familiar and yet simplistic case of training and using our neuron bank. We begin with the naive state as follows:



Pattern 1 contains all white pixels. With the first click of "Linger", the neurons in the bank all try to determine which of them best matches this pattern. In this case, neuron 14 (n14) is most similar:



Because it "yells the loudest", it is rewarded by having its confidence level raised ever so slightly and by moving its dendrites' expectation levels closer to the input pattern. The lower the confidence, the more pliable the dendrites' expectations are to change. Since n14 has near zero confidence (-1), it conforms nearly 100% in this single step. Clicking "Linger" 7 more times, n14 continues to be the best match and so continue to increase its confidence until it is nearly full confidence (+1):



Now we move to pattern 2 and repeat this. Pattern 2 is all black pixels. n23 happens to be most like this pattern, so with repetition it learns it quickly:



Notice in the preceding how n14 is still expecting the white pattern and has a high level of confidence. Its expectations have shifted ever so slightly, indicated by the very faint gray boxes scattered within n14's display.

We continue this process for the first 6 patterns, picking one and lingering on it for 8 steps each, and end up with the following state:



You can quickly find the learned knowledge by looking for black confidence level boxes. At this point, you may wonder why the left, right, top, or bottom bar patterns would match neurons with randomized expectations better than, say, the solid white or solid black patterns. This has to do with the way matching occurs and is affected by a neuron's confidence level.

When the neuron bank is asked to "ponder" the current input, it goes through two steps; every neuron is processed in turn for the first step before the second step proceeds and each neuron is processed again. Step 1 is matching. It begins with each dendrite calculating its own match strength. The match strength is calculated as MaxSignal - Abs(Input - Expectation), where MaxSignal = 1. Thus, the closer the scalar input value is to the value expected by that dendrite, the closer the match strength will be to the maximum possible.

Things get interesting here. Before returning the match strength value, we alter it. If the strength is less than zero -- that is, if this dendrite finds the input value is very different -- then we "penalize" the match strength using Strength = Strength * Neuron.Confidence * 6. The final strength, whether adjusted or not, is divided by 6 to make sure the strength is never outside the min/max range of -1 to +1. So the more confident the neuron is in what it knows, the more strongly mismatched inputs will penalize the match value.
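Transcribed into Python (the original is VB.NET), the per-dendrite rule reads roughly as follows; the averaging of dendrite strengths into a single neuron score is my assumption, based on the "just add together the differences" description later in this entry.

    MAX_SIGNAL = 1.0

    def dendrite_strength(inp, expectation, neuron_confidence):
        strength = MAX_SIGNAL - abs(inp - expectation)
        if strength < 0:
            # A confident neuron punishes a badly mismatched input much harder.
            strength = strength * neuron_confidence * 6.0
        return strength / 6.0   # keep each contribution within roughly -1 to +1

    def neuron_match_strength(inputs, expectations, confidence):
        per_dendrite = [dendrite_strength(i, e, confidence)
                        for i, e in zip(inputs, expectations)]
        return sum(per_dendrite) / len(per_dendrite)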

So now, if I set "Use first" to 6 and check "Keep going", the program will continually run through these first 6 patterns that have been learned and will always match and reinforce them. So far, this is not very remarkable, as it is easy to make a program learn any number of distinct digital patterns. As we'll see, however, there's a lot more to this than this cheap parlor trick.

What is remarkable, however, is the time it takes to learn. AI systems that include learning often suffer exponential increases in learning time as the amount of information to learn increases linearly. In this simple demonstration, it does not matter how many novel patterns are exposed to the neuron bank. It will take the same number of steps of repetition to solidify a naive neuron's knowledge. One simple estimate would be that it takes 8 steps to learn each new pattern, when they are presented in this fashion.

There are caveats, to be sure. For one, the configuration for this demo has only 26 neurons, which means it can only learn up to 26 distinct patterns. For another, as time passes and a neuron is not "used" -- if it never matches anything -- it slowly loses confidence that it is still useful and begins to degrade until it finally is naive again. So there is a practical limit to how many patterns can be taught before there has to be a "refreshment" process to bolster the existing neurons' confidences.


All at once learning

The story changes when learning is done in bulk. Let's change the experiment a little to illustrate. First, we'll brainwash our neuron bank. Then we set "Use first" to 6, covering the same solid black and white patterns plus the left, right, top, and bottom bars we saw before. Now we'll step through the process for a while (using the "Random Patch" button). Below is a series of screen shots. Note the "Steps taken" number in each step.



















When we started out, all neurons were naive, meaning they had not learned any patterns and they had no confidence in what they "knew". So as a new pattern is introduced in each moment, there's usually a "virgin" neuron that's happy to match and claim that pattern for its own. But watch the sequence of events for each neuron that does this as time moves on. Each one degrades quickly. In step 1, n21 is the first neuron to match anything, namely the solid black pattern. Yet one step later, when the input has a new pattern, n21 is already starting to decay. By step 8, with no further reinforcement yet, n21 has decayed so much that there's a good chance if the next step brings the solid black pattern back, it may not be the best match for it any more.

However, reinforcement does build confidence. The right bar pattern has been seen 3 times in the above sequence. n5 was the first to see it and, thanks to reinforcement, it has a higher degree of confidence, so its expectation pattern is more likely to persist without further reinforcement. Still, its confidence is not at all high. Let's see what happens as time progresses and the patterns are seen more. Note the steps-taken number in each snapshot and how each learned neuron's confidence level grows with reinforcement:









OK. So after 80 steps, we have most of the patterns pretty well learned, save for the solid white pattern. By random chance, that one was simply not seen many times during this run. Still, this is markedly worse than when we spoon-fed the patterns one at a time. With 8 steps per pattern and 6 patterns, that learning process took only 48 steps. So maybe that's an indication that this is not a very good learning algorithm. But then, isn't the real world like this? And when we try this experiment with all 25 patterns thrown around at random, it may take thousands of steps to solidly learn them all, instead of the 200 it takes if we spoon-feed them.

But maybe this is exactly what we expect. Have you ever been in a room with someone speaking a language you don't understand? You may be exposed to hundreds of new words. If I asked you to repeat even three of them that you picked up (and did not already know), you might just shrug and tell me none of them really stuck. But if you asked one of the speakers to teach you one or two words, you might be able to retain them for the duration of the conversation and reliably repeat them. To use another analogy, consider a grade school English class. Would a teacher be more likely to expose the students to all of the vocabulary words at once and simply repeat them all every day, or instead to expose students to a small number of new vocabulary words each week? Clearly, learning a few new words a week is easier than learning the same several hundred all at once, starting from day one.

My interpretation of what's going on is that this neural network is behaving very much like our own brains do, in this sense. The more focused its attention is on learning a small number of patterns at one time, the faster it will learn them. This may seem like a weakness of our brains, but I don't think so. I believe this is one way our own brains filter out extraneous information. We're exposed to an endless stream of changing data. Some of it we already know and expect, but a lot of it is novel. Repetition, especially when it occurs in close succession, is a powerful way to suggest that a novel pattern is not random and therefore potentially interesting enough to learn. In fact, the very principle of rote learning seems to be based on this idea of hijacking this repetition-based learning system in our brains.


Learning while performing

As I mentioned in the introduction, I've long been bothered by the fact that most AI learning systems require a learning stage separate from a "performance" process. So far, we've been focused on learning with this novel sort of neural network I've made, and we'll continue to focus on that, but I want to stress that all the while that we are training this neural net, we are also watching it perform. Its only task, in this experiment, is to match patterns it sees.

One simple way to prove this point is to train the neuron bank on however many patterns you wish and then just check the "Keep going" box and watch it perform. Then, at some point, try adding one more pattern using the "Use first" number while it continues crunching away. It will eventually learn the new pattern, all the while still performing its main task of matching patterns. There is no cue we send to the neuron bank that we are introducing a new pattern. In fact, the neuron bank doesn't know any of these numbers we see on the screen. It doesn't, for example, know that we have 25 total patterns, or that we are only using 6 of them at the moment. We don't check any box saying, "you are now supposed to be learning". It just does both constantly; both learning and performing.


Noisy data

I said earlier that having a machine learn 6 digital image patterns is just a cheap programming parlor trick. But I said there is more to this. Numenta's Pictures demo app of their HTM concept is configured such that a single node adds a quantization point for each bit-level unique pattern it comes across. True, the HTM can be configured to be a little more relaxed and to consider two similar patterns to represent one and the same, but you have to program in the similarity threshold in advance of learning. So really, one is very likely to end up with a very large set of quantization points if the training data is noisy. And their own white paper states, "The system achieved 66 percent recognition accuracy on the test image set," hardly an impressive result. Traditional ANNs seem to be a little less sensitive to noise, but they aren't perfect, either.

The matching algorithm for this neural network is incredibly simple: just add together the differences between the expected and actual input values and multiply them by other basic factors like confidence level. But as you'll see in the following experiments, this makes it very competent at dealing with noise.

Let's start by setting "Noise" to 50% and brainwashing. We'll take the top bar pattern (#3) as our starting point and click "Linger" a few times. Watch what happens in the following sequence:



















Notice how n21's expectations, in step 1, look exactly like the first noisy version of the top bar that it sees? Yet in each successive step of learning, as it gets new noisy versions, its expectation shifts more toward the perfectly noise-free top-bar pattern. It's learning a mostly noise-free version of a pattern it never actually sees without noise!

Is this magic? Not at all. The noise is purely random, not structured. That means with each successive step, n21 is averaging out the pixel values and thus cancelling the noise. Now, n21 is also becoming more confident, though more slowly than it did when it saw the noise-free version. So with each passing moment, the pattern is changing more and more slowly. Eventually, it will become fairly solid.

Let's continue this experiment by training the bank with the first 6 patterns:



With manual spoon-fed learning of each of the 6 patterns, we get to step 90 with all 6 pretty solidly learned. We can now switch on the "Keep going" check box to let it cycle at random through all 6 patterns indefinitely, and it will continue to work just fine, with 100% accuracy (to be sure, I spot-checked; I didn't check the match accuracy at every step), in spite of the noise and all the neurons hungrily looking for new patterns to learn. Here it is after 150 unattended steps, still solid in its knowledge:



Now, we turn the noise level up to 75%. Watch how well it continues to work:



















Look back carefully at these 8 steps, because they are very telling. Remember: the neuron bank has no idea that I am still using the same 6 patterns I trained it on. Remember also that with a highly confident neuron, there is a high penalty for each poorly matched dendrite. Looking at the input patterns, I'm struck by how badly degraded they are and thus how difficult they are for me to match, yet the neuron bank seems to perform brilliantly. Only at step 155 do we finally see a pattern so badly degraded that the bank decides it's a novel one it might want to learn. Of course, that exact pattern is never going to be seen again, so this blip will be quickly forgotten and n8 will be free to try learning some other new pattern. In the other 7 steps, it matches the noisy input pattern correctly.

This isn't the end of the story, though. Noise filtering cuts both ways. Some unique patterns will be treated as simply noisy versions of known patterns. Take another look at the source patterns:


SourcePatterns.bmp, magnified 10 times


Near the bottom, there are four "arrow" patterns. To your eye, they probably look distinctly different from the side bar patterns (left, right, top, bottom) that we've been working with, but to this neural net, they are so similar that they are considered to be simply noisy versions of the bars. Or, conversely, the bars are seen as noisy versions of the arrows. Here's our neuron bank after a brainwashing and then learning the first 19 patterns, just before we get to the arrows. You can see that the first patterns to be learned (solid white and black) are starting to degrade:



Now to introduce one of the arrows to the bank. See how, in just a few steps, this confident neuron's expectations change to start looking like the arrow?











Longevity of information

Now that I've illustrated some of what this particular program can do and thus some of the potential capabilities for machine learning using this concept, I think I can more easily speak about some of its weaknesses and suggest some potential ways to overcome them.

For one thing, longevity is lacking. What one neuron learns in this particular demonstration can be unlearned within a few minutes of running without seeing that pattern again. That's obviously not desirable in a machine that may have a useful life of many years. But that doesn't mean it's a limitation of this type of system, per se. I set out to demonstrate not only how a neural network can learn while being productive, but also how unused neurons can be freed up to learn new things without any central control over resource allocation.

I did address this to some degree in the current algorithm, actually. As described earlier, a neuron loses confidence over time if it is unused, and therefore becomes more pliable to adjusting its expectations. However, the degree to which it loses confidence, in any given step, is determined in part by the best match value seen. That is, if some neuron has a very strong match of the current input pattern, then a non-matching neuron will not lose much confidence. If, however, none of the other neurons considers itself to be a strong match, that could potentially mean that there's a new pattern to learn, and so the non-matching neurons will lose confidence a little faster.
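A hedged sketch of that decay rule, again in Python rather than the program's VB.NET, and with decay rates that are purely illustrative:

    from dataclasses import dataclass

    @dataclass
    class Neuron:
        confidence: float          # -1 (naive) to +1 (fully confident)
        is_winner: bool = False    # did this neuron best match the current input?

    def decay_losers(neurons, best_match_strength, base_decay=0.01, extra_decay=0.04):
        """Non-matching neurons drift back toward naivete. The drift is faster
        when even the best match in the whole bank was weak, since that hints
        there may be a new pattern worth freeing up a neuron to learn."""
        pressure = base_decay + extra_decay * (1.0 - max(0.0, best_match_strength))
        for n in neurons:
            if not n.is_winner:
                n.confidence = max(-1.0, n.confidence - pressure)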

One way that this algorithm could be improved is by consideration of how "full" a neuron bank is of knowledge. Perhaps when a bank has a lot of naive neurons, those that are highly confident of what they know should be less likely to lose confidence. Conversely, when there are few or no neurons that remain naive, there could be a higher pressure to lose confidence. Perhaps this could further be adjusted based on the rate of novelty in input patterns, but that's harder to measure.

Perhaps there are higher level ways that memory could be evaluated for importance and, over time, exercised in order to keep it clean and strong.


Working memory

When I started making this program, I was not really considering the problem identified earlier in this blog entry of working memory versus long term memory. But in the course of building and testing Pattern Sniffer, it dawned on me that my neural network was displaying both short and long term learning within the same system. The key difference was not structure, locality, or anything so complicated, but simply repetition.

Yes, in the sample program, we are learning and matching simple visual patterns. But this same kind of memory could just as easily be used to learn a phone number sequence long enough to dial it. Or to remember a visual pattern long enough to match it to something else in the room. And, without heavy repetition, the neuron(s) that remember it will decay again into naivete, ready to learn some other pattern.


Pattern invariance

I think this sample program demonstrates well this kind of neural network's insensitivity to noisy data. What it clearly is not insensitive to, though, is patterns that have been subtly transformed: a transformed version of a learned pattern looks to it like an entirely new pattern.

With this program, I decided to use a small visual patch for demonstration purposes, in part because I thought it might be worth replicating the ability of our own retinas to detect and report strong edges and edge-like features at different angles, especially if the network could learn about edges all on its own. But I must admit this was also a cheat of the sort many AI researchers tackling vision make: forcibly constraining the source data to take advantage of easy-to-code techniques.

To their credit, the Numenta team have come up with a crafty way of discerning that different input patterns represent the same thing. They start with the assumption that "spatial" patterns appearing in close time succession very likely have the same "cause", and so such closely tied spatial patterns should be treated as effectively the same when reporting to higher levels of the brain.

I think the kind of neural network I've created in Pattern Sniffer can benefit from this concept as well. Implicitly, it already embraces the notion that the same pattern, repeated in close succession, has the same cause and is thus significant enough to learn. But being able to see that two rather different spatial patterns have a common cause could be very powerful. One way to do this would be to have a neuron bank above the first that is responsible for discovering two-step (or longer) sequences in the lower level's data. If, for example, the first level has 10 neurons, the second level could take 20 inputs: 10 for one moment of output and 10 more for the following moment. In keeping with Jeff Hawkins' vision of information flowing both up and down a neural hierarchy, the upper neuron bank, upon discovering such a temporal pattern, could "reward" the contributing lower level neurons by pushing up their confidence levels even faster. This higher level bank could even be designed to respond either to the whole sequence or to any one of its constituents, and thus serve as an "if I see A, B, or C, I'll treat them all as the same thing" kind of operation.
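
Here's a rough sketch of that two-level wiring, using only the NeuronBank members listed in the nuts-and-bolts section below (Inputs, Neurons, New, Ponder). The 25-input patch size, the assumption that the Inputs list comes pre-sized from the constructor, and the buffering of the previous moment's outputs are all mine; the downward "reward" path isn't shown because the current classes expose nothing for it.

Imports System.Collections.Generic

Module TwoLevelSketch
    Private lower As New NeuronBank(25, 10)  ' Raw patch in, 10 neurons out.
    Private upper As New NeuronBank(20, 10)  ' Two moments of 10 outputs in.
    Private previousOutputs(9) As Single

    Sub Moment(patch As List(Of Single))
        ' Feed the raw patch to the lower bank and let it ponder.
        For i = 0 To lower.Inputs.Count - 1
            lower.Inputs(i) = patch(i)
        Next
        lower.Ponder()

        ' The upper bank sees the previous moment's outputs alongside this moment's.
        For i = 0 To 9
            upper.Inputs(i) = previousOutputs(i)
            upper.Inputs(10 + i) = lower.Neurons(i).MatchStrength
            previousOutputs(i) = lower.Neurons(i).MatchStrength
        Next
        upper.Ponder()
    End Sub
End Module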

One thing I had originally envisioned but never implemented is the concept of "don't care". If you look at the source code, you'll notice each dendrite has not only an "expectation", but also a "care" property. The idea was that care would be a value from 0 to 1. Multiplying the match strength by the "care" value would effectively mean that the less a dendrite cares about its input value, the less it would contribute, positively or negatively, to the neuron's overall match strength. I was impressed enough with the results of the algorithm without this that I never bothered exploring it further. Honestly, I don't even know quite how I would use it. I had assumed that a neuron could strongly learn a pattern's essential parts and learn to ignore the nonessentials by observing that certain parts of a recurring pattern don't themselves recur. But that simply led me to wonder how a neuron bank would decide whether to allocate two or more neurons for pattern variants or to allocate a single neuron with those variants ignored. There's still room to explore this concept further, as it intuitively seems like something our own brains would do.
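
For what it's worth, here's one hedged guess at how "care" might plug into the dendrite's match-strength calculation shown in the pseudo-code below. The placement of the Care factor is mine, not something the program actually does.

Module CareSketch
    ' Hypothetical variant of the dendrite match-strength calculation from the
    ' pseudo-code below, with a Care weight from 0 to 1 folded in.
    Function MatchStrengthWithCare(input As Single, expectation As Single, care As Single, confidence As Single) As Single
        Dim strength As Single = 1.0F - Math.Abs(input - expectation)
        strength /= 6.0F
        ' Penalize strongly mismatched values, as in the original.
        If strength < 0 Then strength *= confidence * 6.0F
        ' Care = 1: full weight. Care = 0: this dendrite neither helps nor hurts.
        Return strength * care
    End Function
End Module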


More to explore

This is obviously not the end of this concept for me. I think one logical next area of exploration will be hierarchy. I also want to see whether, and what, even the current arrangement can learn when it is exposed to "real world" data. Even with noise added, the truth is that I'm feeding this thing carefully crafted, strong patterns that bear dubious relation to the messy sensory world we inhabit.

I certainly welcome others to dabble in this concept as well. You can play with this sample program yourself. The .config file gives you control over a bunch of factors, you can supply your own source-patterns graphic, and the program's user interface is fairly easy to extend for other experiments. The NeuronBank class and all of its lower level parts are self-contained and independent of the UI, which means they can easily be driven in other ways, with a different user interface or none at all. And the core code is surprisingly lightweight (only 3 classes) and heavily commented, so it should be easy to study and even reproduce in other environments.
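
As a taste of that, here's a minimal, UI-free sketch of driving a NeuronBank directly. It uses only the public members listed in the next section, assumes the Inputs list is pre-sized by the constructor, and the pattern data and sizes are invented for illustration.

Module HeadlessSketch
    Sub Main()
        Dim bank As New NeuronBank(4, 3)  ' 4 inputs, 3 neurons.
        bank.Brainwash()

        ' Two toy patterns; input values run from -1 to 1.
        Dim patterns As Single()() = {New Single() {1, 1, -1, -1}, New Single() {-1, -1, 1, 1}}

        ' Alternate the patterns for a while so the bank can learn them.
        For moment = 0 To 99
            Dim p = patterns(moment Mod 2)
            For i = 0 To bank.Inputs.Count - 1
                bank.Inputs(i) = p(i)
            Next
            bank.Ponder()
        Next

        ' See how well each neuron matches the last pattern shown.
        For Each n In bank.Neurons
            Console.WriteLine(n.MatchStrength)
        Next
    End Sub
End Module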

So we'll see what's next.


The nuts and bolts of the algorithm

I've tried to describe the concepts behind the Pattern Sniffer demonstration program in plain English and with visuals, but it's worth going into more detail for readers interested in how the algorithm actually works. I'll ignore the UI and test program and focus exclusively on the neuron bank and its constituent parts.

Following is a list of the classes and their essential public members:

  • NeuronBank:
    • Inputs As List(Of Single)
    • Neurons As List(Of Neuron)
    • New(InputCount, NeuronCount)
    • Brainwash()
    • Ponder()

  • Neuron:
    • Bank As NeuronBank
    • Dendrites As List(Of Dendrite)
    • MatchStrength As Single
    • Confidence As Single
    • New(Bank, ListIndex, DendriteCount)
    • Brainwash()
    • PonderStep1()
    • PonderStep2()

  • Dendrite:
    • ForNeuron As Neuron
    • InputIndex As Integer
    • Expectation As Single
    • MatchStrength As Single
    • New(ForNeuron, InputIndex)
    • Brainwash()


Next is the algorithm for behavior. Aside from basic maintenance like the .Brainwash() methods, there is really only one operation that the neuron bank and all its parts perform. Each "moment", the input values are set and the neuron bank "ponders" the inputs. Here's a pseudo-code summary of how it works, with all the methods and properties mashed into one chunk so the process reads in a linear fashion. First, the short version:


Loop endlessly

Set values in Bank.Inputs (each value is a single floating point number from -1 to 1)

Sub Bank.Ponder()
For Each N in Me.Neurons
N.PonderStep1() (Measure the strength of my own match to the current input.)
Next N
For Each N in Me.Neurons
N.PonderStep2() (Adjust my confidence level and dendrite expectations.)
Next N
End Sub

For Each N In Bank.Neurons
Do something with N.MatchStrength
Next

Continue looping


And now the more detailed version, fleshing out PonderStep1() and PonderStep2():


Loop endlessly

Set values in Bank.Inputs (each value is a single floating point number from -1 to 1)

Sub Bank.Ponder()
For Each N in Me.Neurons

Sub N.PonderStep1()
'Measure the strength of my own match to the current input.

'Add up all the dendrite strengths.
For Each D in Me.Dendrites
Strength = Strength + D.MatchStrength

Function D.MatchStrength() As Single
Input = ForNeuron.Bank.Inputs(Me.InputIndex)

Strength = 1 - AbsoluteValue(Input - Me.Expectation)
Strength = Strength / 6

'Penalize strongly mismatched values.
If Strength < 0 Then
Strength = Strength * ForNeuron.Confidence * 6
End If

Return Strength
End Function D.MatchStrength()

Next D

'Divide the total to get the average dendrite strength.
Strength = Strength / Me.Dendrites.Count

'Maybe I am the new best match.
If Strength > Bank.BestMatchValue Then
Bank.BestMatchValue = Strength
Bank.BestMatchIndex = Me.ListIndex
End If

Me.MatchStrength = Strength
End Sub N.PonderStep1()

Next N
For Each N in Me.Neurons

Sub N.PonderStep2()
'Adjust my confidence level and dendrite expectations.

If Me.ListIndex = Bank.BestMatchIndex Then 'I have the best match

'Boost my confidence a little.
Me.Confidence = Me.Confidence + 0.8 * Me.MatchStrength
If Me.Confidence > 0.9 Then Me.Confidence = 0.9 'Maximum possible confidence.

For i = 0 To Me.Dendrites.Count - 1
D = Me.Dendrites(i)
Input = Bank.Inputs(i)

'How far away is this dendrite's value from what's expected?
Delta = Input - D.Expectation

'The more confident I am, the less I want to deviate from my current expectation.
Delta = Delta * (1 - Me.Confidence)
D.Expectation = D.Expectation + Delta
Next i

Else 'I don't have the best match

'I should lose confidence more when no other neuron has a strong match.
Me.Confidence = Me.Confidence - 0.001 * (1 - Bank.BestMatchValue)
If Me.Confidence < 0.05 Then Me.Confidence = 0.05 'Minimum possible confidence.

For i = 0 To Me.Dendrites.Count - 1
D = Me.Dendrites(i)
Input = Bank.Inputs(i)
If Bank.BestMatchValue - Me.MatchStrength <= 0.1 Then
'I must be pretty close to the current best match.

'Get more random.
D.Expectation = D.Expectation + RandomPlusMinus(0.05) * (1 - Me.Confidence)

Else 'I don't strongly match the current input.

'How far away is this dendrite's value from what's expected?
Delta = Input - D.Expectation

'The more confident I am, the less I want to deviate from current expectation.
Delta = Delta * (1 - Me.Confidence)

'Get a little closer to the current input value.
D.Expectation = D.Expectation + RandomPlusMinus(0.00001) + Delta * 0.2
End If
Next i

End If 'Do I have the best match or no?

End Sub N.PonderStep2()


Next N
End Sub

For Each N In Bank.Neurons
Do something with N.MatchStrength
Next

Continue looping
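
To make the arithmetic concrete with one worked example from the code above: a dendrite whose expectation is 1 and whose input is also 1 scores (1 - 0) / 6, or about 0.167. If instead the input is -1, the absolute difference is 2, so it scores (1 - 2) / 6, or about -0.167, and because that's negative it gets multiplied by Confidence * 6; for a confident neuron (Confidence = 0.9) that works out to -0.9. In the averaged match strength, then, one badly wrong input on a confident neuron cancels out roughly five perfect ones, which is why a confident neuron so readily rejects patterns that aren't its own.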


It might be entertaining to try to boil this down to a few lengthy mathematical formulas, but I usually find those more intimidating than helpful.