
Saturday, July 30, 2005

Patch equivalence

[Audio Version]

As I've been dodging about among areas of machine vision, I've been searching for similarities among the techniques those areas could employ. I think I've started to see at least one important similarity. For lack of a better term, I'm calling it "patch equivalence", or "PE".

The concept begins with a deceptively simple assertion about human perception: that there are neurons (or tight groups of them) that do nothing but compare two separate "patches" of input to see if they are the same. A "patch", generally, is just a tight region of neural tissue that brings input information from a region of the total input. With one eye, for example, a patch might represent a very small region of the total image that that eye sees. For hearing, a patch might be a fraction of a second of time spent listening to sounds within a somewhat narrow band of frequencies, as another example. A "scene", here, is a contiguous string of information that is roughly continuous in space (e.g., the whole image seen by one eye in a moment) or time (e.g., a few seconds of music heard by an ear). The claim here is that for any given patch of input, there is a neuron or small group of them that is looking at that patch and at another patch of the same size and resolution, but somewhere else in the scene. Further, that neuron (group) is always looking in the same pair of places at any given time. It doesn't scan other areas of the scene; just the pair of places it knows. We'll call this neuron or small group of neurons a "patch comparator".

From an engineering perspective, the PE concept is both seductively simple and horribly frightening. If I were designing a hardware solution from scratch, I imagine it would be quite easy to implement, and could execute very quickly. When I think about a software simulation of such a machine, though, it's clear to me that it would be terribly slow to run. Imagine every pixel in the scene having a large number of patch comparators associated with it. Each one would look at a small patch - maybe 5 x 5 pixels, for instance - around that pixel and at the same size patch somewhere else in the scene. One comparator might look 20 pixels to the left, another might look 1 pixel above that, another 2 pixels above, and so on until there's a sufficient amount of coverage within a certain radius around the central patch being compared. There could literally be thousands of patch comparisons done for just one single pixel in a single snapshot. Such an algorithm would not perform very quickly, to say the least.

Let's say the output of each patch comparator is a value from 0 to 1, where 1 indicates that the two patches are, pixel for pixel, identical and 0 means they are definitively different. Any value in between indicates varying degrees of similarity.
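To make that concrete, here's a minimal sketch in Python (using NumPy) of what a single patch comparator might compute and how a bank of fixed-offset comparators could be attached to one pixel. This is purely illustrative code of my own; the function names, the 8-bit grayscale assumption, and the mean-absolute-difference measure are all assumptions I'm making for the example.

```python
import numpy as np

def compare_patches(patch_a, patch_b):
    """Hypothetical patch comparator: returns 1.0 when two same-sized patches
    are pixel for pixel identical, falling toward 0.0 as they differ.
    Assumes 8-bit grayscale values (0-255)."""
    diff = np.mean(np.abs(patch_a.astype(float) - patch_b.astype(float)))
    return 1.0 - diff / 255.0

def comparators_for_pixel(image, y, x, offsets, size=5):
    """Run every comparator attached to the pixel at (y, x): each one compares
    the size-by-size patch around (y, x) with the same-sized patch displaced
    by its one fixed offset (dy, dx). Returns a score per offset."""
    half = size // 2
    center = image[y - half:y + half + 1, x - half:x + half + 1]
    scores = {}
    for dy, dx in offsets:
        other = image[y + dy - half:y + dy + half + 1,
                      x + dx - half:x + dx + half + 1]
        if other.shape == center.shape:  # skip offsets that fall off the image
            scores[(dy, dx)] = compare_patches(center, other)
    return scores
```

Even this toy version makes the cost obvious: with thousands of offsets per pixel and every pixel in the scene, the number of comparisons explodes exactly as described above.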

One might well ask what the output of such a process is. What's the point? To be honest, I'm still not entirely sure, yet. It's a bit like asking what a brain would do with edge detection output. To my knowledge, nobody really knows in much detail, yet.

Still, I can easily see how patch equivalence could be used in many facets of input processing. Consider binocular vision, for example. You've got images coming from both eyes, and you generally want to match up the objects you see in each eye, in part to help you know how far away each one is. One patch comparator could be looking at one place in one eye and the same place in the other. Another comparator could then be looking at the same place in the left eye as before, but at a different place in the right eye, for instance. Naturally, there would be all sorts of "false positive" matches. But if we survey a bunch of comparators that are looking at the same offset and most of them are seeing matches at that offset, we would take the consensus as indicating a likelihood that we have a genuine match. We'd throw out all the other spurious matches as noise, for lack of a regional consensus.
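As a sketch of how that regional consensus might be tallied, the following hypothetical function reuses compare_patches from the sketch above and votes over a set of candidate horizontal offsets for a region of pixels. The region, the candidate offsets, and the 0.9 match threshold are assumptions for illustration, not anything worked out beyond the description above.

```python
def disparity_consensus(left, right, region, candidate_offsets, threshold=0.9):
    """For each candidate horizontal offset, count how many left/right patch
    comparators in the region report a match; the offset with the broadest
    consensus is taken as the region's likely disparity."""
    votes = {off: 0 for off in candidate_offsets}
    for (y, x) in region:
        left_patch = left[y - 2:y + 3, x - 2:x + 3]
        for off in candidate_offsets:
            right_patch = right[y - 2:y + 3, x + off - 2:x + off + 3]
            if (left_patch.shape == right_patch.shape
                    and compare_patches(left_patch, right_patch) >= threshold):
                votes[off] += 1
    # Isolated matches at odd offsets are treated as noise; only the offset
    # most of the region agrees on survives.
    return max(votes, key=votes.get)
```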

Pattern detection is another example of where this technique can be used. Have you ever studied a printed table of numeric or textual data where one column contains mostly a single value (e.g., "100" or "Blue")? Perhaps it's a song playlist with a dozen songs from one album, followed by a dozen from another. You scan down the list and see that the name of the first album is the same for the upper dozen songs. You don't even have to read them, because your visual system tells you they all look the same. That's pretty amazing, when you think about it. In fact, I've found I can scan down lists of hundreds of identical things looking for an exception, and can do it surprisingly quickly. It's not special to me, of course; we all can. How is it that my eyes instantly pick up the similarity and call out the one item that's different? It's a repeating pattern, just like a checkerboard or bathroom tiles. A patch equivalence algorithm would find excellent use here. Given an offset roughly equal to the distance between the centers of two neighboring lines of text, a region of comparators would quickly come to a consensus that there's equivalence at that offset. Because the two patches come from the same eye at the same time, the conclusion would be that they're probably part of a repeating pattern. As a side note, this doesn't sufficiently explain how we detect less regular patterns, like a table full of differently colored candies, but I suspect PE can play a role in explaining that, too.
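Here's one way such a same-eye, same-moment consensus might be sketched: compare an image of the list with vertically shifted copies of itself and look for a shift at which the two largely agree. The shift range and the similarity measure are assumptions of mine for the sake of the example.

```python
def repetition_offset(image, min_shift=5, max_shift=60):
    """Search for a vertical offset at which the image largely agrees with a
    shifted copy of itself -- the consensus a bank of fixed-offset comparators
    would reach over a printed list or a tiled pattern. Assumes an 8-bit
    grayscale image."""
    best_offset, best_score = None, 0.0
    for dy in range(min_shift, max_shift):
        upper = image[:-dy].astype(float)
        lower = image[dy:].astype(float)
        score = 1.0 - np.mean(np.abs(upper - lower)) / 255.0
        if score > best_score:
            best_offset, best_score = dy, score
    return best_offset, best_score
```

A strong score at an offset close to the line spacing would be the cue that the column repeats, without reading any of the entries.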

What about motion? PE can help here, too. Imagine a layer of PE comparators that compare the image seen by one eye now with the image that same eye saw a fraction of a second ago. A ball is moving through the scene, so the ball sits in one place in one image and perhaps sits a little to the right of that in the next image. Again, one region of patch comparators that sees the ball in its before and after positions lights up in consensus and thus effectively reports the position and velocity of the moving object.
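A sketch of that report, again reusing the hypothetical compare_patches helper: each comparator in a region watches one fixed offset between the previous frame and the current one, the offset that fires the most comparators doubles as the velocity estimate, and the centroid of the firing comparators stands in for the position. The patch size, threshold, and offset set are illustrative assumptions.

```python
def region_motion(prev_frame, curr_frame, region, offsets, threshold=0.9):
    """Fixed-offset temporal comparators over a region: each one checks whether
    the patch it watched a moment ago reappears at its single assigned offset
    now. The offset with the most firing comparators is the region's apparent
    velocity; the centroid of those comparators approximates the position."""
    firing = {off: [] for off in offsets}
    for (y, x) in region:
        before = prev_frame[y - 2:y + 3, x - 2:x + 3]
        for (dy, dx) in offsets:
            after = curr_frame[y + dy - 2:y + dy + 3, x + dx - 2:x + dx + 3]
            if (after.shape == before.shape
                    and compare_patches(before, after) >= threshold):
                firing[(dy, dx)].append((y, x))
    velocity = max(firing, key=lambda off: len(firing[off]))
    hits = firing[velocity]
    position = tuple(np.mean(hits, axis=0)) if hits else None
    return position, velocity
```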

I've focused on vision, but I do believe the patch equivalence concept can apply to other senses. Consider the act of listening to a song. The tempo is easily detected very quickly for most songs, and that alone can be explained by reference to PE comparators that are looking at linear patches of frequency responses at different time offsets. Or they could be looking not at low-level frequency responses, but instead at recognized patterns that represent snippets of instruments at different frequencies. In fact, it may well be that we mainly come to recognize distinct sounds as distinct only because they are repeated. A comparator might be looking at one two-dimensional patch that's actually made up of several frequency bands in a small snippet of time and at the same kind of patch at a different point in time. If it sees the exact same response in both moments, that fact could result in saving the patch's pattern in short-term memory for later. More repetitions could continue to reinforce this pattern until it's saved for longer-term recollection.
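For the tempo idea specifically, a crude sketch: treat a spectrogram (frequency bands by time frames) as the input scene and compare it with time-shifted copies of itself, looking for the lag at which the two agree best. The lag range, the normalization, and the idea of reading tempo straight off the winning lag are simplifying assumptions of mine; real music would need something less naive.

```python
def estimate_tempo(spectrogram, frames_per_second, min_lag=10, max_lag=200):
    """PE-style tempo guess: find the time lag (in spectrogram frames) at which
    the spectrogram best matches a shifted copy of itself, then convert that
    lag to beats per second."""
    best_lag, best_score = min_lag, -1.0
    peak = np.max(spectrogram) + 1e-9  # normalize by the loudest response
    for lag in range(min_lag, max_lag):
        a = spectrogram[:, :-lag]
        b = spectrogram[:, lag:]
        score = 1.0 - np.mean(np.abs(a - b)) / peak
        if score > best_score:
            best_lag, best_score = lag, score
    return frames_per_second / best_lag  # beats per second at the winning lag
```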

This same principle of selecting patch patterns that repeat in space or time provides a strong explanation of how patterns would come to be considered important enough to remember. This is a rather hard problem in AI right now, in large part because selecting important features seems to presuppose the idea that you can find punctuations between features -- an a priori definition of "important" -- like pauses between words or empty spaces around objects in a scene. Using PE, this may not even be necessary, and PE potentially provides a more amorphous conception of what a boundary really is.

Tuesday, July 12, 2005

Machine vision: motion-based segmentation

[Audio Version]

I've been experimenting, with limited success, with different ways of finding objects in images using what some vision researchers would call "preattentive" techniques, meaning not involving special knowledge of the nature of the objects to be seen. The work is frustrating in large part because of how confounding real-world images can be to simple analyses and because it's hard to nail down exactly what the goals for a preattentive-level vision system should be. In machine vision circles, this is generally called "segmentation", and usually refers more specifically to segmentation of regions of color, texture, or depth.

Jeff Hawkins (On Intelligence) would say that there's a general-purpose "cortical algorithm" that starts out naive and simply learns to predict how pixel patterns will change from moment to moment. Appealingly simple as that sounds, I find it nearly impossible to square with all I've been learning about the human visual system. From all the literature I've been trying to absorb, it's become quite clear that we still don't know much at all about the mechanisms of human vision. We have a wealth of tidbits of knowledge, but still no comprehensive theory that can be tested by emulation in computers. And it's equally clear nobody in the machine vision realm has found an alternative pathway to general purpose vision, either.

Segmentation seems a practical research goal for now. There has already been quite a bit of research into segmentation based on edges, on smoothly continuous color areas, on textures, and on binocular disparity. I'm choosing to pursue something I can't seem to find literature on: segmentation of "layers" in animated, three dimensional scenes. Donald D. Hoffman (Visual Intelligence) makes the very strong point that our eyes favor "generic views". If we see two lines meeting at a point in a line drawing, we'll interpret the scene as representing two lines that meet at a point in 3D space, for example. The lines could instead be interpreted as having endpoints that only appear to meet from this viewpoint while actually being far apart along the Z axis, but the concept of generic views says that that sort of coincidence would be so statistically unlikely that we can assume it just doesn't happen.

The principle of generic views seems to apply in animations as well. Picture yourself walking along a path through a park. Things around you are not moving much. Imagine you take a picture once for every step you take in which the center of the picture is always fixed on some distant point and you are keeping the camera level. Later, you study the sequence of pictures. For each pair of adjacent pictures in the sequence, you visually notice that very little seems to change. Yet when you inspect each pixel of the image, a great many of them do change color. You wonder why, but you quickly realize what's happening is that the color of one pixel in the before picture has more or less moved to another location in the after picture. As you study more images in the sequence, you notice a consistent pattern emerging. Near the center point in each image, the pixels don't move very much from frame to frame and the ones farther from the center tend to move in ever larger increments and almost always in a direction that radiates away from the center point.

You're tempted to conclude that you could create a simple algorithm to track the components of sequences captured in this way by simply "smearing" the previous image's pixels outward using a fairly simple mathematical equation based on each pixel's position with respect to the center, but something about the math doesn't seem to work out quite right. With more observation, you notice that trees and rocks alongside the path that are nearer to you than, say, the bushes behind them act a little differently. Their pixels move outward slightly faster than those of the bushes behind them. In fact, the closer an object is to you as you pass it, the faster its pixels seem to morph their way outward. The pixels in the far off hills and sky don't move much at all, for example.
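To give the "smearing" intuition a slightly more concrete form: under the idealized assumptions of pure forward motion toward a fixed center of expansion, a point's image motion points away from that center and scales with walking speed divided by the point's depth. The sketch below just writes that relation out; the names and per-frame units are my own, and real footage would of course violate the idealization constantly.

```python
def predicted_radial_flow(y, x, center, depth, speed):
    """Idealized flow for the walking example: a point at image position (y, x)
    with scene depth `depth` moves away from the center of expansion at a rate
    proportional to speed / depth. Nearer rocks therefore smear outward faster
    than the bushes behind them, and distant hills barely move at all."""
    center_y, center_x = center
    gain = speed / depth  # the depth dependence that breaks any single smear equation
    return gain * (y - center_y), gain * (x - center_x)  # per-frame pixel offset (rows, cols)
```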

At one point during the walk, you took a 90° left turn in the path and began fixating the camera on a new point. The turn took about 40 frames. During the turn, you lost that fixed central point, but the intermediate frames seem to act in the same sort of way. This time, though, instead of smearing radially outward from a central point, the pixels appear to be shoved rapidly to the right of the field of view. It's almost as though there were a very large bitmap image of which you could only see a small rectangle that was sliding over that larger image.

By now, I hope I've impressed on you the idea that in a video stream of typical events in life, much of what is happening from frame to frame is largely a subtle shifting of regions of pixels. Although I've been struggling lately to figure out an effective algorithm to take advantage of this, I am fairly convinced it is one of the basic operations that may be going on in our own visual systems. And even if it's not, it seems to be a very valuable technique to employ in pursuit of general purpose machine vision. There seem to be at least two significant benefits that can be gained from application of this principle: segmentation and suppression of uninteresting stimuli.

[Image: Dalmatian hidden in an ambiguous scene]

Consider segmentation. You've probably seen a variant of the "hidden dalmatian" image at right, in which the information is ambiguous enough that you have to look rather carefully to grasp what you are looking at. What makes such an illusion all the more fascinating is the version that starts out as an even more ambiguous still image and then breaks into animation as the dog walks. The dog jumps right out of ambiguity. (Unfortunately, I couldn't find a video of it online to show.) I'm convinced that the reason the animated version is so much easier to process is that the dog as a whole and its parts move consistently along their own paths from moment to moment while the background moves along its own path, and so we see the regions as separate. What's more, I'm confident we also instantly grasp that the dog is in front of the background and not the other way around, because we see parts of the background disappearing behind parts of the dog, while the dog's parts never get occluded by the background.

Motion-based segmentation of this sort seems more computationally complicated than, say, using just edges or color regions, but it carries with it this very powerful value of clearly placing layers in front of or behind one another. What's more, it seems it should be fairly straightforward to take parts that get covered up in subsequent frames and others that get revealed to actually build more complete images of parts of a scene that are occasionally covered by other things.

For another way of looking at why motion-based segmentation of this sort is special, consider the fact that it lets something that might otherwise be very hard to segment out using current techniques, such as a child against a graffiti-covered wall, stand out in a striking fashion as it moves in some way different from its background.

Now consider suppression of uninteresting stimuli. It seems in humans that our gaze is generally drawn to rapid or sudden motions in our fields of view. It's easy to see this by just standing around in a field as birds fly about, for instance, or on a busy street, for another. What's more, rapid, unexpected motions that appear in even the farthest periphery of your visual field are likely to draw your attention away from otherwise static views in front of you. If you wanted to implement this in a computer, it would be pretty easy if the camera were stationary. You simply make it so each pixel slowly gets used to its ambient color and gets painted black as long as it matches it. Only pixels that vary dramatically from the ambient color get painted some other color. Then you would use fairly conventional techniques to measure the size and central position of such moving blobs. But what if the camera were in the front windshield watching ahead as you drive? If you could identify the different segments that are moving in their own ways, you could probably fairly quickly get around to ignoring the ambient background. Things like a car changing lanes in front of you or a street sign passing overhead would be more likely to stand out because of their differing relative motions.
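The stationary-camera version amounts to running-average background subtraction; here is a minimal sketch of that idea, with the learning rate and difference threshold chosen arbitrarily for illustration and a grayscale frame assumed.

```python
def update_background(background, frame, alpha=0.02, threshold=30):
    """One step of 'getting used to' the ambient color: the background estimate
    drifts slowly toward each new frame, and only pixels that differ sharply
    from it are kept. Everything else is painted black."""
    background = (1 - alpha) * background + alpha * frame.astype(float)
    moving = np.abs(frame.astype(float) - background) > threshold
    foreground = np.where(moving, frame, 0)  # moving blobs stay; the rest goes black
    return background, foreground
```

Conventional blob detection over the foreground image would then give the size and central position of whatever is moving.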

I'm in the process of trying to create algorithms to implement this concept of motion-based visual segmentation. To be honest, I'm not having much luck. This may be in part because I haven't had much time to devote to it, but it's surely also because it's not easy. So far, I've experimented a little with the idea of searching the entire after image for candidate locations where a pixel in the before image might have gone, in the hopes of narrowing down the possibilities by considering that pixel's neighbors' own candidate locations. Each candidate location would be expressed as an offset vector, which means that neighboring candidates' vectors can easily be compared to see how different they are from one another. When neighboring pixels all move together, they will have identical offset vectors, for instance. I haven't completed such an algorithm, though, because it's not apparent to me that this would be enough without a significant amount of crafty optimization. The number of candidates seems to be quite large, especially if all the pixels in the after image are potential candidates for movement of each pixel in the before image.
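For what it's worth, here is roughly the shape such an experiment might take in code, again reusing the hypothetical compare_patches helper from the patch-equivalence sketch above. The bounded search radius, patch size, and neighbor-agreement rule are assumptions standing in for the "crafty optimization" the full any-pixel-to-any-pixel search would need.

```python
def candidate_offsets(before, after, y, x, search=10, size=5, threshold=0.9):
    """Collect every plausible offset vector for the pixel at (y, x): offsets
    whose patch in the after image matches the patch around (y, x) in the
    before image well enough."""
    half = size // 2
    src = before[y - half:y + half + 1, x - half:x + half + 1]
    candidates = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            dst = after[y + dy - half:y + dy + half + 1,
                        x + dx - half:x + dx + half + 1]
            if dst.shape == src.shape and compare_patches(src, dst) >= threshold:
                candidates.append((dy, dx))
    return candidates

def consistent_with_neighbors(vector, neighbor_candidates, tolerance=2):
    """Keep a candidate offset only if most neighboring pixels have some
    candidate within a small tolerance of it -- pixels that move together
    should have nearly identical offset vectors."""
    agreeing = sum(
        1 for neighbor in neighbor_candidates
        if any(abs(vector[0] - v[0]) + abs(vector[1] - v[1]) <= tolerance
               for v in neighbor)
    )
    return agreeing >= len(neighbor_candidates) // 2
```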

One other observation I've made that could have an impact on improving performance is that most objects that can be segmented out using this technique probably have fairly strongly defined edges around them anyway. Hence, it may make sense to assume that the pixels around one pixel will probably be in the same patch as that one unless they are along edge boundaries. Then, it's up for grabs. Conversely, it may be worthwhile considering only edge pixels' motions. This seems like it would garner more dubious results, but may be faster because it could require consideration of fewer pixels. One related fact is that the side of an edge that lies on the nearer region should remain fairly constant, while the side in the farther region will change over time as parts of the background are occluded or revealed. This fact may help in identifying which apparent edges represent actual boundaries between foreground and background regions and particularly in determining which side of the edge is foreground and which background.

I'm encouraged, actually, by a somewhat related technology that may be applicable to this problem. I suspect that this same technique is used in our own eyes for binocular vision. That is, the left and right eye images in a given moment are a lot like adjacent frames in an animation: subtly shifted versions of one another. Much hard research has gone into making practical 3D imaging systems that use two or more ordinary video cameras in a radial array around a subject, such as a person sitting in a chair. Although the basic premise seems pretty straightforward, I've often wondered how they actually figure out how a given pixel maps to another one. The constraint that pixels shift only in the horizontal direction probably helps a lot, but I strongly suspect that people with experience developing these techniques would be well placed to engineer a decent motion-based segmentation algorithm.

One feature that will probably confound the attempt to engineer a good solution is the fact that parts of an image may rotate as well as shift and change apparent size (moving forward and backward). Rotation means that the offset vectors for pixels will not simply be nearly identical, as in simple shifting. Instead, the vectors should vary subtly from pixel to pixel, rather like a color gradient in a region shifting subtly from green to blue. The same should go for changes in apparent size. The good news is that once the pixels are mapped from before to after frames, the "offset gradients" should tell a lot about the nature of the relative motion. It should, for example, be fairly straightforward to tell if rotation is occurring and find its central axis. And it should be similarly straightforward to tell if the object is apparently getting larger or smaller and hence moving towards or away from the viewer.
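One conventional way to read those offset gradients, offered here as a hedged sketch rather than a worked-out part of the approach: fit a small affine model to a segment's offset vectors and inspect the fitted matrix. Its trace signals apparent growth or shrinkage, its antisymmetric part signals rotation, and the model's fixed point would locate the rotation center.

```python
def fit_affine_flow(positions, offsets):
    """Least-squares fit of offset = A @ position + t over one segment's
    (position, offset) pairs. trace(A) > 0 suggests the segment is growing
    (approaching); the antisymmetric part of A suggests rotation."""
    P = np.asarray(positions, dtype=float)          # N x 2 pixel positions (row, col)
    D = np.asarray(offsets, dtype=float)            # N x 2 offset vectors
    X = np.hstack([P, np.ones((len(P), 1))])        # extra column for the translation t
    coeffs, *_ = np.linalg.lstsq(X, D, rcond=None)  # 3 x 2 solution matrix
    A, t = coeffs[:2].T, coeffs[2]
    divergence = A[0, 0] + A[1, 1]                  # expansion (+) or contraction (-)
    curl = A[1, 0] - A[0, 1]                        # nonzero when the segment is rotating
    return A, t, divergence, curl
```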