Monday, July 18, 2011

The Challenge of Conventional Wisdom

Spacesuits and glove boxes

I believe that the spacesuits currently being used by NASA (and the Russians, for that matter) are in need of complete redesign. Spacesuits are intended to allow a person in a comfortable environment to manipulate objects in a nearby hazardous environment. This is exactly what a “glove box” used for sand blasting, electronics production or chemical handling does here on earth. Why not just invert the view and put the astronaut inside the glove box? This means we get rid of the “suit” concept. Why does an astronaut in zero G need his legs? And separate boots? Why can’t he take his hands out of the gloves? The biggest complaint that spacewalking astronauts have is that their hands get cold and tired.

My suggestion is that the entire torso and leg section of the suit be replaced with a rigid canister that the astronaut could rest inside. Perhaps when he “stands up” his head is positioned properly inside the helmet area so he can see, and his arms reach properly into the gloves. When he “sits down”, perhaps cross-legged, he has an area where he could eat, drink, relax, and so on. All he would really need is some sort of rigid grapple mounted on the outside of the chest area so he could clamp the suit in position to the structure he is working on. This would keep his motions inside the suit from starting a spin or other undesirable attitude.

A small hole in a glove is very dangerous. Currently, you would have to try to patch it against the escaping air. If you could get your hand out of the glove and apply a patch to the inside it would tend to be self-sealing.
This design would not be significantly different in volume from the current suits, so the power and air handling systems would be about the same. It would also fit through standard hatches and could be positioned by the same station or shuttle robot arms.

Why hasn’t anyone done a design like this? Because it cannot be tested, except in orbit. Astronauts are specifically trained for each EVA, simulating the situations they might encounter while working in space. Current suit designs are used extensively in EVA training sessions underwater. This simulates weightlessness in the general sense, but the astronaut is not weightless inside the suit. No astronaut could get valid experience using a glove-box type design until he was actually weightless. Therefore, any realistic design, experimentation, construction, testing, and revision would have to be done in space. There simply are not the facilities, resources, time, or personnel to do this work safely in orbit today. So we are left with the rather ludicrous legacy suits. They are designed to work in two completely different, incompatible environments and do a poor job in both.

A rapid prototyping facility in orbit would allow people in space to build the tools that they envision. And their vision will be completely different from the vision of engineers on earth. Using some of these Replicator concepts should help to make such a facility safe and sustainable.

* * *

Many of the systems in use today are the result of legacy discoveries or observations. In many cases entire industries have been founded on the basis of one particular observation or another. As these industries grow and become more ingrained in society, it becomes more and more difficult to rethink the basic premises. I am concerned that many of the foundations of modern society are based on such fundamental flaws that much of what is being produced today is at risk of catastrophic obsolescence.

I will give two areas of concern, broadly termed Auditory Systems and Vision Systems. I believe we must look to a future when all of our current multimedia industries will seem as though they were turning out daguerreotypes and recordings on wax cylinders.

Auditory Systems

Stereo audio systems are based on the simple observation that people have two ears. The assumption is that spatial discrimination is based on differences in timing and amplitude of the sounds heard by each ear. And you can create some real “Wow” effects by artificially increasing the separation of the two channels.
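To make that assumption concrete, here is a rough sketch in Python of the timing cue a two-channel system is trying to mimic. It uses the textbook Woodworth spherical-head approximation; the head radius and speed of sound are just nominal assumed values.

    import math

    def interaural_time_difference(azimuth_deg, head_radius_m=0.0875, c_m_per_s=343.0):
        # Woodworth's spherical-head approximation: the extra path to the
        # far ear is r * (theta + sin(theta)), divided by the speed of sound.
        theta = math.radians(azimuth_deg)
        return (head_radius_m / c_m_per_s) * (theta + math.sin(theta))

    # A source 45 degrees off-center arrives at the far ear roughly
    # 0.38 milliseconds late, the cue that level panning between two
    # fixed speakers can only crudely approximate.
    print(f"{interaural_time_difference(45) * 1000:.2f} ms")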

Unfortunately, this is not all that goes on in audio perception. In the real world, individuals interact with their environment. They turn their head. They use a multitude of cues to identify the location and nature of a sound. This is why surround-sound systems are increasingly popular. They are slightly more realistic.

Human beings are really very good at processing auditory information. Most people just do not realize it. You can tell the difference between standing in an empty room and one with furniture in it, just by the ambient sound. You can tell how close you are to a wall by the sound if you practice for a little while.

Your home theater system will have an audio “sweet spot” where the sounds are most like what the producers intended. If you move away from that spot the audio illusion does not travel with you any more than the visual one does. You get a distorted presentation and the visual cues and audio cues will be mismatched. Some people can take this in stride, like enjoying a roller coaster ride. It makes other people nauseated.

For our purposes here, the problem is that this audio information is insufficient. Each listener needs to be able to interact with the environment. Their own perceptual systems are unique. The spectral responses of their ears and the changes they expect as they move, breathe, swallow, etc. are all unique and affect the believability of the illusion in subtle ways. Just putting a pounding bass line on a subwoofer does not make a believable motorcycle engine.

Another area of glaring deficiency is the synthesis of voices. One would think that creating believable voices would be much simpler than visual animation. After all, you can recognize the person speaking over a telephone. Single channel, low bandwidth, digitized audio. This should be nothing compared to the data and bandwidth requirements of video.

Even after years of research and development there are only a handful of speech synthesizers that come close to reality. And they are extremely limited, carefully tailored algorithms. I expect a proper speech synthesizer to be able to accurately emulate any human being. If I want it to do Katharine Hepburn, it should sound exactly like I would expect. If I want Sean Connery, that is what I should get.

I should be able to simulate age, gender and accents. I should be able to convey emotion: fear, rage, lust. I should be able to yell or whisper. And foreign languages would be no problem.

Current speech synthesis cannot even get the prosody (rhythm, stress and intonation) of simple sentences right. I expect a proper speech synthesizer to be able to sing. I see nothing unreasonable in requesting my Replicator to produce a rendition of The Pirates of Penzance as performed by Sean Connery.

In short, I may have a crew of hundreds of animators to synthesize Shrek, but I still need Mike Myers, Eddie Murphy and Cameron Diaz. And the voice parts represent a tiny, tiny amount of data.

* * *

The flip side of this is speech recognition systems. After years of research, they are also a pitiful shadow of what they need to be. Again, we have a tiny amount of data to deal with. No more than eight kilobytes or so per second for digitized audio, like over a telephone. To be rendered into text at roughly fifteen bytes per second: ordinary conversational speech runs around 150 words per minute, a pace any stenographer keeps up with easily.
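The arithmetic behind those numbers is worth writing out. A quick sketch, assuming standard telephone digitization and typical speech rates:

    # Back-of-envelope data rates for telephone-quality speech recognition.
    sample_rate_hz = 8000          # standard telephone sampling rate
    bytes_per_sample = 1           # 8-bit companded samples
    audio_bytes_per_sec = sample_rate_hz * bytes_per_sample      # 8000 B/s in

    words_per_minute = 150         # ordinary conversational speech
    bytes_per_word = 6             # about five letters plus a space
    text_bytes_per_sec = words_per_minute * bytes_per_word / 60  # 15 B/s out

    print(audio_bytes_per_sec / text_bytes_per_sec)  # a ratio of over 500 to 1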

Developers discuss such concepts as “speaker independent” voice recognition systems. I contend that there is no such thing. The key to recognizing speech is to have a huge library of speech to compare a given sample to. Our illusion of speaker independence is created by our experience with thousands of different people: we are rapidly choosing among thousands of speaker-dependent patterns. If I were to stand on a street corner and have one hundred random people pass by and speak one hundred random, single words to me I would probably be very slow and inaccurate in my understanding. If, however, each of those hundred random people said a sentence, my brain would be able to pull out age, gender, ethnicity, accent and other factors which would be used to tailor my expectations and make my recognition system much more accurate.
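A toy sketch of that claim: “speaker independence” as a race among many speaker-dependent template libraries, with the best match anywhere winning. The feature vectors and distance measure here are stand-ins for a real acoustic front end, not any actual recognizer.

    import numpy as np

    def recognize(sample, speaker_libraries):
        # Search every speaker's personal template set; whichever
        # template matches best determines both the word and, as a
        # side effect, a guess at the kind of speaker.
        best = (float("inf"), None, None)
        for speaker, templates in speaker_libraries.items():
            for word, template in templates.items():
                distance = np.linalg.norm(sample - template)
                if distance < best[0]:
                    best = (distance, speaker, word)
        return best  # (distance, inferred speaker, recognized word)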

My three-year-old grandson is in the process of building such a universal library for speech recognition. Everyone he hears speak, either directly to him, or in the background, or on television adds to his repertoire of context and recognizable words. He may not know meaning, spelling or anything else, but he will certainly be able to tell when his mother, father or SpongeBob are discussing “crab fishing”. As far as he is concerned: same sounds, different speakers - no problem.

Knowledge of the speaker and the expectations that your brain derives from that knowledge are what allow us to pull a single voice out of cocktail party chatter or simple background noise. Tailoring expectations is also what makes it so much easier to understand someone when you can see their lips. The broader your experience and the larger your exposure to different speakers, the more likely it is that your brain will be able to choose a good template to match against the sounds it hears.

* * *

I believe that, in the long run, speech recognition and synthesis systems will be parts of a single whole. The speech recognition portion would have examples of Katharine Hepburn to tailor its expectations when analyzing her speech. The speech synthesis would be adaptable and would iteratively feed samples into the recognition system to see how well it approximated the expectations. Just the way a voice actor listens and experiments to learn an accent.
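A sketch of that closed loop, with hypothetical synthesize and score_against hooks standing in for the two halves. The parameters are nudged at random and kept whenever the recognizer judges the output a closer imitation, a voice actor's listen-and-retry cycle in miniature:

    import random

    def imitate(target_samples, synthesize, score_against, steps=100):
        params = [0.5] * 8                 # arbitrary starting parameters
        best_score = float("-inf")
        best_params = params
        for _ in range(steps):
            trial = [p + random.gauss(0, 0.05) for p in best_params]
            audio = synthesize(trial)      # hypothetical synthesis hook
            score = score_against(audio, target_samples)
            if score > best_score:         # keep the closer imitation
                best_score, best_params = score, trial
        return best_params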

Adaptive systems such as this would make the man-machine interaction much more reliable by allowing the machine to automatically switch to the language pattern most easily understood by the user - for both speaking and listening. This would minimize the misunderstandings in both directions.

Vision Systems

Human vision is very good at spotting important details. We are descended from millions of generations of individuals who were not eaten by the saber-toothed cat. We can spot tiny clues to larger patterns hidden in the bushes. Sometimes, we see things that aren’t really there, but this is the safe option. To fail to see something that really was there could be fatal. As long as we are not too jumpy to find a meal and eat it, we will do OK.

This vision system is very good, but we can have some fun with it. Play some cute tricks. Every grade school child has seen a cartoon flip-book. Make your own little animated character. One still frame after another, your brain interprets it as a moving image. Over the past century, we have taken this trick and built entire industries around it. Movies and Television and Video Games.

The problem is that this flip-book trick bears essentially no relation to what is actually going on in our visual perception. Yes, it usually works. No, it does not really allow us to perceive all we could.

Our retina is designed to detect changes in light level. We see things when edges pass across the photo-receptors in the eye. If there were no movement we would lose the ability to see any patterns at all within a few seconds as our neurons reached a stable state. Therefore, our eyes are designed to introduce motion themselves: tiny tremors known as micro-saccades ensure that there are always edges or changes in brightness moving across our field of vision. The patterns and timing of those changes, coupled with the direction of the saccade, are what allow us to recognize objects.
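A minimal sketch of detection built on that principle, change rather than absolute level. A static scene fades to nothing; any moving edge lights up. The frames here are assumed to be 2-D arrays of 8-bit luminance values:

    import numpy as np

    def temporal_edges(prev_frame, next_frame, threshold=8):
        # A receptor "fires" only where brightness changed between the
        # two instants, so only moving edges survive.
        diff = next_frame.astype(np.int16) - prev_frame.astype(np.int16)
        return np.abs(diff) > threshold    # boolean map of moving edges

    # A micro-saccade can be mimicked by shifting an image a pixel or
    # two and differencing it against itself: the static pattern
    # reappears as a field of edges.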

Unlike any current still or video camera, our eyes have no shutter and the spacing of the light-sensitive elements is not uniform. The motion-sensing characteristic is what allows our eyes to function without a shutter. And the distribution of cells in the retina allows us to see fine detail as well as a wide-angle view simultaneously, without resorting to the zoom-lens concept. Even when we are concentrating on fine detail in an object, our peripheral vision is protecting us from lurking cats.

These fundamental differences lead me to believe that the one-still-image-after-the-other movie approach will be replaced with a more appropriately designed technology.

One thing to remember is that the motion detection within the eye is really fast. On the order of one hundred times faster than the frame rate in a movie. And your eye position is an interactive part of the perceptual process. If I see an edge move across the movie screen in my peripheral vision I will register one thing. But if I am looking straight at it the edge will skip over so many receptors that I usually just take it for granted that it moved smoothly. In other words, I can tell it is an illusion. Even at very fast frame rates, like an IMAX movie. A big part of the problem is that film tricks like motion blur (as the shutter speed reaches the maximum for the frame rate) just introduce blurred blobs to the retina. There is no sense of direction, just there and not there.
The retina is designed to help figure out what direction an object is moving, and it does so in conjunction with the pattern of micro-saccades and the movement of your body. This is what allows a batter to hit a major-league fastball. He can actually see the seams and accurately judge the motion of a spinning, 2.86 inch diameter sphere coming toward him at over 90 miles per hour. The total time between the pitch and the time the bat must make contact is less than half a second. There is little chance that anyone could hit a fastball if they were allowed to see only a video or movie of the pitch. It is easy to call it after the fact. But actually seeing it and getting the swing down in time is one of the greatest challenges in all of sport.
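The timing is easy to verify. A quick check of the fastball numbers, ignoring the pitcher's release point a few feet in front of the rubber:

    mph_to_fps = 5280 / 3600                 # 1 mph is about 1.47 ft/s
    pitch_speed_fps = 90 * mph_to_fps        # 132 ft/s
    mound_to_plate_ft = 60.5                 # regulation pitching distance
    flight_time_s = mound_to_plate_ft / pitch_speed_fps
    print(f"{flight_time_s:.2f} s")          # about 0.46 s, under half a second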

* * *

The digitization of video images leaves a lot to be desired. Black and white (panchromatic) movie film had excellent sensitivity, resolution and dynamic range. When this was scanned to create old analog broadcast television signals using an old-style “film chain”, much of this dynamic range was preserved. In particular, blacks were black and showed a smooth transition to white.

The same cannot be said of modern digital signals and displays, even “high definition” ones. Invariably the sales and marketing hype will emphasize the brightness of the image, or the sharpness of selected scenes. Much of this “Wow” factor comes from unnatural adjustments of color saturation to make the customer think they have been missing something with the older technology.

One key to spotting the limitations that I am talking about is to observe scenes with highlights and deep shadows. Invariably, the shadow will exhibit a “posterization” effect: you can see the contours of digitization steps where the intensity changes by a single integer step. Furthermore, you may be able to spot a “blockiness” in the shadows instead of smooth contours. This is an artifact of the MPEG compression algorithm. Dark shadow areas are also subject to a “crawling” effect caused by slight variations in the way the MPEG algorithm renders the region from one frame to the next. I contend that the presence of these types of artifacts indicates that this is a technology where the compromises required to “get it to market” have limited the range of material that can be produced for the medium. New productions won’t suffer because the directors and cameramen know that shadows don’t work. And the old films have now become incompatible with the new media.
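The posterization effect is easy to reproduce. A sketch that quantizes a smooth gradient confined to the darkest two percent of the range, the way a deep shadow is:

    import numpy as np

    def posterize(luminance, bits=8):
        # Snap a 0.0-1.0 luminance ramp onto the nearest of 2**bits codes,
        # as any integer-valued digital pipeline must.
        levels = 2 ** bits
        return np.round(luminance * (levels - 1)) / (levels - 1)

    shadow = np.linspace(0.0, 0.02, 1000)    # smooth ramp, deep shadow only
    print(len(np.unique(posterize(shadow)))) # collapses to just 6 codes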

This incompatibility is far deeper and more fundamental than the much more obvious and annoying things such as different frame rates and different aspect ratios. All craftsmen strive to achieve quality work within the limitations of their tools. The extraordinary effects that are achieved in one medium may be lost in another. Film makers who, for example, use only the center third of their frame simply because it might eventually be shown on TV are doing a disservice to both their vision and their audience.

* * *

There are no synthetic vision systems that take advantage of either the shutterless concept for motion sensing or the fovea-based idea to give simultaneous zoom and wide-angle performance. All because the conventional wisdom says that perception needs a static image of the whole scene. And the flip-book idea is good enough for movies.

* * *

Grade school children are also taught all about primary colors and color wheels. Red, green and blue: primary colors of light. Magenta, cyan and yellow: primary colors of pigment. Simple concepts. It is how color TV screens and computer monitors work. It is how digital cameras work. It is how four color (with the addition of black ink) offset printing works. It is how color film and movies work.

The only problem is that it is not how our vision system really works. We are told that there are red-, green- and blue-sensitive cones in our retinas. These are differentiated by three different photopigments within the cells. Upon closer examination, that is only true in the broadest sense. Ten percent of men have only two different working pigments, thus exhibiting red/green color blindness. Up to fifty percent of women have a genetic mutation that produces four different pigments, thus yielding better ability to distinguish subtly different colors.

Many animals such as birds not only have four different photopigments, their cones include specialized droplets of colored oil that narrow the spectral sensitivity of their cones and add to the ability to resolve subtle variations in color. One reason pets do not respond to photographs or television as we might expect is that they perceive colors differently. An image that appears photo-realistic to us will have a cartoon quality to the animals.

No matter how much I fiddle with the white balance of my camera or the gamma correction of my monitor, I will never be able to come up with a setting that allows my wife and me to agree on a color match. The trichromatic color technology is fundamentally flawed and needs to be revisited in a thoughtful way. We need to transition to full-spectral imaging without an industry-wide upheaval.

* * *

Our poor, abused grade school children are also taught all about stereo vision and depth perception. It seems obvious. You have two eyes. The angle between the two as you focus on an object gives you its distance. Ties right in with your geometry class. The only problem is that the effect is far too small to be of much use beyond ten feet or so.
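The geometry bears this out. A quick calculation of the vergence angle for a typical interpupillary distance of about 65 millimeters:

    import math

    def vergence_angle_deg(distance_m, ipd_m=0.065):
        # Angle between the two lines of sight converging on a point.
        return math.degrees(2 * math.atan((ipd_m / 2) / distance_m))

    for d in (0.3, 1.0, 3.0, 10.0, 30.0):
        print(f"{d:5.1f} m : {vergence_angle_deg(d):5.2f} deg")
    # 0.3 m: 12.4 deg; 3 m (about ten feet): 1.2 deg; 30 m: 0.12 deg.
    # The cue all but vanishes past a few meters.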

A much more important effect is the parallax of near-field objects against the distant background. You get two slightly different views and perceive it as depth. You can even see the effect yourself with a stereoscope or View-Master.

These stereo effects are all valid, but they do not tell the whole story. You can get depth perception with only one eye. All you need is near-field objects and some motion. When you drive a car your head moves around. The hood ornament or fender or some dirt on the windshield is all that is needed. Using only one eye you can judge distances, park properly, etc. Your brain is fully capable of figuring it out with little training.
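Compare the size of that motion cue with the vergence angles above. A small-angle sketch of the parallax from a modest head movement, against a background effectively at infinity:

    import math

    def parallax_shift_deg(baseline_m, near_m):
        # Angular slide of a near object against a distant background
        # when the viewpoint moves sideways by baseline_m (shift = b/d).
        return math.degrees(baseline_m / near_m)

    # A 5 cm head movement slides an object 2 m away by about 1.4 degrees
    # against the horizon, far above the arcminute or so the eye resolves.
    print(f"{parallax_shift_deg(0.05, 2.0):.2f} deg")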

I have observed the way a cat’s eyes work. In particular, they tend to have eyebrows and whiskers that droop off to the side of their eyes. A little thought on the matter yields the realization that whiskers as near-field objects and micro-saccades give a cat depth perception in their peripheral vision using only one eye. In other words, a cat can be intently stalking a meal, looking straight ahead, and still be aware of exactly how far it is to the nearby branch it is passing. The combination of motion-sensitive peripheral-vision (non-foveal) photoreceptors, micro-saccades, whiskers, the target object and the background gives a tremendous amount of information. Processed by an astoundingly capable visual cortex, this information allows a level of perception only hinted at by the grade school explanation.

There are many other things at work here. Unlike the modern photographic approach, nature has not attempted to keep the field of vision flat. Distortions arise in the single-element lens and in the spherical curve of the retina. This is not a bad thing - rather it is used to advantage to gain additional information about a scene. The eye rotates about an axis between the lens and retina. As the eye moves, these distortions will help to accentuate and outline nearer objects against the background.

Again, the problem with this misunderstanding of visual perception is that modern technology is only taking advantage of a tiny part of the capabilities inherent in all of us.

* * *

These observations have wide-ranging implications. What will an advanced generation of display device look like? How can we make objects with full-spectrum controlled color? Kind of like some sort of super-paint. How will this affect art? What about this interactive, non-static motion-sensing business? How can I design my art so that I control your perceptions and draw attention to certain parts on a consistent basis?

What can this tell us about pattern recognition? Things oriented at odd angles. Floating in zero gravity.

What about facial recognition? I can easily spot my wife in a crowd. It is harder to do in a picture of her in a crowd, since I don’t get any motion cues. It is really hard in a video of her in a crowd because the camera’s point of view is fixed and the resolution is very low. Unlike the real world, focusing on a particular point on the screen doesn’t make that area any clearer.

Implications for symbology:  Writing, fonts, markings.

Normal writing has a tremendous amount of redundancy. Words are written in a much more complex way than they need to be to convey the minimum information. Most English words, for example, can be distinguished from one another by knowing only the letters they contain. The ordering of the letters just adds more, redundant information.

These words are easy to read.
Eehst dorsw aer aesy ot ader.

This is the principle behind the operation of the court reporter’s stenograph machine where each word is formed by simultaneously pressing certain keys. 
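The claim is easy to test against a word list: group words by their sorted letters and count the “signatures” shared by more than one word. In a sketch like the one below, most signatures pin down a single word, and the collisions that do occur (read/dear/dare) are exactly where letter order earns its keep:

    from collections import defaultdict

    def anagram_collisions(words):
        # Group a word list by letter content alone and report only
        # the signatures shared by more than one word.
        groups = defaultdict(list)
        for word in words:
            groups["".join(sorted(word.lower()))].append(word)
        return {sig: ws for sig, ws in groups.items() if len(ws) > 1}

    sample = ["these", "words", "are", "easy", "to", "read",
              "dear", "dare", "listen", "silent"]
    print(anagram_collisions(sample))
    # {'ader': ['read', 'dear', 'dare'], 'eilnst': ['listen', 'silent']}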

How can we combine this observation with what we know about vision systems?  How can we make fonts or markings more easily recognizable or less ambiguous?  If we contemplate weightlessness, how can we make markings easy to read no matter what their orientation?


An early Artificial Intelligence system, based on a neural network, was designed and trained to recognize different aircraft as either NATO or Soviet.  In the lab it appeared to perform well, but in the field it failed miserably.

The training was done using photographs from Jane’s Aircraft Recognition Guide.  Further research showed that in the training photos, NATO aircraft predominantly flew right and the Soviet planes left.  The neural net, having no concept of the cold war, was simply figuring out which side of the picture had the pointy end.
