Saturday, September 24, 2011

On Inventory Management

Abstract

Managing the availability of goods is key to the sustainable future of civilization.  Making better use of materials and resources, eliminating needless duplication, and improving reuse and recycling can improve the lives of individuals, families and communities.  This essay examines the flow and availability of goods from the standpoint of inventory management systems.  I give examples of current systems, look at an ideal goal, and propose steps that could lead to both immediate and long-term benefits.  


This essay is a work-in-progress and will be updated periodically.  Other  related essays concerning image processing and object recognition will be posted as they reach maturity.  

Background

In my youth, my father had a machine shop and lab in the five-car garage area of our house.  We parked the cars in the driveway.  As I grew up, I spent much time exploring the boxes, bins, cabinets, shelves and assorted containers, learning about the objects within.  I learned to use the tools, built projects and conducted experiments.  By the time I was an inquisitive nine-year-old, I had examined almost every object and attempted to divine its use.  I read the Newark and Grainger catalogs as I fell asleep.  I had a strong vocabulary and could describe and name most any tool or part. 

In particular, I could accurately describe the location of almost any item in the shop.  Many items had multiple homes, as duplicates were encouraged and often grouped by project instead of into simple bins.  I could clean a work area and put tools, parts and equipment back in their normal places.  I could disassemble and reassemble things ranging from toys to lawnmower engines. 

I was, in effect, an inventory manager with skills superior to any "professional" system in existence today.

Let us examine the features and requirements of inventory management and suggest techniques that might bring the capabilities in the high-tech world up to the level of a small child.

Requirements

Very simply, we must keep track of objects in time and space. 

We must have some general idea of what we mean by an "object", and ways of recognizing and remembering properties.  This implies a data entry system with a method of rapidly assigning properties to objects.  These properties may be descriptions from catalogs or data sheets, observed properties such as size or color, and arbitrary manually entered information. 
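
As a sketch of what such an object record might look like (the names and fields here are illustrative, not a proposed standard), consider:

    from dataclasses import dataclass, field

    @dataclass
    class InventoryObject:
        object_id: str                 # identity of this particular object
        properties: dict = field(default_factory=dict)   # open-ended property bag

    item = InventoryObject("obj-000731")
    item.properties.update({
        "description": "needle-nose pliers",    # from a catalog entry
        "length_mm": 160,                       # observed
        "color": "red",                         # observed
        "note": "handle re-dipped in plastic",  # arbitrary manual entry
    })

The point is that the property set is open-ended: catalog data, observations and manual notes all land in the same place.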

Tracking, in its simplest form, involves only "Get this object from here and put it there" concepts.  Manual data entry and simple scanners might be sufficient as a first step.  An automated system would probably observe an area and recognize objects as they enter and leave. 

We should be able to answer questions like:
  • "Where is the nearest ...?"
  • "How many ... do we have?"
  • "How long have we had ...?"
  • "Where has ... been stored?"
  • "Is ... safe to handle?"
  • "Does ... need to be right side up?"
  • "What does ... attach to?"
  • "How does ... need to be stored?"
  • "Do we need to order more ... when this is used up?"
  • "Is ... more expensive than ...?"
  • "Is there anything special about ...?  Is it rare or valuable or dangerous or fragile?"
  • "What is ...?  What is it used for?"

The system must be so easy to use that it will be part of everyday life.  An assistant that can answer accurately when you wonder where you left the car keys.  A retail checkout system that does not need barcodes. 

The overall system must be tolerant of bad or conflicting data.  Over time everything should be generally self-correcting.

The system should not require a "Parts in Bins" organization.  There may be preferred locations, so that tools tend to wind up in toolboxes, but this should not be carried to extremes.

The database should:
  • allow for object identification
  • retain arbitrary properties
  • track current location and location history
  • group objects during storage or use
  • allow for assembly and disassembly of composite objects
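
A minimal sketch of a database meeting these requirements, using a relational form purely for brevity (every table and field name is illustrative only):

    import sqlite3

    db = sqlite3.connect("inventory.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS objects (
        object_id  TEXT PRIMARY KEY,
        parent_id  TEXT REFERENCES objects(object_id)  -- composite assemblies, groupings
    );
    CREATE TABLE IF NOT EXISTS properties (            -- arbitrary key/value properties
        object_id  TEXT REFERENCES objects(object_id),
        name       TEXT,
        value      TEXT
    );
    CREATE TABLE IF NOT EXISTS sightings (             -- location history; the latest
        object_id  TEXT REFERENCES objects(object_id), -- row is the current location
        location   TEXT,
        seen_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    """)

    # "Where has ... been stored?" becomes a one-line query:
    history = db.execute(
        "SELECT location, seen_at FROM sightings WHERE object_id = ? ORDER BY seen_at",
        ("obj-000731",)).fetchall()

Assembly and disassembly reduce to setting and clearing parent_id, and the Locations <--> Parts relationship stays many-to-many because location is a history of sightings rather than a single assigned bin.
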
Visual Object Recognition

Current barcode scanners beep when they successfully scan an object.  As far as I am concerned, a proper scanner will only beep when it sees an object that it does NOT recognize.  That is, we should eliminate the unnecessary confirmation noise.  Identification should be so accurate and so routine that the only things needing the user's attention are the true exceptions.  


A forthcoming essay will focus on the requirements of visual object recognition systems.


Requirements range from the most basic detection of visual features within a background of clutter all the way through comprehensive integration with a central object-location-tracking database.  The latter is required to ensure accurate identification of a particular object, not just the kind of object.  


Selecting the pencil lying on the notepad in front of you is almost always preferable to selecting an identical pencil from the pencil holder.  The history of the object is as important as its location, and, in general, history requires the combined recognition and tracking of multiple visual sensors.  

In a world of ubiquitous, distributed visual recognition systems such as foveal cameras, each camera develops a learned history of particular features that compose and are associated with particular objects.  The different histories ("experiences") of the cameras mean that their libraries of recognition templates will be unique.  And yet, we want to be able to assign the same "identity" to objects as they move from one camera's area to the next.  This implies that there should be an "object template description" that is both compact and sufficient to (more or less) uniquely identify a particular class of object.  This data is what would normally be communicated to the central object-location database, and to other nearby cameras to aid in tracking particular objects from one station to the next.  
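
To make the idea concrete, a compact template might look something like this (a sketch only; the fields are stand-ins for whatever features the cameras actually learn):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ObjectTemplate:
        class_label: str                  # "pencil", "red suit", ...
        size_mm: tuple                    # rough bounding dimensions
        dominant_colors: tuple            # a few quantized hues
        distinguishing_marks: tuple = ()  # "chewed end", "shiny black boots", ...

        def could_match(self, observed: "ObjectTemplate") -> bool:
            # Deliberately loose: "narrowing it down" is good enough, and the
            # location-tracking database resolves the remaining ambiguity.
            return (self.class_label == observed.class_label
                    and set(self.distinguishing_marks)
                        <= set(observed.distinguishing_marks))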

Consider trying to locate a particular individual using the cameras in a shopping mall.  Start with a general description such as "short, fat guy in a red suit".  This is actually a LOT of information expressed very succinctly.  It lops out most of the objects in your recognition database and allows attention to be devoted to the most likely suspects.  Maybe a candidate is seen from one point of view and you add to the description: "he has shiny black boots".  Motion tracking and adjacency ensure that this is the same individual.  You are building a more complete description.  Another view: "He has a white beard".  Multiple observers watching through different cameras share the ability to casually recognize these high-level features and need ONLY the general location and the compact description to be reasonably assured of success. 
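
The narrowing process itself is trivial to express; the hard part is the recognition that supplies the attributes.  A sketch, with invented data:

    candidates = [
        {"height": "short", "build": "fat",  "suit": "red", "boots": "shiny black"},
        {"height": "short", "build": "fat",  "suit": "red", "boots": "brown"},
        {"height": "tall",  "build": "thin", "suit": "red"},
    ]

    # Each new observation is one more attribute that candidates must satisfy.
    for key, value in [("height", "short"), ("build", "fat"),
                       ("suit", "red"), ("boots", "shiny black")]:
        candidates = [c for c in candidates if c.get(key) == value]
        print(key, "->", len(candidates), "candidates remain")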


Modern Examples

The inventory at a WalMart retail store is intended to be in near-constant motion.  Trucks with assorted merchandise arrive at the back doors.  Products are rapidly distributed to essentially arbitrary locations within the store for presentation to customers.  Customers roam the store selecting desired items.  Items thus selected are scanned, purchased and removed through the front doors.  Approximate item-counts are maintained by using a "delivered minus sold" algorithm, but this becomes so inaccurate over time that periodic physical inventories and complete overall reorganizations of the store are necessary. 

If I visit a hardware store I usually expect to be able to find a knowledgeable employee and say something like "I need a bigger one of these", or "This wore out and pieces broke off.  Do you have any more?", or "I need to mount this on a brick wall.  What do I use to do it?"  The employee is expected to be able to recognize my object and its use, match it against items in his experience using arbitrary criteria, and give me a meaningful response within a few seconds. 

Typical large companies manage warehouses and stock rooms with bins, shelves, cabinets, etc. and try to ensure that all like objects are collected in one place.  This facilitates locating desired items, counting stock, providing an appropriate storage environment and ensuring that replacements are ordered in a timely manner.  Frequently, in-house part numbers are created and assigned to the storage locations to help with this process.  Unfortunately, there is often a many-to-many relationship: many different vendors may supply the product that winds up in a particular bin, and the same exact product may need to be stored in different locations due to convenience or necessity.  The computerized inventory system most likely attempts to enforce an idealized "one part number, one location, one quantity" paradigm.  More advanced or customized systems tend to be increasingly unwieldy due to special cases and exceptions and the need for more operator training. 

My wife appears to have a diametrically opposite approach.  I have often observed an apparent equal probability that a particular item will be in any of the boxes, bins, shelves, cabinets, closets, and drawers in the house.  This is actually not an outlandish situation if you have a good memory and are able to accurately and rapidly communicate enquiries and responses: "Where's the glue?" "Elmer's is in the bin under the bed.  Contact cement is on the top shelf of the linen closet."

On the International Space Station, inventory items tend to be stored in bags within bags.  In the micro-gravity environment there is generally no need for rigid containers.  Objects can be stored compactly in collapsible bags, packed into storage spaces, and gently secured against air currents and slight accelerations using bungee cords.  Inventory access problems usually revolve around trying to figure out "which bag?", "where?", and "how do I get to it?".  This frequently involves lengthy conversations with the ground controllers who are responsible for trying to record ongoing activity and look up records from past operations. 

Homer Simpson has a garage full of lawn and garden equipment.  It is all labeled "Property of Ned Flanders".  The traditional view of this satire is that Homer is a thief who never returns items that he borrows.  I, however, would contend that Ned has simply found a way to store his excess inventory in Homer's garage.

Other Recognition Techniques

In these essays, I focus on visual recognition systems.  As a child, I was not so limited.  We had recently moved, so much of the material in the garage was packed in various boxes.  One day I was going through a box of vacuum tubes -- ancient electronic components that are kind of like fist-sized, glass transistors.  All of these tubes were wrapped in newspaper packing material.  As I picked up one of the wrapped objects, I knew immediately that it was not a tube.  It was a bottle containing about a half-pound of mercury.  In fact, I knew that it was mercury before I unwrapped it.  And I had no prior knowledge that we even had a container of mercury. 

This is an example of what I consider a proper object recognition system in operation.  The object was manipulated by a sensitive tactile handling system.  The manipulation system safely transitioned from working with glass containers of vacuum to glass containers of mercury.  The feedback from the system instantly provided object-density estimates.  Movements allowed me to detect that the object was not only NOT solid, but that the contents were fluid.  Silence during the manipulation allowed me to deduce that this was not a jar of washers or nuts.  The fact that, during rotation, the center of mass did not shift as one would expect with a granular fluid also helped narrow the identification. 

The tentative identification became clearer as the bottle was unwrapped.  During this short time the entire system changed.  Dropping a newspaper-wrapped vacuum tube is a non-event.  Dropping a bottle of mercury is a whole different matter.  Even in the days before California-inspired paranoia concerning heavy metals, a child could be concerned about making a Big Mess.  My casual attitude became much more focused.  My grip became firmer.  My posture became more stable.  In short, the discovery led me from idle curiosity to attentive excitement in just a few seconds.  All without a single visual cue. 

Implementation

In light of this background, I contend that inventory management should be an automated, continuous, interactive process.  By "interactive" I mean that the inventory management system should physically interact with the items that it is managing, much as I did as a child, or as the hardware store employee does to become good at his job. 

This would allow the updating of inventory data to be treated as a routine maintenance operation instead of an inefficient, disruptive quarterly or annual event.  Managing objects in boxes (or boxes of objects) is only sufficient if there is a complete prior understanding of the actual, individual objects.  

Applications

Although I am describing this as Inventory Management, there are many applications.  The Inventory that we are managing need not be simply nuts and bolts.  For example:
  • Identify people and track their movement
  • Production operations in a manufacturing facility.  Time and Motion studies.
  • Ensuring "Appropriate Redundancy" of tools and supplies.  Not too many and not too few.
  • Transportation, Cargo and Freight operations.
  • Restaurants, Food Services and other Just-In-Time manufacturing
  • Produce tracking for food safety
  • Infrastructure Maintenance - Buildings and Utilities
  • Construction Industry - On-site Manufacturing and assembly
  • Health Care and Pharmaceuticals
  • Records Management - Customers, Patients, ISO 9000, etc.
  • Libraries and Collections 
And maybe we turn the whole thing around.  Make a geolocation system by mounting cameras on some of the objects and use them to watch the surroundings.  No more reliance on a fixed infrastructure.  Recognition of objects and recognition of places are just two sides of the same coin, going into the same database.  

Notes

Keep track of items in time and space.
Object identification - data entry, description, photo(s), size, mass, etc.
Object tracking - manual / automated
Object status -
            storage in bag, box, etc.; conditions (temperature, etc.)
            usage - quantity is partially used (count in / count out) liquids, aerosols, etc.
            assembly - object becomes part of something else
            disposal / recycling / disassembly - including damaged or incomplete items
            movement - new object / new location
Object query -
            find nearest
            find totals
            find expired (drugs, milk, etc.)
            find history - objects / locations
Must be so easy to use that it is ALWAYS used for Get and Store operations
Bill of Materials - Object associations or groupings
Nesting - recursive objects within objects
Inventory - continuous update / verification of object info when any storage location is accessed
            best if automated
            Important to detect unexpected objects
            Automated recognition.
Database -
            Object identification
            Object history
Photographic recognition
            Introduction process - controlled observation and examination
                        Multiple views and lighting
                        Unique feature extraction
                        Associate with similar objects
                        Establish photographic scale and allow scale invariance
                        Record markings or other identification features
            Do not require arbitrary categorization. 
                        Let the recognition engine make its own categories or groups.
            Do not expect perfect identification
                        "Narrowing it down" should be good enough
                        Combine with location history to complete the identification
Do not necessarily require "Parts in Bins".  Items can be anywhere.
            Preferred storage locations may help ensure (toolboxes, etc.) are properly stocked
Locations <--> Parts should be a Many-to-Many relationship
Tolerant of bad / conflicting data.  Generally self-correcting.
Examples
            Hardware store
            Borrow a cup of Sugar
            Craigslist
            Ned Flanders
            Tracking Santa Claus
Maintain orientation.  Don't spill it.
Disassembly -
            What is in it.  Hazardous?
            What is this part of?
            Survival inventory.  Motors contain coils of wire...
Expiration dates.  Use oldest first vs. Use freshest first.  Frozen bread anecdote.

Object history and current status.  A maid that always moves the dishes from the table to the dishwasher and THEN to the cabinet. 

Automated explorer manipulates objects to conduct inventory and cataloging.  Dangers include hazardous chemicals, high voltages, heavy/unstable objects, sharp tools, firearms, fragile objects, falls from high places, rotating machinery, buttons, switches and knobs, insects and pets.

Current systems and limitations
            Hospitality industry
                        Large number of identically furnished rooms
                        Maid service touches every object daily
                        Common maintenance, purchasing and disposal operations.
            Apartments, Condos and Tract homes
                        Many redundant objects
                        Progressively less commonality
                        Seasonal storage

The maid knows:
            Clean the nightstand
            Leave the lamp
            Wash the dishes
            Do not wash the paperback book

No communication regarding object location or availability
          You have to ask for it before the system will tell you anything
          System should be proactive and push appropriate information to users in advance of need

CraigsList
            Arbitrary descriptions are a problem
            Locations are hidden until a query is made

Monday, September 12, 2011

What is a Source File?

Abstract
Source Files are ubiquitous in the world of software development. Little thought is given to their core technology concept, which is now more than fifty years old. The transition from a desktop-PC to a tablet-based computing environment represents a sweeping paradigm shift for software developers. Rethinking the true requirements of software from the Source level onward will allow more modern tools to enter the development arena. This essay presents an analysis of the situation from a historical perspective and proposes new methodologies for use in the future.

Introduction
The concept of the Source File is so fundamental and so ingrained in the lives of software developers that there is seldom any thought given to its true function.

Historically, the program source was simply the human-readable form of input to a language compiler, assembler or translator. The most important aspect of Source Code was that it could be fed to the language processor without generating errors. This meant that virtually everything about the process of program creation was geared toward simplifying the task for the compiler. Combined with the early data entry methods - Hollerith cards, teletype machines and paper tape - the idea of a program source became established.

And now we find ourselves, fifty or more years later, expecting program source to look like it could be printed on a teletype machine. Eighty columns (or so). Fixed pitch fonts. Nothing but ASCII characters.
The scope of programming requirements has changed radically from the early days of computing.

Programming languages and methodologies have evolved in many directions - some useful, others not so much. Our largest projects still suffer from the "simplify it for the compiler" mindset. A prime example is the ever-present header file used by most modern programming tools. These redundant and hard-to-maintain files provide descriptions of the required linkages in modular programming. Newer programming environments attempt to deal with the header/linkage problem with added features or conventions that allow the tools to handle much of this drudgery. Handling the header problem is only one of many steps that need to be taken to convert from a "what is easiest for the compiler" to a "what is easiest and least error-prone for the programmer" mentality.

What Are Source Files?
A Source File is a collection of text elements (variously known as syntactic symbols or tokens) that fall into four classes:
  • Human language, as in comments 
  • Directives, usually for the top-level language processor that uses the particular Source
  • Programming Language, the statements or syntactic constructs that we think of as the program itself  
  • Literals, as in data which will be manipulated by the program but which is encoded in some fashion and stored within the Source.
Note that String Literals almost always contain some form of data that could be described as Source Code. For example, the classic "Hello, World!" program used in beginning programming classes is a program which manipulates data in another language - in this case, English. Most string manipulation in modern programs exists solely for building elements of another programming language, which will be fed to another language processor.

In addition, many Source Files today contain sections written in many different programming languages.  For example, an HTML file might include CSS, JavaScript, HTML and English.  Modern text editors attempt (with varying degrees of success) to help sort all this out for the developer.  Highlighting, auto-completion and various warnings are used to help prevent the (sometimes spectacular) errors that result from feeding one language into the processor that is expecting another.  
I will make a brief sidebar to rail against the very poor implementation of what is referred to as "internationalization".  If a modern application needs to be relevant to a world-wide audience, the developers are expected to isolate all locale-specific aspects into separate "resource" files, to use localizable operating system calls that can never be fully evaluated on the developer's system, and to use string manipulation tools supporting character sets, display conventions and keyboard modes that also cannot be tested on the developer's equipment.

Internationalization should be inherent in the program development process - not scabbed-on to a final product.  International test cases should always be visible to the developer and the operation and aesthetics of the product should be visible in a simple and robust manner at all times. To achieve this, the adaptive, multi-lingual keyboards on tablet computers, as well as the world-wide collaboration techniques discussed here become critically important.

Consider the notational nightmare that results from trying to pass a literal CSS property name to a JavaScript function that is being invoked from an HTML event handler.  The nesting of escaped quotation marks converts a (relatively) simple concept into something that requires deep understanding by the programmer, and is thus virtually impossible for independent developers to modify or verify.
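
As a sketch of the problem (the setCss function and the markup are invented for illustration), here is what happens when Python is used to generate the HTML that carries the JavaScript that carries the CSS:

    # Three languages, three quoting conventions, one line of output.
    prop = "background-color"                         # the literal CSS property name
    handler = f"setCss(this, '{prop}', 'red')"        # JavaScript, single-quoted
    tag = f'<div onclick="{handler}">click me</div>'  # HTML attribute, double-quoted
    print(tag)
    # <div onclick="setCss(this, 'background-color', 'red')">click me</div>
    # If prop ever contains an apostrophe, the page breaks silently.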

Unfortunately, one only gets help from the editor in static situations such as the hand-written web page.  If the program needs to dynamically generate code (such as SQL queries) it becomes much more difficult to test, debug and verify.  There are virtually no tools that address this dynamic code creation problem, although it is one of the most common tasks performed today.  Many ad-hoc build-a-string techniques are used, but never with any consistency or robust certification.
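
The SQL case shows both the ad-hoc technique and the small discipline that avoids it.  A sketch, using Python's built-in sqlite3 purely as a stand-in for any database interface:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parts (name TEXT, bin TEXT)")

    # Ad-hoc build-a-string: malformed as soon as the value contains a quote,
    # and no tool can certify the SQL that will actually be generated.
    name = "workman's glove"
    query = "SELECT bin FROM parts WHERE name = '" + name + "'"   # broken SQL

    # Parameterized form: the SQL skeleton is a verifiable constant, and the
    # data never passes through the quoting machinery at all.
    rows = conn.execute("SELECT bin FROM parts WHERE name = ?", (name,)).fetchall()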

How Are Source Files Used?

Source files seem to be used in four distinct but interrelated ways.
  • Design
In the Design phase the source is used as part of the collaboration and documentation for the project.  When properly done, this provides a good part of the framework that multiple developers use to ensure accurate understanding of the requirements and to convey details of the implementation as it progresses.
  • Edit
The Edit phase is where most human interaction with the source occurs.  This is where much current effort in man-machine interaction is focused.  Making better editors with greater knowledge of program internals and the possible intentions of the programmer has been the goal since the first Integrated Development Environments (IDEs) were created.

Big, fancy IDEs that need big, fancy displays are moving in the wrong direction.  As the computing world moves away from the PC-centric toward the tablet-based future, our interaction will become much more focused.  The display-and-keyboard will adapt to the operations that we really want to perform, and will retreat from the overwhelming offering of everything we might want to do.  Adaptive keyboards, touch screens and more compact, high resolution presentations provide opportunities to completely rethink the developer interface. 
  • Build
Ultimately, the purpose of the program source is still to create a working application program.  This means that the Programming Language parts of the Source must be submitted to a language processor.  In many cases, this is not a trivial operation.  Many separate functions or modules must be combined, their linkages validated and a final product produced.  This build process can be time-consuming and can result in errors that are the ultimate responsibility of diverse members of the development team.

Expecting the build process to be performed on individual desktop computers is something that needs to be reevaluated as we move into the 21st century era of cloud storage and cloud computing. 
  • Test
After a successful build developers generally expect to test the new application.  This may require transferring the newly created code to a test environment, which may be a particular hardware device or a software simulator.  There should exist a suite of test cases for the "finished" application.  Although some things can be automated, in many cases manual interaction and visual aesthetics will be important factors.  Handling these in a consistent manner, and ensuring that full testing is actually performed for each build or release candidate is a serious problem in the development process.  Maintaining and documenting both known-good and known-bad test cases is of critical importance and can be overwhelming for complex projects.

The source-level run-time debugger is one of the great advances in PC-based computing and is a major feature of all Integrated Development Environments.  Unfortunately, the multiple-programming-language nature of modern programs limits the actual usefulness of the Debugger.  While I occasionally use a debugger for certain compute-only functions, and neophyte programmers really benefit from the capability, I believe that the  problems of client-server interactions, dynamic scripts and cross-platform compatibility are more important challenges that cannot be addressed by a simple debugger.  Therefore, the debugger capability should not be viewed as being of critical importance when evaluating future development tools and environments.

I believe that every function or statement should have available an easily-accessible library of test cases.  This would be sort-of like using a current debugger to run to a breakpoint and then be able to examine and modify the data structures as they are processed.

Knowing the working-set (data structures that are accessed or modified) used by any particular function is of critical importance to verifying the correctness of a program.  In general, the compilers know this, although sometimes it cannot be determined until run-time.  Unfortunately, this critical knowledge is never made available to the developer.

The test-case library would take the place of ancillary test programs used during the development process.  Currently, such test programs are created, as-needed, by individual developers.  They are never documented, maintained or shared.  Even worse, they are often discarded once the developer feels that his function is "working".

Assertions in various programming languages are used to catch errors at run-time when selected data does not match expectations.  I would extend the Assertion concept by allowing the capture of a function's working-set - at run-time - and saving it as an addition to the test case library.  This allows the collection of robust sets of real-world data.  When the function is modified or replaced with code that should be equivalent, these collected test cases can provide input to an automated verification process.
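
A sketch of what that capture might look like (the decorator and all of its names are invented for illustration):

    import functools

    def capture_cases(library):
        """Record every real-world invocation of a function as a test case."""
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                result = fn(*args, **kwargs)
                library.setdefault(fn.__name__, []).append(
                    {"args": args, "kwargs": kwargs, "result": result})
                return result
            return inner
        return wrap

    cases = {}

    @capture_cases(cases)
    def month_length(year, month):
        days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
        if month == 2 and year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
            return 29
        return days[month - 1]

    # When the function is rewritten, every captured case verifies the replacement.
    def verify(replacement, recorded):
        for case in recorded:
            assert replacement(*case["args"], **case["kwargs"]) == case["result"]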

What Does the Future Hold?
 
"Keyboards" should be viewed as "Token Selectors" instead of letter-by-letter token builders. IDEs try to do this in some cases, but fail because of (1) the multiple-language problem, and (2) the "word-ish" nature of tokens needed to convey meaning to people.

A single font size doesn't fit all.  Nested structures could be displayed with smaller fonts, so you could literally zoom in to move deeper into the code.  Current tree-collapsing displays give an essentially useless all-or-nothing display.

Need more screen space?  Get a second iPad.  Properly done, this need not be more expensive than multiple monitors and should be much more useful in the general sense.

All source should reside in the cloud.  Current Version Control Systems make much ado about "Checking Out" and "Committing" changes to modules.  This makes it (intentionally) impossible for multiple developers to share work on individual modules.  Storing all development trees in the cloud and changing the checkout process from a "download it to my computer" to a "mark this group of changes as pending" in the cloud concept would be a great improvement.

All Build operations should be performed in the cloud.  There should be no need for any compute-intensive operations at a user's desktop.  This allows high-powered, dynamically-allocated resources to be brought to bear on what should be an independent background task. 

What Could Replace Source Files?

Given the ways current source files are used, the multiple-language problem, the need for collaborative work, and the trend away from desktop computing, it seems obvious that we are actually describing a Database application.  We have been using the computer's simple file system to store and access our programs, even though that has never been a particularly appropriate technique.

What is needed is a Cloud-based database storage system which is accessed by multiple Very Thin Clients (read iPad Apps) that provide the user interaction during the software development process.  This inherently allows world-wide collaboration among the development team.

The compile and build process is performed as a Cloud-based service which provides additional elements to the Database.  These elements would include error and diagnostic information, compiled code, linkage information and target application code.  Automated test cases could be run and results verified.  All these capabilities would be available to every development team member, without having to have separate instances of hardware, tools, environments, etc.
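
Purely as an illustration of the direction (the author's actual design, as noted below, is deliberately not relational, and every field name here is invented), a record in such a database might carry:

    # One "function" record in a source database; the build service appends
    # its own elements to the same record rather than to files on a desktop.
    function_record = {
        "id": "fn-0042",
        "language": "C",
        "tokens": ["int", "add", "(", "int", "a", ",", "int", "b", ")"],
        "comments": {"en": "Return the sum of a and b."},
        "test_cases": ["tc-0007", "tc-0019"],
        # Appended by the cloud build service:
        "diagnostics": [],
        "object_code": None,      # reference to compiled output
        "linkage": {"calls": [], "called_by": ["fn-0001"]},
    }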

I have some very specific ideas about how such a specialized software development database would be structured, implemented and accessed.  Of course, it has nothing to do with relational databases, SQL, or  traditional Client-Server techniques.

I Am Requesting Some Feedback

For now, I am interested in feedback from as diverse a set of developers as I can manage.

Specifically, please let me know how you develop your code.

What types of applications do you develop?

What hardware do you use?

What size screens do you need?

What applications do you need to run?  Editors? Compilers? Debuggers? Simulators? Browsers? Documentation tools? Email?  Chat?

How many things do you try to do at once?  How many things do you absolutely need to do at once?

If you were going to work using a tablet computer, what would you see as advantages?

What would be disadvantages, even given your idea of a perfect tablet-of-the-future?

What do you think would be impossible to do using a tablet?


Monday, August 8, 2011

Cargo Cults

This is a favorite quote, which I think is even more applicable today than it was 35 years ago.

In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas--he's the controller--and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land.
            – Richard Feynman
            Caltech Commencement Address, 1974

Thursday, July 21, 2011

Malevolent Social Engineering in Open Source Software

The following paper was written in October of 2010 and distributed in draft form to several infrastructure security organizations.  Not a single one made any response or gave any indication that they are considering the problem.  I now publish it openly for indexing by any search engines that happen along.  The date is 21 July 2011.  How long will it be before we hear about these types of attack on the evening news?

The examples are written in a form that many readers may enjoy.  I hope that I have conveyed some sense of the ease with which even quality programmers may be duped by crafty opponents.  Toward the end of the paper things get rather technical.  Don't worry -- when you get to parts you don't understand, you have probably gotten all you need. 



Malevolent Social Engineering in Open Source Software
Brian McMillin

Abstract
A hypothetical attack on core open source software technologies is presented.  Extreme danger lies in the fact that these potentially compromised core technologies may be incorporated into an almost unlimited number of different application programs, created and marketed by unrelated organizations which may be completely unable to determine whether they are distributing malware.  Mitigation strategies are discussed, although none are anticipated to be effective.

Background
Open source software is an increasingly important method of developing modern applications and tools.  In many cases the collaborative work of different authors provides for new features and qualified review that would be impractical for any corporate effort.  The wide availability, ease of use, and inherent peer-review of open source packages makes them tremendously appealing to virtually all developers.

Unfortunately, it is this very collaborative nature and peer-review that opens the door to social manipulation, creating the illusion of quality and safety while masking malevolent software.

The term malware is usually used to mean intentionally malicious software designed to compromise a target system.  From the user’s perspective there is little difference between a system that fails due to deliberate machinations and one which fails simply due to buggy software.  Accidents and intentional attacks have the same effect.  In this analysis I treat both cases equivalently.

Social engineering is a term associated in the public’s mind with spreading computer viruses via email.  Disguising a threat with some desirable or benign coating (a picture of Martina Navratilova, or a valentine from a secret admirer) causes the user to circumvent the computer security system.  A threat that causes a panic reaction can thwart common sense: “Your computer has a VIRUS! Click here to fix it.” or “Your account has been suspended due to suspicious activity.  Click here to sign in and review your transactions.”

Social engineering in the open source software development community can take many forms.  Popular tools or packages, a friendly author in a forum, beta testing opportunities, web-based code snippet libraries - all can be the source of code which may fail to receive the scrutiny that it should.

Modern software systems are far too complex for any individual or organization to adequately evaluate, monitor, test or verify.  Malevolent code can be incredibly compact - sometimes requiring only a single character.

Examples

    Let me begin by emphasizing that there have been no known intentional examples of the use of these attacks to date.  The examples describe accidentally introduced bugs in actual software that could be devastating if carefully placed by a knowledgeable adversary.


Example: Single-Bit Date-Time Bug

The firmware for an access control system was found to contain a single-character typographic error which had the effect of rendering day-of-week calculations inaccurate.  The error was discovered only when employees were unable to enter the building on the first Monday of April.  Supervisor access was not restricted.  Troubleshooting revealed that the system believed that the date was part of a weekend.  The erroneous line of code is reproduced below.

    DB    31,28,31.30,31,30,31,31,30,31,30,31    ; Month Lengths

This example is particularly noteworthy because the error (substitution of a period for a comma) is actually only a single-bit error in the ASCII encoding of over 300K bytes of source code.  The keys are adjacent on the keyboard, and the difference in visual appearance of the characters is minimal (and was barely discernible on the displays being used).
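
The single-bit claim is easy to verify:

    # ',' is 0x2C and '.' is 0x2E; their ASCII codes differ in exactly one bit.
    assert ord(',') == 0x2C and ord('.') == 0x2E
    assert ord(',') ^ ord('.') == 0b00000010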

The error was not discovered during initial development testing because it was dependent on future values of the real-time clock.  Code review did not catch the error because reviewers focused on verifying the "important" things - in this case, the sequence of numeric values - and took the punctuation for granted.  The assembler software failed to indicate an error because an obscure syntactic convention allowed the overloaded period character to be interpreted as a logical AND operation between the two integer values.

The line between accident and malice can be quite fuzzy.  This ambiguity can allow a knowledgeable adversary to obscure an attack by hiding it among hundreds of lines of well-written, clean code.  Furthermore, since the malicious nature of any error can easily be explained as human error, the attacker remains free to try again if discovered.  The peer-review process may even be commended for finding and correcting the error, while giving the adversary additional information to improve the next attack. 



Example: Intel FDIV Bug

The Intel FDIV bug was an error in the floating point division algorithm in certain versions of the Pentium processor.  Apparently, the actual underlying error was confined to five cells in a lookup table that were unintentionally left blank. 

The effect was that software running on these processors would occasionally receive computational results which were in error after the fifth decimal digit.  Subtle errors such as this are extremely difficult to detect - in fact it took a skilled number theorist with great tenacity several months to isolate and demonstrate the problem. 

In any case, it would have been far easier to have certified the processor correctness at the design stage by ensuring that the lookup tables were computed and verified by multiple independent sources prior to production.  In fact, by the time the bug was publicized, Intel had already produced processors using the same algorithm which were free of the bug. 

Thomas R. Nicely discovered and publicized this bug during 1994, and in December of that year Intel recalled and replaced all affected processors.  In his analysis of this situation, he concludes:
Computations which are mission critical, which might affect someone's life or well being, should be carried out in two entirely different ways, with the results checked against each other. While this still will not guarantee absolute reliability, it would represent a major advance. If two totally different platforms are not available, then as much as possible of the calculations should be done in two or more independent ways. Do not assume that a single computational run of anything is going to give correct results---check your work!
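
A trivial sketch of that advice applied to division itself: compute the quotient one way, then check it by an independent route.

    def checked_divide(a, b, tol=1e-12):
        q = a / b                    # first way: the hardware's division
        # Independent check: multiply back and compare against the input.
        if abs(q * b - a) > tol * max(abs(a), 1.0):
            raise ArithmeticError("division self-check failed")
        return q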

The legacy of this bug is that, fifteen years later, many development tools for Intel-based software still include conditional code relating to the recommended work-around for this hardware error.  The mentality that justifies this as “Oh, that’s always been there”, or “It doesn’t do anything, but it can’t hurt” is a symptom of a larger problem.  Unnecessary code is always harmful.  At the very least it allows extra opportunities for undetected corruption to occur.  It fosters an un-critical mind set in the reviewer. 

Code paths that are never executed can never be tested.  Their presence in modern production code should be considered suspicious.  Hiding malware in such untested but ubiquitous code potentially allows for its wide distribution.  Dormant code such as this needs only a suitably crafted trigger to affect all compromised systems.

This five-value error caused an economic impact of $500 million to Intel in 1995, and is still being felt in unquantifiable ways today. 


As the Iranian nuclear program found out, to its own detriment, it is never a good idea to run unverified industrial control software on your black-market enrichment centrifuges.

The precision bearings tend to eat themselves for lunch.

What if the table errors in the floating-point algorithm were not as blatant as being left zeroed out? What if the tables contained a carefully selected number of random errors? And were embedded in the floating-point unit of a counterfeit GPS chip? And that chip happened to find itself in the terminal guidance system of an opponent's missile? And that the only effect was to change the rate of successful position updates from 100 per second to one per second? And the CEP (circular error probable) for the missile went from 3 feet to 3000 feet?

This kind of thing could win or lose a war.

And how could one ever expect to detect such a deeply-embedded, subtle attack?

How much would such an attack cost?

Would it be worth it for an adversary to try?

Example: NASA End-Of-Year Protocol

NASA space shuttles use a voting system of redundant computers for flight control operations.  The machines in the primary voting set are intended to be essentially identical, and each runs identical software.  The intention is to detect and mitigate hardware failures, as these are deemed to be the most likely source of problems during a mission lasting two weeks or so.

Even so, flight rules prevent any shuttle from flying on New Year’s Day, since it is well recognized that the operating software cannot be positively certified to operate correctly when the year changes.  This is especially true when the shuttle orbiter is viewed as a small part of the much larger system involving communications, tracking, navigation, and planning systems which are geographically distributed throughout the world.  Ensuring that every component of this worldwide network will be free of anomalies when the year changes is viewed as an insurmountable problem and an unnecessary risk.



Example: McAfee Automatic Update

On April 21, 2010 McAfee Software released an update to its anti-virus software which incorrectly identified legitimate SVCHOST.EXE operating system files on Microsoft Windows XP systems as the W32/Wecorl.a virus.  Affected systems were locked in an endless reboot sequence and required manual intervention in the form of a local data load by a knowledgeable person to recover.

At least one police department instructed its officers to turn off their patrol car computers to protect them from the McAfee update.  It is unclear why every patrol car should have been running anti-virus software in the first place.  Much greater security and performance could be gained by closing the department’s network and installing proper protection at the gateways.

Anti-virus software is by its very nature a social engineering phenomenon.  The threat of malware and the lack of confidence in our legitimate operating systems and software has led us to the perception that we must install software which slows performance and causes unpredictable and non-deterministic behavior under normal circumstances.  The fact that perfectly good, working systems can have their behavior altered by anti-virus updates on a daily (or perhaps hourly) basis is, in itself, a source of great concern.

The fact that updates are allowed to proceed in an automated mode may be acceptable or even desirable for consumer products.  For dedicated applications or mission-critical systems there is little justification for automatic updates. 



Example: Adobe Flash Player

Much controversy attends the question of Adobe Flash player and HTML 5 features on mobile devices such as the iPhone.  It has been claimed that the Flash player is buggy, a resource hog, and responsible for many system crashes.  The Flash player is a proprietary piece of software implementing a proprietary standard.  It is difficult to understand why the open source community, principally revolving around the Android operating system, seems to be more vocal in its support of Flash than Apple, who champions the open HTML 5 standards. 

In reality, the controversy appears to be an example of social engineering, designed to allow a proprietary standard to maintain dominance in an evolving marketplace. 

It is true that MacroMedia (now Adobe) filled a real and important need by developing Flash in an era when no standard mechanism for animation or rich user interaction existed.  The time has come for such ad-hoc early forays into user interfaces to yield to more mature, carefully designed systems that incorporate the best features discovered so far and meet the requirements of modern systems.

Proprietary systems will always be more vulnerable than open systems due to the limited resources and unknown business priorities of the controlling company.




Example: Zune 30GB Music Player Leap Year Bug

On December 31, 2008, all Microsoft Zune 30GB Music Players failed during the boot sequence.  The software that failed was the Real-Time Clock driver firmware for the Freescale Semiconductor MC13783 Power Management and Audio chip.  Near the end of the boot process, the driver was called to convert the internal Days and Seconds representation of the current time into Year, Month and Day.  On the 366th day of the year, the year-conversion loop would fail to exit, thus causing the device to hang permanently at that point.  The work-around was to allow the batteries to run completely down and to wait until the next day to restart the device.

The problematic driver software was contained in the rtc.c source file provided by Freescale Semiconductor to customers of its products.  The ConvertDays function was missing an else break; statement which would have correctly terminated the loop.  Using the normal formatting conventions adopted by Freescale, this would probably have added two lines to the 767 lines in this file.
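
A rendering of the loop structure described above, in Python for brevity (the original was C firmware; names here are illustrative):

    def is_leap_year(y):
        return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

    def convert_days(days, year=1980):
        while days > 365:
            if is_leap_year(year):
                if days > 366:
                    days -= 366
                    year += 1
                # BUG: when days == 366, neither branch executes, so the
                # loop spins forever.  The missing two lines were:
                #     else:
                #         break
            else:
                days -= 365
                year += 1
        return year, days

    # convert_days(365, 2008) returns normally;
    # convert_days(366, 2008) never returns -- the December 31 hang.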

A second function in this same file, called MX31GetRealTime, uses exactly the same loop structure for year conversion and includes diagnostic message outputs, apparently intended for verifying the calculations.  In the day 366 case, this code would output the (incorrect) message “ERROR calculate day”, and then break the loop.  In other words, if Freescale’s own diagnostics had been used to test the code there would have been a single suspicious message among a flurry of output, but the diagnostic code would not have hung.  If the real code had been tested or simulated on the correct date, the hang would have been discovered.

Note that the chip in question is called a “Power Management and Audio” chip.  Page 2 of Freescale’s Data Sheet lists 17 features for this chip, including battery chargers, regulators, audio amplifiers, CODECs, multiple audio busses, backlight drivers, USB interface and touchscreen interface.  The Real-Time Clock is item 13 of 17 on this list. 

It is clear that this is an example of a catastrophic bug in a “trivial” function, buried deep within mountains of code implementing “important” features.  This code was provided by a trusted supplier.  The features of the chip are so complex (and proprietary) that users (in this case, Microsoft) have little alternative but to accept the supplied code without exhaustive or critical examination. 



Example: Sony Root Kit

In 2005, Sony BMG Music released over 100 titles of music CDs that surreptitiously installed rootkit software on users' computers running Microsoft Windows.  The alleged purpose of this rootkit was to provide copy protection for the music, but in actuality it provided cloaking technology and a back door for malware.  Prior to legal action and the eventual recall of all Sony CDs with the XCP technology, over 500,000 computers were compromised.

The corporate mindset at Sony that viewed their own consumers as an enemy, stark terror in the face of declining sales, and a total naivety concerning computer technology left them vulnerable to manipulation by groups selling Digital Rights Management software. 

In the case of XCP, it also demonstrated that anti-virus services can be manipulated simply by the choice of names used by the malware.  Because it was being distributed by a giant corporation and was covered by the aura of anti-piracy claims, the anti-virus services spent more than one year allowing the infestation to grow.  This despite the fact that, in all respects, the software behaved maliciously by (1) being loaded from a music CD, (2) replacing system files, (3) cloaking registry entries and (4) conducting clandestine communications with a BMG host computer.



Sidebar: A Tirade Against Digital Rights Management Software

Digital Rights Management software may be viewed as malware, in that its purpose is to selectively block access to certain data or programs using arbitrary and unexpected rules.  Any software that behaves differently on one machine than another, or that works one day and not the next, should be viewed with great suspicion. 

DRM software is operationally indistinguishable from malware.  Test and verification of DRM software is, by its very nature, difficult for its own developers.  In addition, the presence of DRM features on a particular system makes the performance of that system essentially impossible to certify. 

Any software that cannot be backed up, restored, and made fully operational at an arbitrary point in the future should not be allowed in a professional development environment.  Software that includes timeouts, or that requires contact with a validation server is not reliable.  Any software whose continued operation is subject to the corporate whims of third parties is fundamentally unsafe.

Programs that include behaviors that are dependent on hardware identity (station names, MAC addresses or IP addresses), date-time values, random or pseudo-random numbers, and cryptographic codes are inherently difficult to verify.  If at all possible, these features, where required, should be carefully isolated from as much of the production code as possible.

Since there can be no universal guarantee of network connectivity or the continued operation of a central server (such as a licensing server), I would argue that any software that implements “time bomb” behavior or otherwise deliberately ceases to function if it does not receive periodic updates should be banned.

Experience has shown that DRM software is generally ineffective in achieving its stated goal, and causes undue hardship to legitimate users of the product.  Development efforts would be much more productive if they were directed toward improving the experience of all users, instead of trying to restrict some users.



Example: Physical Damage to Memory

In the late 1960s the DECsystem-10 used core memory for its primary storage.  There existed a memory diagnostic program designed to find errors in this core memory array.  The diagnostic proceeded to repeatedly read and write sequential locations.  It was found that this diagnostic would almost always find bad locations - even in known good arrays - and that entire rows would be genuinely bad after the diagnostic ran.  Investigation proved that the continuous cycling of the three-Ampere (!) select current pulses was physically burning out the hair-thin select lines in the array.

The memory design engineers had known of this possibility, but discounted it as a failure mode because the system was equipped with a semiconductor memory cache that would prevent repeated operations on the same address.  Naturally, the designer of the memory diagnostic included instructions that explicitly disabled the cache. 

Forty years later, our most modern portable devices use high density NAND flash memory as their storage mechanism of choice.  Flash memory relies on the storage of small quantities of electric charge in tiny cells, and the ability to accurately measure that charge.  In order to store new values in this type of memory, entire pages must be erased and then sequentially written.  The 16GB flash memory used in the iPhone 4 (for example) stores multiple bits in each memory cell using different voltage levels to distinguish values.  The ability of these cells to reliably store and distinguish bits begins to degrade after only 3000 page erase cycles.  Elaborate hardware and software mechanisms exist to detect and correct errors, and to provide alternate memory pages to replace failed areas.  In order to achieve acceptable production and operational yields and longevity, modern error correcting systems are typically capable of correcting 12 or more bit errors in a single block.   Furthermore, wear-leveling algorithms attempt to prevent excessive erase/write cycles on individual pages. 

Unfortunately, the memory management algorithms both in Samsung’s memory controller and in Apple’s iOS4 are proprietary.  Not only are the specifications of the individual subsystems unknown, but the interactions between the two are cause for concern. 

NAND Flash memory suffers from a mode in which repeated reads can indirectly cause adjacent memory cells to change state.  These changed cells will trigger the error detection and correction mechanism and be generally harmless.  It is unknown whether there is a threshold where a large number of bit errors in a page will cause that page to be moved or rewritten, and possibly even marked as bad.  The possibility exists, therefore, that simply reading flash in a pathological manner may result in additional hidden erase/write cycles, or possible additions to the bad block table.

It is also unknown how bad blocks are reported from the hardware to the operating system, and it is unclear how the file system will respond as the available known-good storage shrinks.  Meaningful studies or empirical results are difficult to achieve because of the statistical nature of the underlying failure mode, the number of levels of protection, and the differing implementations of different manufacturers and products. 

All systems should collect and make available absolute, quantitative statistics on the performance of these error detection and correction methods.  We can have no real confidence in a system if we do not know how close we are to the limits of its capabilities.  One thing is certain: “It seems to be working” is a recipe for disaster.

It is not beyond the realm of possibility that suitably malicious software could clandestinely bring virtually every page of the system’s flash memory to the brink of ECC failure and then wait for a trigger to push the system over the edge.

This would be an example of software that can physically damage modern hardware, and leave the user with no recourse but to replace the entire device.




Analysis

It would be preferable for the designers of development tools to strive toward the smallest possible set of features for the use of programmers.  By concentrating on the most frequently needed operations and making them clear and predictable the review process will be simplified.  Obscure or infrequently-used features should be only invoked with great fanfare.  Long keywords or elaborate syntactic requirements will draw attention to the fact that this code is not “business as usual” and deserves careful scrutiny.

Vulnerabilities, Exploits and Triggers

Traditionally, malware such as trojans, worms and viruses has relied on some vulnerability in a computer system’s design, implementation or operation.  Logic errors, unchecked pointers and buffer overflows are examples of vulnerabilities.  In general, the vulnerability is independent of the exploit, the actual malware written by an attacker.  Once introduced into a vulnerable system, the malware may require an additional trigger event to begin malicious execution.  This allows the infection of multiple systems to proceed undetected until a particular date, or a remote command, causes the nefarious code to spring forth.  The trigger always appears in the form of data within the infected system.
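
As a toy illustration (the routine names and the date are invented for this sketch), a trigger can be as mundane as the system clock:

    #include <time.h>

    extern void payload(void);   /* hypothetical malicious routine */

    /* The exploit lies dormant until a chosen date arrives as data
       from the system clock; until then this code is indistinguishable
       from harmless housekeeping. */
    void maybe_activate(void)
    {
        struct tm trigger = { .tm_year = 112, .tm_mon = 0, .tm_mday = 1 };
        if (time(NULL) > mktime(&trigger))   /* after Jan 1, 2012 */
            payload();
    }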

In the present analysis, the distinction between the vulnerability and exploit may appear to be blurred.  A sufficiently knowledgeable adversary may subtly introduce the entire body of malicious code into a large number of different application programs by patiently corrupting core technologies.  Using the definitions above, the actual vulnerability is the software design methodology itself, and the exploit could be virtually any piece of commonly used core software.

My primary thesis involves the social engineering that could be used to corrupt otherwise benign and robust software systems.  A secondary topic is the set of acquired vulnerabilities that have evolved in software development “best practices”: using hardware and software features because “that’s the way you do it”, without any critical reexamination of whether those features actually make sense in the year 2010, or in the application being developed.

Several of these “evolutionary vulnerabilities” are readily apparent.

1.    The use of core open source frameworks by many completely unrelated applications.
2.    The programming style that allows and encourages interleaving of distinct objectives within “tight”, “efficient” or “multi-purpose” functions.
3.    The use of needlessly compact source notation without redundancy or cross-checks.
4.    The practice of allowing access to every data structure that a function MIGHT need to use without explicitly stating that access to a PARTICULAR structure is desired.
5.    Allowing the use of unnecessarily similar variable and function names.
6.    Operator overloading.
7.    Implied namespaces and namespace obfuscation.
8.    Conditional compilation mechanisms.
9.    The inherent untestability of supporting multiple platforms.
10.    Unchecked and unconstrained pointers.
11.    The Stack.
12.    Loops that do not look like loops: callbacks and exceptions.
13.    Dynamic code creation and execution: interpreters.
14.    Portable devices that may operate unmonitored for extended intervals.
15.    Assuming that individual developers are experts in multiple programming languages.
16.    The vulnerability of different programming languages to naive mistakes.
17.    The lack of common version control systems among developers.
18.    The lack of a global cross-reference checking facility.
19.    The lack of inherent range and bounds checking at runtime.
20.    The lack of a central revocation authority.
21.    Automatic update systems themselves.
22.    The lack of a common threat analysis and notification system.
23.    The lack of a mechanism to track the installation of application programs in consumer devices.
24.    The lack of a mechanism to notify consumers of potential threats.
25.    The vulnerability of critical infrastructure to denial-of-service attacks.
26.    Trusted Software Developer Certificates, which may easily be circumvented by simply supplying the trusted developer with malicious tools.


The Stack As An Unnecessary Vulnerability

Since the 1960s, the use of a stack-based architecture has been considered a requirement for computer systems.  The stack provides a convenient storage area for function parameters, return addresses and local variables.  It inherently allows for recursion.  It makes exceptions and hardware interrupts easy to implement.  It minimizes memory use by sharing a single, dynamic area.

In the world of formal logic, recursion is often an elegant and compact way to describe a complex operation.  In the world of computer software, it is almost always a serious mistake.  There are a few cases in which recursion provides an elegant solution, but I contend that the risks of allowing universal recursion far outweigh the rare instances in which any real benefit is derived.  Anything that can be done by recursion can be done by iteration, and usually in a much safer and more controlled fashion.
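
A trivial sketch makes the point.  Both functions below compute the same value, but the recursive form consumes a stack frame per step, so a large or hostile argument can exhaust the stack, while the iterative form uses constant, statically knowable storage:

    #include <stdint.h>

    /* Recursive form: one stack frame per step; depth is data-dependent. */
    uint64_t factorial_recursive(unsigned n)
    {
        return (n <= 1) ? 1 : n * factorial_recursive(n - 1);
    }

    /* Iterative form: constant storage.  The worst a bad n can do is
       overflow the arithmetic result, never the stack. */
    uint64_t factorial_iterative(unsigned n)
    {
        uint64_t result = 1;
        while (n > 1)
            result *= n--;
        return result;
    }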

In the absence of recursion, the maximum calling depth can always be computed prior to execution of any given function.  In the best case, this could be done with a static calling-tree analysis by the compiler or linker.  In the worst case, calls pass through dynamic linkages and the program loader must perform the analysis.  Knowing the possible calling tree implies that the actual maximum possible memory requirement can also be derived.  It thus becomes unnecessary to specify arbitrary stack-space allocations, and programs can be treated in a much more deterministic manner.
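
A hypothetical sketch of such an analysis, with the call graph and frame sizes hard-coded (in practice they would come from the compiler or linker):

    #include <stdio.h>

    /* Toy call graph: main calls f and g; f calls h.  No recursion,
       so the graph is acyclic and worst-case stack use is computable. */
    enum { MAIN, F, G, H, NFUNCS };

    static const int frame_size[NFUNCS] = { 64, 128, 32, 256 };  /* bytes */
    static const int calls[NFUNCS][NFUNCS] = {
        [MAIN] = { [F] = 1, [G] = 1 },
        [F]    = { [H] = 1 },
    };

    /* Worst-case stack need of fn: its own frame plus its deepest callee.
       (The analyzer may itself recurse safely over the acyclic graph;
       a production tool would memoize or iterate.) */
    static int worst_case(int fn)
    {
        int deepest = 0;
        for (int callee = 0; callee < NFUNCS; callee++)
            if (calls[fn][callee]) {
                int need = worst_case(callee);
                if (need > deepest)
                    deepest = need;
            }
        return frame_size[fn] + deepest;
    }

    int main(void)
    {
        printf("worst-case stack: %d bytes\n", worst_case(MAIN));
        return 0;
    }

Here the deepest path (main to f to h) needs 448 bytes, less than the 480 bytes of all four frames combined, because g’s frame can overlap the space used by f and h, which are never live at the same time as g.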

The fallacy of Mixing Data and Code Addresses - Modern hardware implements a single stack for each thread of execution.  Programs use machine instructions to load function parameters and local variables into memory in the allocated stack area.  Call and Return operations place a program address in the same stack area.  This shared allocation is the vulnerability used by most “Arbitrary Code Execution” exploits.  It is completely unnecessary for the list of return addresses to share a memory segment with function parameters and local data.  If this “conventional wisdom” were thoroughly reexamined, virtually all buffer-overrun exploits would be eliminated at the hardware level.  Data could still be wildly corrupted, but the flow of program execution would not be accessible to an attacker.
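
The canonical illustration is the unchecked copy into a stack buffer; the code below is deliberately unsafe, for illustration only:

    #include <string.h>

    /* The 16-byte buffer lives in the same stack region as the saved
       return address.  An input longer than the buffer overwrites that
       address, and on return the processor jumps wherever the
       attacker's bytes point. */
    void greet(const char *name)
    {
        char buffer[16];
        strcpy(buffer, name);   /* no length check */
    }

Had the return addresses lived on a separate, hardware-protected stack, the same bug would still corrupt the buffer’s neighbors, but it could not redirect execution.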

The fallacy of Necessary Recursion - The vast majority of functions in a modern application have clearly defined, static calling trees.  These functions have no need for any recursive features, and any recursion indicates a flaw.  The fact that modern languages automatically allow and encourage recursion means that recursion is an Error-Not-Caught in almost all cases.  It does not seem unreasonable to require that recursion (both direct and indirect) be indicated by some affirmative notation from the programmer.

The fallacy of Saving Memory - The lack of static calling-tree analysis and the assumption of recursion mean that arbitrarily sized segments are allocated to the stack.  Arbitrary allocations are always erroneous and lead to the mistaken impression that the software is reliable.  No one actually knows how close a system is to a stack overflow.  Unnecessary memory allocation is a waste of resources and leaves a memory area where undetected malware can reside.

The contention that the stack architecture saves memory is one of the elementary explanations of its appeal.  This might be true if the alternative were a naive implementation in which all function parameters and locals were concurrently allocated from global memory.  But calling-tree analysis can be used to allocate parameter frames statically, and yet use only an amount of memory identical to the worst case of the actual calling pattern.

The fallacy of Hardware Interrupts - In order to achieve any degree of security, modern systems always switch stacks when a hardware interrupt is encountered.  Thus, no more than a rudimentary allocation need be made in the application’s memory space.

The fallacy of Dynamic Stack Frames - Virtually all modern code computes parameters and pushes them onto the stack prior to a function call.  The functions allocate space for local variables by further adjustments to the stack pointer.  These dynamically-allocated stack frames are a source of needless, repetitive code that could be eliminated in many cases by static frame allocation and intelligent code optimization.  Again, static calling-tree analysis is used to determine the required allocation of these frame areas.

The fallacy of the Memory Dump - It is assumed that memory dumps are a useful tool for crash analysis and code verification.  In reality, the stack architecture’s immediate reuse of memory areas for consecutive function calls means that the internal state of any function is destroyed shortly after that function exits.  If the stack frames were statically allocated, the system would tend to preserve parameters and local variables after the completion of any particular function.  The implementations of exception-handling functions (or the dump facility itself) could easily be marked to use frames outside the normal (overlapping) frame area.

The open source development community is an ideal place to implement advanced compiler / linker / loader technology that revises the calling conventions used by modern software.  Every application that misbehaves when the calling conventions are changed is an application that was most likely harboring unrecognized design flaws.  Consider this an opportunity to radically improve all open source software with a single paradigm shift.

Hardware and software systems have grown mostly by accretion over the years.  The goal has almost universally been expediency: make it run fast and get it done now!  Little thought has been given to mitigating common sources of error, except in academic circles. 

Much effort goes into testing, primarily to validate the interoperability of various software modules or systems.  In general the goal is to ensure that changes made to a new version do not break features of a previously certified application.

In the biological world, organisms develop resistance to antibiotics through exposure.  Malware - whether accidental or intentional - will grow and thrive at the boundaries of the test cases.  Such malware may spread in a benign form for long periods, only to be triggered into an active form by a possibly innocuous event.


Recommendations

It has been demonstrated that it will be essentially impossible to exclude the accidental or deliberate introduction of malicious behavior into software during its development and maintenance. 

Therefore, instead of trying to control humans and their behavior, it would seem reasonable to treat the software itself as the adversary.  If every line of code, piece of data and linked module were considered a threat, it might be possible to develop high-quality threat-abatement tools with a better chance of success than other approaches.

The open source community is the perfect place to develop such mitigation strategies.  Proprietary software development efforts lack the resources, and tend to hide, deny and fail to document vulnerabilities.  Open source developers have the opportunity to take both white hat and black hat roles.  Adding test cases that succeed or fail in different implementations is a valuable contribution to the robustness of any software.  Such continuing development of both code and validation cases should be the norm.  Improvement should be continuous and incremental, without the need for monthly “Critical Updates” or other disruptive strategies that are unevenly applied and of questionable effectiveness.

1.    Software development methodology
    a.    Require the Designer to provide a complete natural-language functional specification document for all software systems, modules and functions, as well as example test cases.
    b.    Require software to be written exactly to specification by at least two independent development groups, none of which is the Designer of the specification.  Preferably, the implementations will be in different programming languages.
    c.    Disallow direct communication between independent development groups.
    d.    Resolve ambiguities and conflicts between implementations by changes to the specification document, incorporated exclusively by the Designer.
    e.    Require each development group to provide test cases which are not shared with other development groups.
    f.    Provide each development group’s software to a Validation group which is not privy to the specifications.  The Validation group runs (a minimal harness is sketched after the end of this list):
        i.    Stress tests with all known test cases,
        ii.    Stress tests with random inputs,
        iii.    Stress tests with random structures and data types,
        iv.    Stress tests with all supported operating environments.
        v.    Expect all results from each group to be identical.
            (1)    This implies detecting all changes to global memory and confirming that they are allowed and intended.
            (2)    Include range and sanity checks for all returned values.
    g.    The Validation group will record all resource utilization, including speed, memory usage, and external communication.
        i.    Resource utilization, including external memory and references, must be identical.
        ii.    Every failed validation must be documented and traced to its origin.  The nature of the original error must be identified and shared.  Repeated problem areas should be studied and mitigation methods developed.
    h.    One implementation will be chosen for production use, perhaps based on speed, compactness or programming language.  The alternative implementations will be available for validation testing of higher-level modules. 
    i.    New features and future versions will start with changes to the specification by the Designer and will end with comparison of recorded resource utilizations.
        i.    Any changes in resource utilization from one version to the next, especially global references, must be properly confirmed.

2.    Stick with one set of development tools.  Do not change the core library that your developers use every time a new release comes out.  Validation and version control are needlessly complicated if third parties can randomly revise any piece of your software.

3.    Use a version control system that captures every piece of software, tool, source file, header file, library, test file, etc. necessary to build and test each release candidate. 
    a.    Build the final release version on an independent system with a clean OS installation using only the files extracted from the version control system.
    b.    At the very least, when the inevitable disaster strikes it will be possible to identify the versions of your software that are affected.

4.    Develop a runtime linkage system capable of swapping out implementations of a particular function or module on the fly (a minimal sketch follows this list).
    a.    In the verification process, this would allow the verification system to generate random switches between implementations and ensure continued correct operation of the system.
    b.    In the operational case, normally only one implementation of each function would be distributed.  This mechanism would allow for the distribution of software updates into running systems without requiring a reboot in many cases.
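
As a minimal, hypothetical sketch of the validation harness described in item 1f (the function and group names are invented), two independently written implementations of the same specification are driven with random inputs and required to agree:

    #include <stdio.h>
    #include <stdlib.h>

    /* Two implementations of the same specification (here, integer
       square root), written by independent groups and linked in from
       separate objects. */
    extern unsigned isqrt_group_a(unsigned n);
    extern unsigned isqrt_group_b(unsigned n);

    int main(void)
    {
        srand(12345);                       /* fixed seed: reproducible runs */
        for (long i = 0; i < 1000000; i++) {
            unsigned n = (unsigned)rand();
            unsigned a = isqrt_group_a(n);
            unsigned b = isqrt_group_b(n);
            if (a != b) {                   /* implementations disagree */
                printf("DISAGREE: n=%u  A=%u  B=%u\n", n, a, b);
                return EXIT_FAILURE;
            }
            /* range and sanity check on the returned value */
            if ((unsigned long long)a * a > n) {
                printf("RANGE: n=%u  result=%u\n", n, a);
                return EXIT_FAILURE;
            }
        }
        puts("1000000 random cases: implementations agree");
        return EXIT_SUCCESS;
    }

Any disagreement is either a bug or an ambiguity in the specification; per item 1d, it is resolved by a change to the specification document, never by the groups conferring.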
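
Similarly, a hypothetical sketch of the runtime linkage system of item 4, using the POSIX dynamic loader (the shared-object names are invented; a production system would add locking and a versioned hand-off protocol):

    #include <dlfcn.h>
    #include <stdio.h>

    typedef unsigned (*isqrt_fn)(unsigned);

    /* Load (or reload) an implementation from a shared object that
       exports the agreed symbol.  Handles are deliberately not closed,
       since the old code may still be executing. */
    static isqrt_fn load_isqrt(const char *so_path)
    {
        void *handle = dlopen(so_path, RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return NULL;
        }
        return (isqrt_fn)dlsym(handle, "isqrt");
    }

    int main(void)
    {
        isqrt_fn isqrt = load_isqrt("./isqrt_group_a.so");
        if (!isqrt) return 1;
        printf("A: isqrt(10) = %u\n", isqrt(10));

        /* swap in the alternate implementation without a restart */
        isqrt = load_isqrt("./isqrt_group_b.so");
        if (!isqrt) return 1;
        printf("B: isqrt(10) = %u\n", isqrt(10));
        return 0;
    }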

“What I tell you three times is true.”
The Hunting of the Snark
- Lewis Carroll

These suggestions may seem onerous, especially to small developers.  In fact, this type of approach can be implemented with only four individuals: a Designer, two Developers and a Validator.  These roles may be traded for each different module or feature of a project.  Far from increasing effort or time-to-market, the improved documentation, cross-training and more robust final product arguably reduce overall development effort.  New employees can be of immediate use, and can be rapidly integrated into the corporate or community structure by assuming any one of the roles, without the need for a lengthy training period.

Converting software to another language or porting it to different hardware will be greatly simplified by the comprehensive documentation and test cases inherent in this method.  And identifying the ramifications of bugs (detected by whatever means) will be faster and more thorough if the development tools can easily generate a list of all software and modules that use a given feature.