The following paper was written in October of 2010 and distributed in draft form to several infrastructure security organizations. Not a single one made any response or gave any indication that they were considering the problem. I now publish it openly for indexing by any search engines that happen along. The date is 21 July 2011. How long will it be before we hear about these types of attack on the evening news?
The examples are written in a form that many readers may enjoy. I hope that I have conveyed some sense of the ease with which even quality programmers may be duped by crafty opponents. Toward the end of the paper things get rather technical. Don't worry -- when you get to parts you don't understand, you have probably gotten all you need.
Brian McMillin
Abstract
A hypothetical attack on core open source software technologies is presented. Extreme danger lies in the fact that these potentially compromised core technologies may be incorporated into an almost unlimited number of different application programs, unknowingly created and marketed by unrelated organizations which may be completely unable to determine if they are distributing malware. Mitigation strategies are discussed, although none are anticipated to be effective.
Background
Open source software is an increasingly important method of developing modern applications and tools. In many cases the collaborative work of different authors provides for new features and qualified review that would be impractical for any corporate effort. The wide availability, ease of use, and inherent peer-review of open source packages make them tremendously appealing to virtually all developers.
Unfortunately, it is this very collaborative nature and peer-review that opens the door for social manipulation and creating the illusion of quality and safety while masking malevolent software.
The term malware is usually used to mean intentionally malicious software designed to compromise a target system. From the user’s perspective there is little difference between a system that fails due to deliberate machinations and one which fails simply due to buggy software. Accidents and intentional attacks have the same effect. In this analysis I treat both cases equivalently.
Social engineering is a term associated in the public’s mind with spreading computer viruses via email. Disguising a threat with some desirable or benign coating (a picture of Martina Navratilova, or a valentine from a secret admirer) causes the user to circumvent the computer security system. A threat that causes a panic reaction can thwart common sense: “Your computer has a VIRUS! Click here to fix it.” or “Your account has been suspended due to suspicious activity. Click here to sign in and review your transactions.”
Social engineering in the open source software development community can take many forms. Popular tools or packages, a friendly author in a forum, beta testing opportunities, web-based code snippet libraries - all can be the source of code which may fail to receive the scrutiny that it should.
Modern software systems are far too complex for any individual or organization to adequately evaluate, monitor, test or verify. Malevolent code can be incredibly compact - sometimes requiring only a single character.
Examples
Let me begin by emphasizing that, to date, there are no known cases in which these attacks have been used intentionally. The examples describe accidentally introduced bugs in actual software that could be devastating if carefully placed by a knowledgeable adversary.
Example: Single-Bit Date-Time Bug
The firmware for an access control system was found to contain a single-character typographic error which had the effect of rendering day-of-week calculations inaccurate. The error was discovered during testing, only when employees were unable to enter the building on the first Monday of April. Supervisor access was not restricted. Troubleshooting revealed that the system believed that the date was part of a weekend. The erroneous line of code is reproduced below.
DB 31,28,31.30,31,30,31,31,30,31,30,31 ; Month Lengths
This example is particularly noteworthy because the error (substitution of a period for a comma) is actually only a single-bit error in the ASCII encoding of over 300K bytes of source code: the comma is 0x2C and the period is 0x2E. The keys on the keyboard are adjacent, and the difference in visual appearance of the characters is minimal (and was barely discernible on the displays being used).
The error was not discovered during initial development testing because it was dependent on future values of the real-time clock. Code review did not catch the error because reviewers focused on verifying the “important” things - in this case the sequence of numeric values, and took for granted the punctuation. The assembler software failed to indicate an error because an obscure syntactic convention allowed the overloaded period character to be interpreted as a logical AND operation between the two integer values.
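The arithmetic behind this failure can be checked directly. The following Python sketch assumes, as described above, that the assembler evaluated the period as a bitwise AND of its two neighbors. The variable and function names are illustrative and are not taken from the firmware.

```python
# Illustrative check of the single-character error; all names are hypothetical.
COMMA, PERIOD = ord(","), ord(".")

# The two characters differ in exactly one bit of their ASCII codes.
differing_bits = bin(COMMA ^ PERIOD).count("1")

intended = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
# "31.30" read as (31 AND 30) collapses two table entries into one.
as_assembled = [31, 28, 31 & 30, 31, 30, 31, 31, 30, 31, 30, 31]

def days_before(month: int, table: list[int]) -> int:
    """Days elapsed in a non-leap year before the 1st of `month` (1-based)."""
    return sum(table[: month - 1])

print(differing_bits)                                          # 1
print(len(intended), len(as_assembled))                        # 12 11
print(days_before(4, intended), days_before(4, as_assembled))  # 90 89
```

Under these assumptions every date from April onward is computed one day early. If the firmware derived the day of week from a running day count (an assumption; the source does not say how it was computed), a Monday in April would be taken for a Sunday, which matches the reported symptom.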
The line between accident and malice can be quite fuzzy. This ambiguity can allow a knowledgeable adversary to obscure an attack by hiding it among hundreds of lines of well-written, clean code. Furthermore, since the malicious nature of any error can easily be explained as human error, the attacker remains free to try again if discovered. The peer-review process may even be commended for finding and correcting the error, while giving the adversary additional information to improve the next attack.
Example: Intel FDIV Bug
The Intel FDIV bug was an error in the floating point division algorithm in certain versions of the Pentium processor. Apparently, the actual underlying error was confined to five cells in a lookup table that were unintentionally left blank.
The effect was that software running on these processors would occasionally receive computational results which were in error after the fifth decimal digit. Subtle errors such as this are extremely difficult to detect - in fact it took a skilled number theorist with great tenacity several months to isolate and demonstrate the problem.
In any case, it would have been far easier to have certified the processor correctness at the design stage by ensuring that the lookup tables were computed and verified by multiple independent sources prior to production. In fact, by the time the bug was publicized, Intel had already produced processors using the same algorithm which were free of the bug.
Thomas R. Nicely discovered and publicized this bug during 1994, and in December of that year Intel agreed to replace all affected processors on request. In his analysis of this situation, he concludes:
Computations which are mission critical, which might affect someone's life or well being, should be carried out in two entirely different ways, with the results checked against each other. While this still will not guarantee absolute reliability, it would represent a major advance. If two totally different platforms are not available, then as much as possible of the calculations should be done in two or more independent ways. Do not assume that a single computational run of anything is going to give correct results---check your work!
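Nicely's advice can be sketched in a few lines. The Python example below is only an illustration of the principle, not a reconstruction of his method: it performs a division on the floating-point path and again in exact rational arithmetic, and refuses to return a value when the two disagree. The function name, the `divide` parameter and the tolerance are assumptions made for this sketch. The operands 4195835 and 3145727 are the pair widely used to demonstrate the flaw.

```python
import operator
from fractions import Fraction

def cross_checked_divide(x: int, y: int, divide=operator.truediv,
                         rel_tol: float = 1e-12) -> float:
    """Divide two ways and refuse to return a result the two paths disagree on."""
    quotient = divide(x, y)        # path under suspicion (normally the hardware divider)
    exact = Fraction(x, y)         # independent path: exact rational arithmetic
    # The comparison itself is done exactly, so it does not reuse the float divider.
    error = abs(Fraction(quotient) - exact)
    if error > abs(exact) * Fraction(rel_tol):
        raise ArithmeticError(f"division paths disagree for {x}/{y}")
    return quotient

print(cross_checked_divide(4195835, 3145727))
```

Passing a deliberately wrong `divide` function, for example one that returns a constant, causes the check to raise instead of silently delivering a bad quotient. The second path costs time, which is exactly the trade Nicely recommends for mission-critical work.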
The legacy of this bug is that, fifteen years later, many development tools for Intel-based software still include conditional code relating to the recommended work-around for this hardware error. The mentality that justifies this as “Oh, that’s always been there”, or “It doesn’t do anything, but it can’t hurt” is a symptom of a larger problem. Unnecessary code is always harmful. At the very least it allows extra opportunities for undetected corruption to occur. It fosters an uncritical mindset in the reviewer.
Code paths that are never executed can never be tested. Their presence in modern production code should be considered suspicious. Hiding malware in such untested but ubiquitous code potentially allows for its wide distribution. Dormant code such as this needs only a suitably crafted trigger to affect all compromised systems.
This five-value error caused an economic impact of $500 million to Intel in 1995, and is still being felt in unquantifiable ways today.
As the Iranian nuclear program found out, to its own detriment, it is never a good idea to run unverified industrial control software on your black-market enrichment centrifuges.
The precision bearings tend to eat themselves for lunch.
What if the table errors in the floating-point algorithm were not as blatant as being left zeroed out? What if the tables contained a carefully selected number of random errors? And were embedded in the floating-point unit of a counterfeit GPS chip? And that chip happened to find itself in the terminal guidance system of an opponent's missile? And that the only effect was to change the rate of successful position updates from 100 per second to one per second? And the CEP (circular error probable) for the missile went from 3 feet to 3000 feet?
This kind of thing could win or lose a war.
And how could one ever expect to detect such a deeply-embedded, subtle attack?
How much would such an attack cost?
Would it be worth it for an adversary to try?
Example: NASA End-Of-Year Protocol
NASA space shuttles use a voting system of three active and two standby computers for flight control operations. Each of these systems is intended to be essentially identical, and each runs identical software. The intention is to detect and mitigate hardware failures, as these are deemed to be the most likely source of problems during a mission lasting two weeks or so.
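The limitation of voting among identical systems can be shown with a toy model. The Python sketch below is not a description of the actual flight software. The `control_law` function and its `software_bug` flag are hypothetical stand-ins for any computation and any defect shared by all channels.

```python
from collections import Counter

def vote(outputs):
    """Return the majority value among redundant channels, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def control_law(x, software_bug=False):
    # Hypothetical computation; the bug stands in for any shared software defect.
    return x * 2 + (1 if software_bug else 0)

# A hardware fault corrupts one channel: the other two outvote it.
hardware_fault = [control_law(10), control_law(10), 999]
# A software defect is present in every channel: the vote is unanimous and wrong.
common_bug = [control_law(10, True)] * 3

print(vote(hardware_fault), vote(common_bug))   # 20 21
```

Voting masks the independent failure but endorses the common one, which is why identical software on every channel protects against hardware faults only.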
Even so, flight rules prevent any shuttle from flying on New Year’s Day, since it is well recognized that the operating software cannot be positively certified to operate correctly when the year changes. This is especially true when the shuttle orbiter is viewed as a small part of the much larger system involving communications, tracking, navigation, and planning systems which are geographically distributed throughout the world. Ensuring that every component of this worldwide network will be free of anomalies when the year changes is viewed as an insurmountable problem and an unnecessary risk.
Example: McAfee Automatic Update
On April 21, 2010 McAfee Software released an update to its anti-virus software which incorrectly identified legitimate SVCHOST.EXE operating system files on Microsoft Windows XP systems as the W32/Wecorl.a virus. Affected systems were locked in an endless reboot sequence and required manual intervention in the form of a local data load by a knowledgeable person to recover.
At least one police department instructed its officers to turn off their patrol car computers to protect them from the McAfee update. It is unclear why every patrol car should have been running anti-virus software in the first place. Much greater security and performance could be gained by closing the department’s network and installing proper protection at the gateways.
Anti-virus software is by its very nature a social engineering phenomenon. The threat of malware and the lack of confidence in our legitimate operating systems and software have led us to the perception that we must install software which slows performance and causes unpredictable and non-deterministic behavior under normal circumstances. The fact that perfectly good, working systems can have their behavior altered by anti-virus updates on a daily (or perhaps hourly) basis is, in itself, a source of great concern.
The fact that updates are allowed to proceed in an automated mode may be acceptable or even desirable for consumer products. For dedicated applications or mission-critical systems there is little justification for automatic updates.
Example: Adobe Flash Player
Much controversy attends the question of Adobe Flash player and HTML 5 features on mobile devices such as the iPhone. It has been claimed that the Flash player is buggy, a resource hog, and responsible for many system crashes. The Flash player is a proprietary piece of software implementing a proprietary standard. It is difficult to understand why the open source community, principally revolving around the Android operating system, seems to be more vocal in its support of Flash than Apple, who champions the open source HTML 5 standards.
In reality, the controversy appears to be an example of social engineering, designed to allow a proprietary standard to maintain dominance in an evolving marketplace.
It is true that Macromedia (now Adobe) filled a real and important need by developing Flash in an era when no standard mechanism for animation or user interaction with computers existed. The time has come for such ad-hoc early forays into user interfaces to yield to more mature, carefully designed systems that incorporate the best features discovered so far and meet the requirements of modern systems.
Proprietary systems will always be more vulnerable than open systems due to the limited resources and unknown business priorities of the controlling company.
Example: Zune 30GB Music Player Leap Year Bug
On December 31, 2008, all Microsoft Zune 30GB Music Players failed during the boot sequence. The software that failed was the Real-Time Clock driver firmware for the Freescale Semiconductor MC13783 Power Management and Audio chip. Near the end of the boot process, the driver was called to convert the internal Days and Seconds representation of the current time into Year, Month and Day. On the 366th day of the year, the year-conversion loop would fail to exit, thus causing the device to hang permanently at that point. The work-around was to allow the batteries to run completely down and to wait until the next day to restart the device.
The problematic driver software was contained in the rtc.c source file provided by Freescale Semiconductor to customers of its products. The ConvertDays function was missing an else break; statement which would have correctly terminated the loop. Using the normal formatting conventions adopted by Freescale, this would probably have added two lines to the 767 lines in this file.
A second function in this same file, called MX31GetRealTime, uses exactly the same loop structure for year conversion and includes diagnostic message outputs, apparently intended for verifying the calculations. In the day 366 case, this code would output the (incorrect) message “ERROR calculate day”, and then break the loop. In other words, if Freescale’s own diagnostics had been used to test the code there would have been a single suspicious message among a flurry of output, but the diagnostic code would not have hung. If the real code had been tested or simulated on the correct date, the hang would have been discovered.
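The failure described above is easy to reproduce in miniature. The Python sketch below is modeled on the loop structure as described here and in widely circulated accounts of the bug. It is not the Freescale source. The 1980 epoch, the function names and the iteration guard are assumptions added so that the example can run without hanging.

```python
ORIGIN_YEAR = 1980  # assumed epoch: day 1 is 1 January 1980

def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def convert_days(days, fixed, max_iterations=1000):
    """Return (year, day_of_year), or None if the loop fails to terminate."""
    year = ORIGIN_YEAR
    for _ in range(max_iterations):      # guard added so this sketch cannot hang
        if days <= 365:
            return year, days
        if is_leap_year(year):
            if days > 366:
                days -= 366
                year += 1
            elif fixed:
                return year, days        # plays the role of the missing "else break;"
            # Buggy path: days == 366 in a leap year changes nothing, so the loop spins.
        else:
            days -= 365
            year += 1
    return None

DEC_31_2008 = 10593   # 28 years from 1980 (7 of them leap) plus 366 days
print(convert_days(DEC_31_2008 - 1, fixed=False))  # (2008, 365)
print(convert_days(DEC_31_2008, fixed=False))      # None: the loop never exits
print(convert_days(DEC_31_2008, fixed=True))       # (2008, 366)
```

Only one input value in every four years exposes the defect, so a test run on any other date passes.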
Note that the chip in question is called a “Power Management and Audio” chip. Page 2 of Freescale’s Data Sheet lists 17 features for this chip, including battery chargers, regulators, audio amplifiers, CODECs, multiple audio busses, backlight drivers, USB interface and touchscreen interface. The Real-Time Clock is item 13 of 17 on this list.
It is clear that this is an example of a catastrophic bug in a “trivial” function, buried deep within mountains of code implementing “important” features. This code was provided by a trusted supplier. The features of the chip are so complex (and proprietary) that users (in this case, Microsoft) have little alternative but to accept the supplied code without exhaustive or critical examination.
Example: Sony Root Kit
In 2005, Sony BMG Music released over 100 titles of music CDs that surreptitiously installed rootkit software on users’ computers running Microsoft Windows. The alleged purpose of this rootkit was to provide copy protection for the music, but in actuality it provided cloaking technology and a back door for malware. Prior to legal action and the eventual recall of all Sony CDs with the XCP technology, over 500,000 computers were compromised.
The corporate mindset at Sony that viewed their own consumers as an enemy, stark terror in the face of declining sales, and a total naivety concerning computer technology left them vulnerable to manipulation by groups selling Digital Rights Management software.
In the case of XCP, it also demonstrated that anti-virus services can be manipulated simply by the choice of names used by the malware. Because it was being distributed by a giant corporation and was covered by the aura of anti-piracy claims, the anti-virus services spent more than one year allowing the infestation to grow. This despite the fact that, in all respects, the software behaved maliciously by (1) being loaded from a music CD, (2) replacing system files, (3) cloaking registry entries and (4) conducting clandestine communications with a BMG host computer.
Sidebar: A Tirade Against Digital Rights Management Software
Digital Rights Management software may be viewed as malware, in that its purpose is to selectively block access to certain data or programs using arbitrary and unexpected rules. Any software that behaves differently on one machine than another, or that works one day and not the next, should be viewed with great suspicion. DRM software is operationally indistinguishable from malware.
Test and verification of DRM software is, by its very nature, difficult for its own developers. In addition, the presence of DRM features on a particular system makes the performance of that system essentially impossible to certify.
Any software that cannot be backed up, restored, and made fully operational at an arbitrary point in the future should not be allowed in a professional development environment. Software that includes timeouts, or that requires contact with a validation server is not reliable. Any software whose continued operation is subject to the corporate whims of third parties is fundamentally unsafe.
Programs that include behaviors that are dependent on hardware identity (station names, MAC addresses or IP addresses), date-time values, random or pseudo-random numbers, and cryptographic codes are inherently difficult to verify. If at all possible, these features, where required, should be carefully isolated from as much of the production code as possible.
Since there can be no universal guarantee of network connectivity or the continued operation of a central server (such as a licensing server), I would argue that any software that implements “time bomb” behavior or otherwise deliberately ceases to function if it does not receive periodic updates should be banned.
Experience has shown that DRM software is generally ineffective in achieving its stated goal, and causes undue hardship to legitimate users of the product. Development efforts would be much more productive if they were directed toward improving the experience of all users, instead of trying to restrict some users.
Example: Physical Damage to Memory
In the late 1960's the DECsystem 10 used core memory for its primary storage. There existed a memory diagnostic program designed to find errors in this core memory array. The diagnostic proceeded to repeatedly read and write sequential locations. It was found that this diagnostic would almost always find bad locations - even in known good arrays - and that entire rows would be genuinely bad after the diagnostic ran. Investigation proved that the continuous cycling of the three Ampere (!) select current pulses was physically burning out the hair-thin select lines in the array.
The memory design engineers had known of this possibility, but discounted it as a failure mode because the system was equipped with a semiconductor memory cache that would prevent repeated operations on the same address. Naturally, the designer of the memory diagnostic included instructions that explicitly disabled the cache.
Forty years later, our most modern portable devices use high density NAND flash memory as their storage mechanism of choice. Flash memory relies on the storage of small quantities of electric charge in tiny cells, and the ability to accurately measure that charge. In order to store new values in this type of memory, entire pages must be erased and then sequentially written. The 16GB flash memory used in the iPhone 4 (for example) stores multiple bits in each memory cell using different voltage levels to distinguish values. The ability of these cells to reliably store and distinguish bits begins to degrade after only 3000 page erase cycles. Elaborate hardware and software mechanisms exist to detect and correct errors, and to provide alternate memory pages to replace failed areas. In order to achieve acceptable production and operational yields and longevity, modern error correcting systems are typically capable of correcting 12 or more bit errors in a single block. Furthermore, wear-leveling algorithms attempt to prevent excessive erase/write cycles on individual pages.
Unfortunately, the memory management algorithms both in Samsung’s memory controller and in Apple’s iOS4 are proprietary. Not only are the specifications of the individual subsystems unknown, but the interactions between the two are cause for concern.
NAND Flash memory suffers from a mode in which repeated reads can indirectly cause adjacent memory cells to change state. These changed cells will trigger the error detection and correction mechanism and be generally harmless. It is unknown whether there is a threshold where a large number of bit errors in a page will cause that page to be moved or rewritten, and possibly even marked as bad. The possibility exists, therefore, that simply reading flash in a pathological manner may result in additional hidden erase/write cycles, or possible additions to the bad block table.
It is also unknown how bad blocks are reported from the hardware to the operating system, and it is unclear how the file system will respond as the available known-good storage shrinks. Meaningful studies or empirical results are difficult to achieve because of the statistical nature of the underlying failure mode, the number of levels of protection, and the differing implementations of different manufacturers and products.
All systems should collect and make available absolute, quantitative statistics on the performance of these error detection and correction methods. We can have no real confidence in a system if we do not know how close we are to the limits of its capabilities. One thing is certain: “It seems to be working” is a recipe for disaster.
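The kind of statistic asked for here need not be elaborate. The Python sketch below is a hypothetical bookkeeping layer, not any vendor's interface: it records the largest number of bits the ECC has had to correct in each block and reports how much correction capacity remains. The limit of 12 bits is taken from the figure quoted above; the class name, method names and warning threshold are assumptions.

```python
from dataclasses import dataclass, field

ECC_LIMIT_BITS = 12   # assumed correctable bits per block, per the figure quoted above

@dataclass
class EccStats:
    # Maps block number to the most bits ever corrected in a single read of that block.
    worst_corrected: dict = field(default_factory=dict)

    def record_read(self, block: int, corrected_bits: int) -> None:
        previous = self.worst_corrected.get(block, 0)
        self.worst_corrected[block] = max(previous, corrected_bits)

    def headroom(self, block: int) -> int:
        """Correctable bits still in reserve for this block."""
        return ECC_LIMIT_BITS - self.worst_corrected.get(block, 0)

    def at_risk(self, min_headroom: int = 3) -> list:
        """Blocks whose remaining margin has fallen below `min_headroom` bits."""
        return sorted(b for b in self.worst_corrected if self.headroom(b) < min_headroom)

stats = EccStats()
stats.record_read(block=7, corrected_bits=2)
stats.record_read(block=7, corrected_bits=10)
stats.record_read(block=9, corrected_bits=1)
print(stats.headroom(7), stats.at_risk())   # 2 [7]
```

A system that exposed even this much would let a user see a block approaching the limit long before “it seems to be working” stops being true.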
It is not beyond the realm of possibility that suitably malicious software could clandestinely bring virtually every page of the system’s flash memory to the brink of ECC failure and then wait for a trigger to push the system over the edge.
This would be an example of software that can physically damage modern hardware, and leave the user with no recourse but to replace the entire device.
Analysis
It would be preferable for the designers of development tools to strive toward the smallest possible set of features for the use of programmers. By concentrating on the most frequently needed operations and making them clear and predictable the review process will be simplified. Obscure or infrequently-used features should be invoked only with great fanfare. Long keywords or elaborate syntactic requirements will draw attention to the fact that this code is not “business as usual” and deserves careful scrutiny.
Vulnerabilities, Exploits and Triggers
Traditionally, malware such as trojans, worms and viruses has relied on some vulnerability in a computer system’s design, implementation or operation. Logic errors, unchecked pointers and buffer overflows are examples of vulnerabilities. In general this vulnerability is independent of the exploit, or actual malware, specifically written by an attacker. Once introduced into a vulnerable system, the malware may require an additional trigger event to begin malicious execution. This allows infection of multiple systems to proceed undetected until a particular date, or remote command, causes the nefarious code to spring forth. The trigger will always appear in the form of data within the infected system.
In the present analysis, the distinction between the vulnerability and exploit may appear to be blurred. A sufficiently knowledgeable adversary may subtly introduce the entire body of malicious code into a large number of different application programs by patiently corrupting core technologies. Using the definitions above, the actual vulnerability is the software design methodology itself, and the exploit could be virtually any piece of commonly used core software.
My primary thesis involves the social engineering that could be used to corrupt otherwise benign and robust software systems. A secondary topic involves the acquired vulnerabilities that have evolved in software development “best practices”. This involves using hardware and software features because “that’s the way you do it”, without any critical reexamination of whether those features actually make any sense in the year 2010, or in the application being developed.
Several of these “evolutionary vulnerabilities” are readily apparent.
1. The use of core open source frameworks by many completely unrelated applications.
2. The programming style that allows and encourages interleaving of distinct objectives within “tight”, “efficient” or “multi-purpose” functions.
3. The use of needlessly compact source notation without redundancy or cross-checks.
4. The practice of allowing access to every data structure that a function MIGHT need to use without explicitly stating that access to a PARTICULAR structure is desired.
5. Allowing the use of unnecessarily similar variable and function names.
6. Operator overloading.
7. Implied namespaces and namespace obfuscation.
8. Conditional compilation mechanisms.
9. The inherent untestability of supporting multiple platforms.
10. Unchecked and unconstrained Pointers.
11. The Stack.
12. Loops that do not look like loops - callbacks and exceptions.
13. Dynamic code creation and execution - interpreters.
14. Portable devices that may operate unmonitored for extended intervals.
15. Assuming that individual developers are experts in multiple programming languages.
16. The vulnerability of different programming languages to naive mistakes.
17. The lack of common version control systems among developers.
18. The lack of a global cross-reference checking facility.
19. The lack of inherent range and bounds checking at runtime.
20. The lack of a central revocation authority.
21. Automatic update systems themselves.
22. The lack of a common threat analysis and notification system.
23. The lack of a mechanism to track the installation of application programs in consumer devices.
24. The lack of a mechanism to notify consumers of potential threats.
25. The vulnerability of critical infrastructure to denial-of-service attacks.
26. Trusted Software Developer Certificates that may easily be circumvented by simply supplying that Trusted developer with malicious tools.
The Stack As An Unnecessary Vulnerability
Since the 1960's the use of a stack-based architecture has been considered a requirement for computer systems. The stack provides a convenient storage area for function parameters, return addresses and local variables. It inherently allows for recursion. It makes exceptions and hardware interrupts easy to implement. It minimizes memory use by sharing a single, dynamic area.
In the world of formal logic, recursion often represents an elegant and compact technique of explaining a complex operation. In the world of computer software it is almost always a serious mistake. There are a few cases in which recursion provides an elegant solution to a problem, but I contend that the risks of allowing universal recursive operations far outweigh the few instances in which any real benefit is derived. Anything that can be done by recursion can be done by iteration, and usually in a much safer and more controlled fashion.
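As a small illustration of that last claim, the Python sketch below computes the nesting depth of a tree of lists twice: once recursively, and once with an explicit work list whose size is bounded by the caller. Both functions and the `max_pending` limit are invented for this example.

```python
def depth_recursive(tree):
    """Depth of a nested-list tree; call-stack use grows with the input."""
    if not isinstance(tree, list):
        return 0
    return 1 + max((depth_recursive(child) for child in tree), default=0)

def depth_iterative(tree, max_pending=10_000):
    """Same result using an explicit, bounded work list instead of the call stack."""
    deepest = 0
    pending = [(tree, 0)]
    while pending:
        node, depth = pending.pop()
        if isinstance(node, list):
            deepest = max(deepest, depth + 1)
            if len(pending) + len(node) > max_pending:
                raise MemoryError("work list bound exceeded")
            pending.extend((child, depth + 1) for child in node)
    return deepest

sample = [1, [2, [3]]]
print(depth_recursive(sample), depth_iterative(sample))   # 3 3
```

The iterative form fails in a controlled way, at a limit the programmer chose, when the input is larger than anticipated. The recursive form fails wherever the runtime's stack happens to end.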
In the absence of recursion, the maximum calling depth can always be computed prior to execution of any given function. In the best case, this could be done with a static calling-tree analysis by the compiler or linker. In the worst case, the program loader must handle calls through dynamic linkages, and the loader must perform the analysis. Knowing the possible calling tree implies that the actual maximum possible memory requirement can also be derived. It thus becomes unnecessary to specify arbitrary stack space allocations. Programs can be treated in a much more deterministic manner.
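A static calling-tree analysis of this kind can be sketched briefly. The Python example below assumes that the call graph and per-function frame sizes are already known, as a compiler or linker would know them. The function names and byte counts are made up for illustration. The analysis walks the graph with an explicit work list, rejects any recursion it finds, and returns the worst-case memory needed along any call path.

```python
def worst_case_stack(call_graph, frame_bytes, entry):
    """Worst-case frame bytes along any call path from `entry`; rejects recursion."""
    cost = {}            # function -> its frame plus its most expensive callee chain
    on_path = set()      # functions whose callees are still being examined
    work = [(entry, False)]
    while work:
        fn, callees_done = work.pop()
        callees = call_graph.get(fn, ())
        if callees_done:
            cost[fn] = frame_bytes[fn] + max((cost[c] for c in callees), default=0)
            on_path.discard(fn)
        elif fn in cost:
            continue                     # already analysed via another caller
        elif fn in on_path:
            raise ValueError(f"recursion detected at {fn!r}")
        else:
            on_path.add(fn)
            work.append((fn, True))      # revisit after all callees are costed
            work.extend((c, False) for c in callees)
    return cost[entry]

graph = {"main": ["parse", "report"], "parse": ["read"], "report": ["format"]}
frames = {"main": 32, "parse": 64, "report": 16, "read": 128, "format": 256}
print(worst_case_stack(graph, frames, "main"), sum(frames.values()))   # 304 496
```

In this made-up program the deepest path needs 304 bytes, against 496 bytes if every frame were allocated separately. An arbitrary stack allocation would simply be a guess at the first number.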
The fallacy of Mixing Data and Code Addresses - Modern hardware implements a single stack for each executable unit. Programs use machine instructions to load function parameters and local variables into memory in the allocated stack area. Call and Return operations use a program address placed in the same stack area. This shared allocation is the vulnerability used by most “Arbitrary Code Execution” exploits. It is completely unnecessary for the return address list to share a memory segment with function parameters and local data. If this “conventional wisdom” were to be thoroughly reexamined, virtually all buffer-overrun exploits would be eliminated at the hardware level. Data could still be wildly corrupted, but the flow of program execution would not be accessible to an attacker.
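The difference between the two layouts can be modeled in a few lines. The Python sketch below is a toy simulation, not real machine behavior: a byte array stands in for stack memory, and the addresses, buffer size and function names are all hypothetical. The same oversized input is copied into a local buffer under each layout.

```python
RETURN_TO = 0x1000   # hypothetical legitimate return address
BUF_SIZE = 8         # size of the local buffer in bytes

def unchecked_copy(region: bytearray, payload: bytes) -> None:
    """Copy with no regard for the buffer's size, limited only by the region itself."""
    for i, byte in enumerate(payload[: len(region)]):
        region[i] = byte

def call_shared(payload: bytes) -> int:
    """Conventional layout: the return address sits just past the local buffer."""
    stack = bytearray(BUF_SIZE) + RETURN_TO.to_bytes(4, "little")
    unchecked_copy(stack, payload)
    return int.from_bytes(stack[BUF_SIZE:], "little")   # where the function "returns"

def call_split(payload: bytes) -> int:
    """Alternative layout: return addresses live in a region the copy cannot reach."""
    return_stack = [RETURN_TO]
    data = bytearray(BUF_SIZE + 4)       # buffer plus 4 bytes of neighbouring data
    unchecked_copy(data, payload)        # neighbouring data may still be corrupted
    return return_stack.pop()

attack = b"A" * BUF_SIZE + (0xDEAD).to_bytes(4, "little")
print(hex(call_shared(attack)), hex(call_split(attack)))   # 0xdead 0x1000
```

In the shared layout the attacker's bytes become the return address. In the split layout the same overrun damages only data, which is the distinction drawn above.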
The fallacy of Necessary Recursion - The vast majority of functions in a modern application have clearly defined, static calling trees. These functions have no need for any recursive features, and any recursion indicates a flaw. The fact that modern languages automatically allow and encourage recursion means that recursion is an Error-Not-Caught in almost all cases. It does not seem unreasonable to require that recursion (both direct and indirect) be indicated by some affirmative notation by the programmer.
The fallacy of Saving Memory - The lack of static calling-tree analysis and the assumption of recursion means that arbitrary-sized segments are allocated to the stack. Arbitrary allocations are always erroneous and lead to the mistaken impression that the software is reliable. No one actually knows how close a system is to a stack overflow situation. The presence of unnecessary memory allocation is a waste of resources and leaves a memory area where undetected malware can reside.
The contention that the stack architecture saves memory is one of the elementary explanations of the appeal of the stack. This might be true if the alternative is a naive implementation in which all function parameters and locals were concurrently allocated from global memory. Calling-tree analysis can be used to allocate parameter frames statically, and yet use only an amount of memory identical to the worst case of the actual calling pattern.
The fallacy of Hardware Interrupts - In order to achieve any degree of security, modern systems always switch stacks when a hardware interrupt is encountered. Thus, it is not necessary that more than a rudimentary allocation be made in the application memory space.
The fallacy of Dynamic Stack Frames - Virtually all modern code computes parameters and pushes them onto the stack prior to a function call. The functions allocate space for local variables by further adjustments to the stack pointer. These dynamically-allocated stack frames are a source of needless, repetitive code that could be eliminated in many cases by static frame allocation and intelligent code optimization. Again, static calling-tree analysis is used to determine the required allocation of these frame areas.
The fallacy of the Memory Dump - It is assumed that memory dumps can be a useful tool to allow crash analysis and code verification. In reality, the use of the stack architecture and its immediate reuse of memory areas for consecutive function calls means that the internal state of any function is destroyed shortly after that function exits. If the stack frames were statically allocated the system would tend to preserve parameters and local variables after the completion of any particular function. The implementations of exception-handling functions (or the dump facility itself) could easily be marked to use frames outside the normal (overlapping) frame area.
The open source development community is an ideal place to implement advanced compiler / linker / loader technology that revises the calling conventions used by modern software. Every application that operates unexpectedly when the calling conventions are changed is an application that was most likely harboring design fallacies that had gone unrecognized. Consider this an opportunity to radically improve all open source software with a single paradigm shift.
Hardware and software systems have grown mostly by accretion over the years. The goal has almost universally been expediency: make it run fast and get it done now! Little thought has been given to mitigating common sources of error, except in academic circles.
Much effort goes into testing, primarily to validate the interoperability of various software modules or systems. In general the goal is to ensure that changes made to a new version do not break features of a previously certified application.
In the biological world, organisms develop resistance to antibiotics through exposure. Malware - whether accidental or intentional - will grow and thrive at the boundaries of the test cases. Such malware may spread in a benign form for long periods, only to be triggered into an active form by a possibly innocuous event.
Recommendations
It has been demonstrated that it will be essentially impossible to exclude the accidental or deliberate introduction of malicious behavior into software during its development and maintenance.
Therefore, instead of trying to control humans and their behavior, it would seem reasonable to treat the software itself as the adversary. If every line of code, piece of data and linked module were considered a threat, it might be possible to develop high quality threat abatement tools that would have a better chance of success than other approaches.
The open source community is the perfect place to develop such mitigation strategies. Proprietary software development efforts lack the resources, and tend to hide, deny and fail to document vulnerabilities. Open source developers have the opportunity to take both white hat and black hat roles. Adding test cases that succeed or fail in different implementations is a valuable contribution to the robustness of any software. Such continuing development of both code and validation cases should be the norm. Improvement should be continuous and incremental, without the need for monthly “Critical Updates” or other disruptive strategies that are unevenly applied and of questionable effectiveness.
1. Software development methodology
a. Require the Designer to provide a complete natural-language functional specification document for all software systems, modules and functions, as well as example test cases.
b. Require software to be written exactly to specification by at least two independent development groups, none of which includes the Designer of the specification. Preferably this will be accomplished in different programming languages.
c. Disallow direct communication between independent development groups.
d. Resolve ambiguities and conflicts between implementations by changes to the specification document, incorporated exclusively by the Designer.
e. Require each development group to provide test cases which are not shared with other development groups.
f. Provide each development group’s software to a Validation group which is not privy to the specifications. The Validation group runs
i. Stress tests with all known test cases,
ii. Stress tests with random inputs,
iii. Stress tests with random structures and data types,
iv. Stress tests in all supported operating environments.
v. Expect all results to be identical from each group.
(1) This implies detecting all changes to global memory and confirming that they are allowed and intended.
(2) Include range and sanity checks for all returned values.
g. Validation group will record all resource utilization, including speed, memory usage, and external communication.
i. Resource utilization, including external memory and references, must be identical.
ii. Every failed validation must be documented and traced to its origin. The nature of the original error must be identified and shared. Repeated problem areas should be studied and mitigation methods developed.
h. One implementation will be chosen for production use, perhaps based on speed, compactness or programming language. The alternative implementations will be available for validation testing of higher-level modules.
i. New features and future versions will start with changes to the specification by the Designer and will end with comparison of recorded resource utilizations.
i. Any changes in resource utilization from one version to the next, especially global references, must be properly confirmed.
2. Stick with one set of development tools. Do not change the core library that your developers use every time a new release comes out. Validation and version control are needlessly complicated if third parties can randomly revise any pieces of your software.
3. Use a version control system that captures every piece of software, tool, source file, header file, library, test file, etc. necessary to build and test each release candidate.
a. Build the final release version on an independent system with a clean OS installation using only the files extracted from the version control system.
b. At the very least, when the inevitable disaster strikes, it will be possible to identify the versions of your software that are affected.
4. Develop a runtime linkage system capable of swapping out implementations of a particular function or module on the fly.
a. In the verification process, this would allow the verification system to generate random switches between implementations and ensure continued correct operation of the system.
b. In the operational case, normally only one implementation of each function would be distributed. This mechanism would allow for the distribution of software updates into running systems without requiring a reboot in many cases.
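The runtime linkage idea in recommendation 4 can be sketched roughly as follows. This is an illustration only, under the assumption that callers reach a function through a named slot rather than a fixed address. The `LinkTable` class, the slot name "gcd" and both implementations are hypothetical and stand in for two independently written versions of one specification.

```python
import random


class LinkTable:
    """Hypothetical runtime linkage table: callers go through a named slot."""

    def __init__(self):
        self._impls = {}   # slot name -> list of registered implementations
        self._active = {}  # slot name -> index of the implementation in use

    def register(self, slot, impl):
        self._impls.setdefault(slot, []).append(impl)
        self._active.setdefault(slot, 0)

    def swap(self, slot, index):
        """Relink a slot while the program keeps running."""
        if not 0 <= index < len(self._impls[slot]):
            raise IndexError(index)
        self._active[slot] = index

    def swap_randomly(self, slot, rng):
        self.swap(slot, rng.randrange(len(self._impls[slot])))

    def call(self, slot, *args):
        return self._impls[slot][self._active[slot]](*args)


def gcd_euclid(a, b):
    """First implementation: Euclid's algorithm with remainders."""
    while b:
        a, b = b, a % b
    return a


def gcd_subtract(a, b):
    """Second implementation: repeated subtraction."""
    if a == 0:
        return b
    if b == 0:
        return a
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a


def reduce_fraction(table, n, d):
    """Higher-level code that never names a specific implementation."""
    g = table.call("gcd", n, d)
    return n // g, d // g


def verify_under_random_switching(table, trials=500, seed=1):
    """Switch implementations at random and require unchanged behavior."""
    rng = random.Random(seed)
    for _ in range(trials):
        n, d = rng.randrange(0, 200), rng.randrange(1, 200)
        table.swap("gcd", 0)
        expected = reduce_fraction(table, n, d)
        table.swap_randomly("gcd", rng)
        if reduce_fraction(table, n, d) != expected:
            return False
    return True


table = LinkTable()
table.register("gcd", gcd_euclid)
table.register("gcd", gcd_subtract)
print("consistent under random switching:", verify_under_random_switching(table))
```

In the operational case only one implementation would be registered per slot, and the same `swap` step is where an update could be linked into a running system.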
“What I tell you three times is true.”
The Hunting of the Snark
- Lewis Carroll
These suggestions may seem onerous, especially to small developers. This type of approach can easily be implemented using only four individuals: a Designer, two Developers and a Validator. These roles may be traded for each different module or feature of a project. Far from increasing effort or time-to-market, it could be argued that the improved documentation, cross-training and more robust final product actually reduce overall development effort. New employees can be of immediate use and can be rapidly integrated into the corporate or community structure by assuming any one of the roles without the need for a lengthy training period.
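As a rough sketch of the Validator's part in recommendation 1.f, the harness below runs independently written implementations against known and random test cases, expects identical results, checks that no implementation alters data it was only supposed to read, and applies a simple sanity check to returned values. The task (returning the distinct values of a list in sorted order), the function names and the deliberately misbehaving third implementation are all assumptions made for illustration.

```python
import copy
import random


def unique_sorted_a(values):
    """Developer group A: built on the language's set and sort."""
    return sorted(set(values))


def unique_sorted_b(values):
    """Developer group B: hand-written de-duplication and insertion sort."""
    out = []
    for v in values:
        if v not in out:
            out.append(v)
    for i in range(1, len(out)):
        j = i
        while j > 0 and out[j - 1] > out[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out


def unique_sorted_sneaky(values):
    """Returns the right answer but also modifies the caller's data."""
    values.sort()
    return sorted(set(values))


def validate(impls, known_cases, random_trials=200, seed=0):
    """Return a list of (implementation name, test case, problem) tuples."""
    rng = random.Random(seed)
    cases = [list(c) for c in known_cases]
    cases += [
        [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        for _ in range(random_trials)
    ]
    failures = []
    for case in cases:
        results = []
        for impl in impls:
            arg = copy.deepcopy(case)
            result = impl(arg)
            # Detect changes to memory the function was not meant to alter.
            if arg != case:
                failures.append((impl.__name__, case, "modified its input"))
            # Sanity check: every returned value must come from the input.
            if any(v not in case for v in result):
                failures.append((impl.__name__, case, "failed sanity check"))
            results.append(result)
        # All implementations are expected to agree exactly.
        if any(r != results[0] for r in results):
            failures.append(("all", case, "implementations disagree"))
    return failures


KNOWN_CASES = [[], [1], [3, 1, 2], [2, 2, 2]]
print("honest pair failures:", len(validate([unique_sorted_a, unique_sorted_b], KNOWN_CASES)))
print("with sneaky failures:", len(validate([unique_sorted_a, unique_sorted_sneaky], KNOWN_CASES)))
```

The third implementation returns correct results in every case, so comparing return values alone would pass it; only the check for unintended changes exposes it.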
Converting software to another language or porting it to different hardware will be greatly simplified by the comprehensive documentation and test cases inherent in this method. Identifying the ramifications of bugs (detected by whatever means) will be more comprehensive and rapid if the development tools allow easy generation of a list of all software and modules that use a given feature.
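Generating a list of everything that uses a given feature amounts to a reverse walk over the dependency graph. The sketch below assumes the build tools can export a table of which module uses which; the module names and the `USES` table are hypothetical.

```python
from collections import deque

# Hypothetical dependency table: module -> modules it uses directly.
USES = {
    "app_gui": ["imaging", "net"],
    "app_cli": ["net"],
    "imaging": ["zlib_wrap"],
    "net": ["tls", "zlib_wrap"],
    "tls": ["bignum"],
    "zlib_wrap": [],
    "bignum": [],
    "docs_tool": [],
}


def affected_by(feature, uses=USES):
    """Return every module that uses `feature`, directly or indirectly."""
    # Invert the table so each module maps to the modules that use it.
    used_by = {}
    for module, deps in uses.items():
        for dep in deps:
            used_by.setdefault(dep, set()).add(module)

    # Breadth-first walk from the faulty feature up to everything above it.
    affected, queue = set(), deque([feature])
    while queue:
        current = queue.popleft()
        for user in used_by.get(current, ()):
            if user not in affected:
                affected.add(user)
                queue.append(user)
    return sorted(affected)


print(affected_by("bignum"))
print(affected_by("zlib_wrap"))
```

A bug reported in the hypothetical `bignum` module would immediately flag the TLS layer, the network layer and both applications for re-validation.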