Monday, September 12, 2011

What is a Source File?

Abstract
Source Files are ubiquitous in the world of software development. Little thought is given to their core technology concept, which is now more than fifty years old. The transition from a desktop-PC to tablet-based computing environment represents a sweeping paradigm shift for software developers. Rethinking the true requirements of software from the Source level onward will allow more modern tools to enter the development arena. This essay presents an analysis of the situation from a historical perspective and proposes new methodologies for use in the future.

Introduction
The concept of the Source File is so fundamental and so ingrained in the lives of software developers that there is seldom any thought given to its true function.

Historically, the program source was simply the human-readable form of input to a language compiler, assemble or translator. The most important aspect of Source Code was that it be able to be fed to the language processor without generating errors. This meant that virtually everything about the process of program creation was geared toward simplifying the task for the compiler. Combined with the early data entry methods - Hollerith cards , teletype machines and paper tape - the idea of a program source became established.

And now we find ourselves, fifty or more years later, expecting program source to look like it could be printed on a teletype machine. Eighty columns (or so). Fixed pitch fonts. Nothing but ASCII characters.
The scope of programming requirements has changed radically from the early days of computing.

Programming languages and methodologies have evolved in many directions - some useful, others not so much. Our largest projects still suffer from the "simplify it for the compiler" mindset. A prime example is the ever-present header files used by most modern programming tools. These redundant and hard-to-maintain files are used to provide descriptions of the required linkages in modular programming. Newer programming environments attempt to deal with the header/linkage problem with added features or conventions that allow the tools to handle much of this drudgery. Handling the header problem is only one of many steps that need to be taken to convert from a "what is easiest for the compiler" to a "what is easiest and least error-prone for the programmer" mentality.

What Are Source Files?
A Source File is a collection of text (variously known a syntactic symbols or tokens) that are either:
  • Human language, as in comments 
  • Directives, usually for the top-level language processor that uses the particular Source
  • Programming Language, the statements or syntactic constructs that we think of as the program itself  
  • Literals, as in data which will be manipulated by the program but which is encoded in some fashion and stored within the Source.
Note that String Literals almost always contain some form of data that could be described as Source Code. For example, the classic "Hello, World!" program used in beginning programming classes is a program which manipulates data in another language - in this case, English. Most string manipulation in modern programs exists solely for building elements of another programming language, which will be fed to another language processor.

In addition, many Source Files today contain sections written in many different programming languages.  For example, an HTML file might include CSS, JavaScript, HTML and English.  Modern text editors attempt (with varying degrees of success) to help sort all this out for the developer.  Highlighting, auto-completion and various warnings are used to help prevent the (sometimes spectacular) errors that result from feeding one language into the processor that is expecting another.  
I will make a brief sidebar to rail against the very poor implementation of what is referred to as "internationalization".  If a modern application needs to be relevant to a world-wide audience it is expected that the developers isolate all locale-specific aspects into separate "resource" files, to use localizable operating system calls that can never be evaluated on the developer's system and use character sets and string manipulation tools to support character sets display, and keyboard modes that also cannot be tested on the developer's equipment.

Internationalization should be inherent in the program development process - not scabbed-on to a final product.  International test cases should always be visible to the developer and the operation and aesthetics of the product should be visible in a simple and robust manner at all times. To achieve this, the adaptive, multi-lingual keyboards on tablet computers, as well as the world-wide collaboration techniques discussed here become critically important.

Consider the notational nightmare that results from trying to pass a literal CSS property name to a JavaScript function that is being invoked from an HTML event handler.  The nesting of escaped quotation marks converts a (relatively) simple concept into something that requires deep understanding by the programmer.  And is thus virtually impossible for independent developers to modify or verify.

Unfortunately, one only gets help from the editor in the static situation such as the hand-written web page.  If the program needs to dynamically generate code (such as SQL queries) it becomes much more difficult to test, debug and verify.  There are virtually no tools that address this dynamic code creation problem, although it is one of the most common tasks performed today.  Many ad-hoc build-a-string techniques are used, but never with any consistency or robust certification.

How Are Source Files Used?

Source files seem to be used in four distinct but interrelated ways.
  • Design
In the Design phase the source is used as part of the collaboration and documentation for the project.  When properly done, this provides a good part of the framework that multiple developers use to ensure accurate understanding of the requirements and to convey details of the implementation as it progresses.
  • Edit
The Edit phase is where most human interaction with the source occurs.  This is where much current effort in man-machine interaction is focused.  Making better editors with greater knowledge of program internals and the possible intentions of the programmer has been the goal since the first Integrated Development Environments (IDEs) were created.

Big, fancy IDEs that need big, fancy displays are moving in the wrong direction.  As the computing world moves away from the PC-centric toward the tablet-based future, our interaction will become much more focused.  The display-and-keyboard will adapt to the operations that we really want to perform, and will retreat from the overwhelming offering of everything we might want to do.  Adaptive keyboards, touch screens and more compact, high resolution presentations provide opportunities to completely rethink the developer interface. 
  • Build
Ultimately, the purpose of the program source is still to create a working application program.  This means that the Programming Language parts of the Source must be submitted to a language processor.  In many cases, this is not a trivial operation.  Many separate functions or modules must be combined, their linkages validated and a final product produced.  This build process can be time-consuming and can result in errors that are the ultimate responsibility of diverse members of the development team.

Expecting the build process to be performed on individual desktop computers is something that needs to be reevaluated as we move into the 21st century era of cloud storage and cloud computing. 
  • Test
After a successful build developers generally expect to test the new application.  This may require transferring the newly created code to a test environment, which may be a particular hardware device or a software simulator.  There should exist a suite of test cases for the "finished" application.  Although some things can be automated, in many cases manual interaction and visual aesthetics will be important factors.  Handling these in a consistent manner, and ensuring that full testing is actually performed for each build or release candidate is a serious problem in the development process.  Maintaining and documenting both known-good and known-bad test cases is of critical importance and can be overwhelming for complex projects.

The source-level run-time debugger is one of the great advances in PC-based computing and is a major feature of all Integrated Development Environments.  Unfortunately, the multiple-programming-language nature of modern programs limits the actual usefulness of the Debugger.  While I occasionally use a debugger for certain compute-only functions, and neophyte programmers really benefit from the capability, I believe that the  problems of client-server interactions, dynamic scripts and cross-platform compatibility are more important challenges that cannot be addressed by a simple debugger.  Therefore, the debugger capability should not be viewed as being of critical importance when evaluating future development tools and environments.

I believe that every function or statement should have available an easily-accessible library of test cases.  This would be sort-of like using a current debugger to run to a breakpoint and then be able to examine and modify the data structures as they are processed.

Knowing the working-set (data structures that are accessed or modified) used by by any particular function is of critical importance to verifying the correctness of a program.  In general, the compilers know this, although sometimes it cannot be determined until run-time.  Unfortunately, this critical knowledge is never made available to the developer.

The test-case library would take the place of ancillary test programs used during the development process.  Currently, such test programs are created, as-needed, by individual developers.  They are never documented, maintained or shared.  Even worse, they are often discarded once the developer feels that his function is "working".

The concept of Assertions in various programming languages are used to catch errors at run-time when selected data does not match expectations.  I would extend the Assertion concept by allowing the capture of a function's working-set - at run-time - and saving it as an addition to the test case library.  This allows the collection of robust sets of real-world data.  When the function is modified or replaced with code that should be equivalent, these collected test cases can provide input to an automated verification process.

What Does the Future Hold?
 
"Keyboards" should be viewed as "Token Selectors" instead of letter-by-letter token builders. IDEs try to do this in some cases, but fail because of (1) the multiple-language problem, and (2) the "word-ish" nature of tokens needed to convey meaning to people.

A single font size doesn't fit all.  Nested structures could be displayed with smaller fonts, so you could literally zoom in to move deeper into the code.  Current tree-collapsing displays give an essentially useless all-or-nothing display.

Need more screen space?  Get a second iPad.  Properly done, this could not be more expensive than multiple monitors and should be much more useful in the general sense.

All source should reside in the cloud.  Current Version Control Systems make much ado about "Checking Out" and "Committing" changes to modules.  This makes it (intentionally) impossible for multiple developers to share work on individual modules.  Storing all development trees in the cloud and changing the checkout process from a "download it to my computer" to a "mark this group of changes as pending" in the cloud concept would be a great improvement.

All Build operations should be performed in the cloud.  There should be no need for any compute-intensive operations at a user's desktop.  This allows high-powered, dynamically-allocated resources to be brought to bear on what should be an independent background task. 

What Could Replace Source Files?

Given the ways current source files are used, the multiple-language problem, the need for collaborative work, and the trend away from desktop computing it seems obvious that we are actually referring to a Database application.  We have been using the computer's simple file system to store and access our programs, even though that has never been a particularly appropriate technique.

What is needed is a Cloud-based database storage system which is accessed by multiple Very Thin Clients (read iPad Apps) that provide the user interaction during the software development process.  This inherently allows world-wide collaboration among the development team.

The compile and build process is performed as a Cloud-based service which provides additional elements to the Database.  These elements would include error and diagnostic information, compiled code, linkage information and target application code.  Automated test cases could be run and results verified.  All these capabilities would be available to every development team member, without having to have separate instances of hardware, tools, environments, etc.

I have some very specific ideas about how such a specialized software development database would be structured, implemented and accessed.  Of course, it has nothing to do with relational databases, SQL, or  traditional Client-Server techniques.

 I Am Requesting Some Feedback

For now, I am interested in feedback from as diverse a set of developers as I can manage.

Specifically, please let me know how you develop you code.

What types of applications do you develop?

What hardware do you use?

What size screens do you need?

What applications do you need to run?  Editors? Compilers? Debuggers? Simulators? Browsers? Documentation tools? Email?  Chat?

How many things do you try to do at once?  How many things do you absolutely need to do at once?

If you were going to work using a tablet computer, what would you see as advantages?

What would be disadvantages, even given your idea of a perfect tablet-of-the-future?

What do you think would be impossible to do using a tablet?


No comments:

Post a Comment