- Languages and Machines
- Compilers and Higher Level Languages
- Modules: Software Decomposition
- Combining Physical Modules: Static and Dynamic Linking
- Related Articles:
- Derivative Works
In this article, we describe and define a number of concepts underlying modern computer systems technology. Understanding these concepts is central to resolving a number of legal issues that arise in the context of open, hybrid, and proprietary models of software development and distribution.
Languages and Machines
Machine code is a sequence of instructions expressed in the native representation of a particular computing machine. By native, we mean that the target machine can consume and execute the instructions contained in a particular sequence of machine code with a minimum of effort. A sequence of machine instructions is sometimes called a program or executable. Machines are generally defined by reference to an abstraction, called an instruction set architecture, or ISA. The ISA is an interface to the machine, meaning that it defines the format (or syntax) and meaning (or semantics) of the set of instructions that a given machine can execute. The syntax and semantics that defines the set of allowable machine codes and their meanings is sometimes called machine language.
The decision as to how best implement a given ISA is to a large degree an arbitrary one. Historically, ISAs have been implemented in hardware, meaning that the logic required to consume, decode, and execute machine codes that comply with a given ISA is implemented by means of a fixed circuit. However, this mode of implementation is by no means required. A given ISA may also be implemented by means of a software program, and such an implementation is sometimes called a virtual machine. This software program will itself of course just be a sequence of machine codes that can be understood by another machine, that may or may not also be implemented in hardware or software. For instance, it is possible (and in fact common, nowadays) to implement a given machine that can execute programs written in machine language X by writing a program in machine language Y that runs on (is consumed and executed by) a machine that is implemented in hardware.
The good thing about machine languages is that they are typically relatively simple in terms of syntax and semantics. The set of allowable machine instructions is generally small, and the allowable formats for expressing machine instructions are typically few. These limitations make it relatively easy to build (particularly in hardware) a machine that implements a given machine language.
The bad thing about machine languages is the same as the good thing—that is, that they are very simple. This makes them inefficient modes of communication, at least from the perspective of someone who wishes to write programs that do interesting things. Intuitively, a language containing only 100 words will be much less expressive than one containing 100,000 allowable words, because it should take fewer words to express a given concept in the latter (large) language than in the former (small) language. Also, because machine languages are designed to be easily read and decoded by mechanical means, they are often represented by way of alphabets that are not natural for humans to read or write. The most common format used is binary—that is, all machine codes are represented only by strings of ones and zeros. The choice of alphabet is again an arbitrary decision. Historically, however, binary has ruled because it is relatively straightforward to build hardware circuits that function on high and low voltages which map nicely to ones and zeros.
Compilers and Higher Level Languages
These drawbacks with machine code led computer scientists to write programs (initially, painstakingly in machine code) to translate programs written in higher-level programming languages into programs consisting only of machine code. These programs are called compilers. A compiler is just a program that takes as input a code sequence in one language and translates and outputs a corresponding code sequence in second language. That second language will typically be machine code, but it need not be. The second language is often called the source language and is typically of a much higher level than machine language. That is, its alphabet is comprised of symbols that are familiar to humans, such as alphanumeric characters; its built-in commands or instructions will be represented in words (e.g. "ADD") or symbols (e.g. "+") that have conventional meanings; and they provide for the straightforward expression of useful, commonly occurring programming idioms, such as repeated execution or functional decomposition.
Modules: Software Decomposition
There are two ways to think about program decomposition. First, all modern programming languages provide some mechanism for logical decomposition. The core principle behind logical decomposition is the separation of interface (sometimes also called specification) from implementation (sometimes also called representation). The interface to a module is everything and anything that a client of that module needs to know to use the module. The interface is a description of what the module can do. The implementation of a module, on the other hand, is the inner workings (generally data structures and instructions) that actually perform the work promised by the interface description. Consider, for example, a television. The interface consists of the switches, knobs, and plugs on the outside of the television. The implementation consists of the guts of the television. As users, we need only to understand the interface: how to plug it in, turn it on, and twiddle the knobs. We don't need—indeed, probably don't want—to know how the television works its magic. By decoupling the interface from the implementation, separate parties can work on separate parts of large system by knowing only about agreed-upon interfaces. Television set designers only need to know about the interface to the electrical system—the shape of the plug and the voltage provided—in order to design a television set that any consumer will be able to plug into their wall socket. They do not know or care about how the electricity is generated, where it comes from, and so on.
The most basic form of logical decomposition is procedural or functional decomposition. Every modern, high level programming language supports functional decomposition in some manner. A function is just an abstraction for a set of instructions to perform a common task, which has been given a convenient name by the programmer. For instance, a programmer may create (define) a function called area that knows how to compute the area of a rectangle. In defining this function, that programmer has in some sense added a new word to the vocabulary of the programming language. The programmer can later invoke the function simply by using that new word in their program. Other forms of logical decomposition include modules, abstract data types, class definitions, and packages. All modern, high level programming languages support one or more of these forms of logical decomposition. While the details of these forms is beyond the scope of the article, they are all methods for creating software components that are larger than single functions or procedures.
In addition to logical decomposition, modern programming languages allow for physical decomposition. The most common mechanism for performing this kind of decomposition is by breaking a single large program into a series of files. Each file will typically contain one or more logical program components. For example, a single file might contain a group of related functions that can then be called from within other source files. Some mechanism for program decomposition is central to the ability to build large programs. Without it, it is hard to imagine how two or more programmers could work on a single program at the same time.
Combining Physical Modules: Static and Dynamic Linking
Program decomposition, however, complicates the process of source to machine code translation. In a world where every program is contained in a single source file, the compiler knows everything it needs to know to translate the program from source to machine code. In a world where a program is broken down into two or more source files, the compiler will not know how to properly translate function invocations that reference functions that are defined in other source files. At a high level, a compiler needs to translate a function invocation into a "jump" to a different instruction address within the executable program. However, the compiler cannot know the machine address of a defined function unless it knows the layout of the entire executable, which it cannot know until all the source files have been compiled and merged.
The typical solution to this problem is simply to defer the resolution of function names to machine addresses to another program, known as a static linker. In this scheme, the compiler translates a given source file into a corresponding object file. An object file contains machine code, as well as some information about functions and other names that are defined by (or functions and other names that are referenced by) the given object file. A second program, called a linker, is then used to make a final resolution of names to numerical addresses. At a high level, the linker takes all of the object files, and for each object file stitches up unresolved references to names that are defined in other object files. After stitching up these names, it concatenates the various machine code sequences into one long machine code sequence, which is now fully executable by the target machine. If it encounters names that are not defined in any object file, it typically reports an error, because this means that the programmer of some object file has attempted to invoke a function or otherwise reference names that are not defined in any object file that comprises the executable.
Static linking works well until we consider that many programs share a considerable amount of standard code. This commonly used, shared code is typically placed into libraries, which are seldom changed, well-known repositories for useful functions that are used by a wide variety of programs. Examples include libraries for performing input and output (from keyboards and other input devices and to the screen or other output devices), mathematical functions (such as sine, cosine, logarithms, etc), and so on. One way to think about a library is as a widely-advertised, well-known object file.
Consider what happens when a number of programs are each linked against a given library. First, this approach is wasteful in terms of disk space, because each program contains a section of identical code. While this may not seem significant in today's world of large disks, it does begin to pose a problem on systems with limited storage, where hundreds or even thousands of programs are all statically linked with a given library or libraries. Second, suppose that the vendor wishes to fix a bug in a given library. The vendor can redistribute the library, but then the end-users would have to re-link that library to each of the hundreds or thousands of programs that may use it. Alternately, the vendor can relink the programs for the user, but they may not be in a position to do so if some of the programs are licensed from third parties. And even if the vendor were willing to relink, they would now need to re-distribute and install a large number of programs, rather than just a single library.
The solution to these problems is to employ dynamic linking. In a world of dynamic linking, programs are not linked until run-time—that is, external name references are left unresolved until the program is loaded, or even until a given function is referenced for the first time. Dynamically linked programs incur some cost at run time—either due to longer startup times because the linking is done when the program is loaded, or as a small cost each time an unresolved name is referenced for the first time. The benefit of dynamic linking, of course, is that multiple programs can now truly share the same library—each system will only have a single copy of each dynamically linked library shared by a plurality of running programs. For this reason, dynamically linked libraries are sometimes also called shared libraries (in the *N*X world, for instance). Another benefit is that upgrading a dynamically linked library (e.g. due to a bug-fix) is often as simple as just replacing the library on disk, and restarting the system. Because programs are automatically linked each time they are run, no tedious static re-linking of programs is required.
Static and dynamic linking are instances of a more general concept known as binding. Binding can be thought of as the process of associating, in some way, names with values in a software system. Systems can be distinguished based on when they perform various kinds of binding. In the case of static linking, function names are bound to numerical function addresses more or less at compile time (or more precisely, after compile time, but before run-time). In the case of dynamic linking, binding doesn't take place until run-time.
Internet domain name resolution is another example of the binding concept. The Domain Name Service (DNS) allows textual, easy-to-remember domain names (like "yahoo.com") to be translated into numerical IP addresses at run-time. Deferring binding has the benefit of making systems more flexible at the cost of some run-time overhead. In the DNS example, imagine if domain names were translated at compile time. In that case, anytime someone wanted to re-map a domain name to a new IP address, all programs that referenced that name would have to be re-compiled. Moreover, a large class of networking programs, such as web-browsers, would be dependent upon relatively static associations between domain names and IP addresses, greatly limiting their utility.
Object oriented programming languages are another example of systems that rely on late binding. One of the hallmark features of object oriented programming languages is subtype polymorphism. A programming language that is subtype polymorphic allows a programmer to write code that will function with data objects of a type that is not known until runtime. Object oriented programming languages do not depend on knowing the particular representation of objects at compile time. This means that programmers may write code that is highly abstract, in that it works with many different (concrete) types of objects that all share common interfaces. This flexibility and expressiveness, however, comes at the cost of deferring binding until runtime. Specifically, the process of looking up and determining the correct behavior for a given object of a particular type is deferred until runtime. This process is in many ways analogous to that of dynamic linking that occurs on a "demand basis" (as opposed to dynamic linking that takes place at load-time). It differs from demand-linking, however, because object oriented systems allow for the binding to be different every time a particular function is invoked—in other words, the particular behavior for a given object must be (in the worst case) re-determined every time that behavior is invoked (because it is not known if the current object is or will be the same as the subsequent object).
Binding, modularity, and interfaces may seem like abstract computer science concepts, but as other articles on this website will demonstrate (see GPL's treatment of derivative works, for example), they are critical to correctly resolving legal issues that arise in this context.
The author would take the following three books onto a desert island with him:
- Aho, Alfred V. & Ullman, Jeffrey D. (1992), The Foundations of Computer Science, Computer Science Press.
- Abelson, Harold & Gerald Jay Sussman with Julie Sussman (1985), Structure and Intrpetation of Computer Programs, The MIT Press.
- Patterson, David A. & John L. Hennessy (1998), Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann.