It was recognized that there was a premium on being able to get the compiler up and working quickly; it was also felt, however, that this was in many ways less important than being able to evolve and tune the compiler into a high-quality product as time went on. Particularly with operating system code, a "quick and dirty" implementation is simply unacceptable.
It was also recognized that the compiler was likely to be applied to machines not well understood by the compiler writer that might have inadequate or nonexistent debugging facilities. Therefore, one goal of the compiler was to permit it to be largely self-checking. Rather than produce incorrect code, we felt it far preferable for the compiler to detect its own inadequacies and reject the input program.
This goal was largely met. The bug level has remained low, even as the compiler has begun to be more carefully tuned; many of the bugs have resulted from human error.
Several techniques contribute considerably to the general reliability of the compiler. First, a conscious attempt was made to separate the facts about the machine from the heuristics that decide how those facts are applied. Thus, as the compiler evolves, more effort can be put into improving the heuristics and the recognition of important special cases, while the underlying knowledge about the machine operations need not be altered.
This approach also improves portability, since the heuristic programs often remain largely unchanged among similar machines, while only the detailed knowledge about the format of the instructions encoded in a table changes. During compilation of expressions, a model of the state of the compilation process, including the tree representing the expression being compiled and the status of the machine registers, is maintained by the compiler. As instructions are emitted, the expression tree is simplified.
The possible transformations constitute the "facts" about the machine; the order in which they are applied corresponds to the heuristics. When the input expression has been completely transformed into nothing, the expression has been compiled. Thus, a good portion of the initial design of a new version of the compiler is concerned with making the model within the compiler agree with the actual machine by building a table of machine operations and their effects on the model.
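As a rough illustration of the scheme, consider the toy sketch below; it is not the portable C compiler, and the node kinds, the three-instruction "machine," and every name in it are invented. A small table of machine facts drives a routine that works over the expression tree bottom-up: each interior node that matches a table entry causes an instruction to be emitted and the compiled subtree to be treated thereafter as a single register, and if no entry applies the program rejects its input rather than emit incorrect code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Kinds of nodes in the expression tree. */
enum kind { CONST, REG, ADD, MUL };

struct node {
    enum kind op;
    int val;                        /* constant value or register number */
    struct node *left, *right;      /* NULL for leaf nodes */
};

/* One "fact" about the machine: the operator handled and the mnemonic emitted. */
struct fact {
    enum kind op;
    const char *insn;
};

static const struct fact machine[] = {
    { ADD, "add" },
    { MUL, "mul" },
};

static int nextreg;                 /* trivial register "allocator" */

/* Rewrite a leaf into a register, emitting a load for a constant. */
static int toreg(struct node *p)
{
    if (p->op == CONST) {
        int r = nextreg++;
        printf("\tmov\t$%d,r%d\n", p->val, r);
        return r;
    }
    return p->val;                  /* already a register */
}

/* Simplify the tree bottom-up, consulting the table at each interior node. */
static int gen(struct node *p)
{
    int i, rl, rr;

    if (p->left == NULL)
        return toreg(p);

    rl = gen(p->left);
    rr = gen(p->right);

    for (i = 0; i < (int)(sizeof machine / sizeof machine[0]); i++)
        if (machine[i].op == p->op) {
            printf("\t%s\tr%d,r%d\n", machine[i].insn, rr, rl);
            return rl;              /* the subtree has been rewritten into rl */
        }

    /* No transformation applies: reject rather than emit incorrect code. */
    fprintf(stderr, "no code table entry for operator %d\n", (int)p->op);
    exit(1);
}

int main(void)
{
    /* The expression (1 + 2) * 3, built by hand. */
    struct node one   = { CONST, 1, NULL, NULL };
    struct node two   = { CONST, 2, NULL, NULL };
    struct node three = { CONST, 3, NULL, NULL };
    struct node sum   = { ADD, 0, &one, &two };
    struct node prod  = { MUL, 0, &sum, &three };

    int r = gen(&prod);
    printf("\t# result in r%d\n", r);
    return 0;
}
```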
When this is done correctly, one has a great deal of confidence that the compiler will produce correct code, if it produces any at all. Another useful technique is to partition the code generation job into pieces that interact only through well-defined paths.
One module worries about breaking up large expressions into manageable pieces, and allocating temporary storage locations when needed. Another module worries about register allocation. Finally, a third module takes each "manageable" piece and the register allocation information, and generates the code. The division between these pieces is strict; if the third module discovers that an expression is "unmanageable," or a needed register is busy, it rejects the compilation.
The division enforces a discipline on the compiler which, while not really restricting its power, allows for fairly rapid debugging of the compiler output. The most serious drawback of the entire approach is the difficulty of proving any form of "completeness" property for the compiler--of demonstrating that the compiler will in fact successfully generate code for all legal C programs. Thus, for example, a needed transformation might simply be missing, so that there might be no way to further simplify some expression.
Alternatively, some sequence of transformations might result in a loop, so that the same expression keeps reappearing in a chain of transformations. The compiler detects these situations by realizing that too many passes are being made over the expression tree, and the input is rejected. Unfortunately, detection of these possibilities is difficult to do in advance because of the use of heuristics in the compiler algorithms.
Currently, the best way of ensuring that the compiler is acceptably complete is by extensive testing. The compiler, assembler, and loader ran on the PDP-11, and the resulting executable files were transferred to the Interdata for testing. Primitive routines permitted individual characters to be written on the console. In this environment, the basic stack management of the compiler was debugged, in some cases by single-stepping the machine.
This was a painful but short period. After the function call mechanism was working, other short tests established the basic sanity of simple conditionals, assignments, and computations. At this point, the stand-alone environment could be enriched to permit input from the console and more informative output such as numbers and character strings, so ordinary C programs could be run. We solicited such programs, but found few that did not depend on the file system or other operating system features.
Some of the most useful programs at this stage were simple games that pitted the computer against a human; they frequently did a large amount of computing, often with quite complicated logic, and yet restricted themselves to simple input and output. A number of compiler bugs were found and fixed by running games. After these tests, the compiler ceased to be an explicit object of testing, and became instead a tool by which we could move and test the operating system. Some of the most subtle problems with compiler testing come in the maintenance phase of the compiler, when it has been tested, declared to work, and installed.
At this stage, there may be some interest in improving the code quality as well as fixing the occasional bug. An important tool here is regression testing; a collection of test programs is saved, together with the previous compiler output. Before a new compiler is installed, it is fed these test programs, the new output is compared with the saved output, and differences are noted. If no differences are seen, and a compiler bug has been fixed or an improvement made, the testing process is incomplete, and one or more test programs are added.
If differences are detected, they are carefully examined. The basic problem is that frequently, in attempting to fix a bug, the most obvious repair can give rise to other bugs, frequently breaking code that used to work. These other bugs can go undetected for some time, and are very painful both to the users and the compiler writer.
Thus, regression tests attempt to guard against introducing new bugs while fixing old ones. The portable compiler is sufficiently self-checked that many potential compiler bugs were detected before the compiler was installed by the simple expedient of turning the compiler loose on a large amount (tens of thousands of lines) of C source code. Many constructions turned up there that were undreamed of by the compiler writer, and often mishandled by the compiler.
It is worth mentioning that this kind of testing is easily carried out by means of the standard commands and features in the UNIX system. In particular, C source programs are easily identified by their names, and the UNIX shell provides features for applying command sequences automatically to each of a list of files in turn.
Moreover, powerful utilities exist to compare two similar text files and produce a minimal list of differences. Finally, the compiler produces assembly code that is an ordinary text file readable by all of the usual utilities. Taken together, these features make it very simple to invent test drivers. For example, it takes only a half-dozen lines of input to request a list of differences between the outputs of two versions of the compiler applied to tens or hundreds of source files.
Perhaps even more important, there is little or no output when the compilers compare exactly. On many systems, the "job control language" required to do this would be so unpleasant as to ensure that it would not be done. Even if it were, the resulting hundreds of pages of output could make it very difficult to see the places where the compiler needed attention.
The design of the portable C compiler is discussed more thoroughly in Ref. We were favorably impressed, even in the early stages, by the general ease with which C programs could be moved to other machines. Some problems we did encounter were related to weaknesses in the C language itself, so we undertook to make a few extensions.
C had no way of accounting in a machine-independent way for the overlaying of data. Most frequently, this need comes up in large tables that contain some parts having variable structure. As an invented example, a compiler's table of constants appearing in a source program might have a flag indicating the type of each constant followed by the constant's value, which is either integer or floating.
The C language as it existed allowed sufficient cheating to express the fact that the possible integer and floating values might be overlaid (both would not exist at once), but it could not be expressed portably because of the inability to express the relative sizes of integers and floating-point data in a machine-independent way.
Therefore, the union declaration was added; it permits such a construction to be expressed in a natural and portable manner. Declaring a union of an integer and a floating point number reserves enough storage to hold either, and forces such alignment properties as may be required to make this storage useful as both an integer and a floating point number.
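For instance, the invented constant-table entry described earlier might be declared as follows; this is only a sketch, and the member names and flag values are made up for illustration.

```c
#include <stdio.h>

/* A table entry for a constant: a flag giving its type, followed by a
 * value that is either integer or floating, overlaid in a union. */
struct constant {
    int flag;                   /* says which member of the union is valid */
    union {
        long ival;              /* the value when the constant is integral */
        double fval;            /* the value when the constant is floating */
    } val;                      /* storage big enough, and aligned, for either */
};

#define C_INT   0
#define C_FLOAT 1

/* Use the overlaid storage as integer or floating point by referring to
 * the member that matches the flag. */
void prconst(struct constant *cp)
{
    if (cp->flag == C_INT)
        printf("%ld\n", cp->val.ival);
    else
        printf("%g\n", cp->val.fval);
}

int main(void)
{
    struct constant c;

    c.flag = C_FLOAT;
    c.val.fval = 2.5;
    prconst(&c);                /* prints 2.5 */

    c.flag = C_INT;
    c.val.ival = 42;
    prconst(&c);                /* prints 42 */
    return 0;
}
```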
This storage may be explicitly used as either integer or floating point by referring to the appropriate member of the union. Another addition was the typedef facility, which in effect allows the types of objects to be easily parameterized. Unlike some languages, C does not permit definition of new operations on these new types; the intent was increased parameterization rather than true extensibility.
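A minimal example of the facility; the type name and its underlying representation here are invented, and the point is only that the representation is written down once.

```c
typedef long filesize_t;        /* the type used throughout for file sizes */

/* Round a size up to a multiple of some unit.  The arithmetic reads the
 * same on every machine; only the typedef need change if some machine
 * wants a different underlying representation. */
filesize_t roundup(filesize_t n, filesize_t unit)
{
    return ((n + unit - 1) / unit) * unit;
}
```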
Although the C language did benefit from these extensions, the portability of the average C program is improved more by restricting the language than by extending it. Because it descended from typeless languages, C has traditionally been rather permissive in allowing dubious mixtures of various types; the most flagrant violations of good practice involved the confusion of pointers and integers.
Some programs explicitly used character pointers to simulate unsigned integers; on the PDP-11 the two have the same arithmetic properties. Type unsigned was introduced into the language to eliminate the need for this subterfuge. More often, type errors occurred unconsciously. For example, a function whose only use of an argument is to pass it to a subfunction might allow the argument to be taken to be an integer by default.
If the top-level actual argument is a pointer, the usage is harmless on many machines, but not type-correct and not, in general, portable. Violations of strict typing rules existed in many, perhaps most, of the programs making up the entire stock of UNIX system software.
Yet these programs, representing many tens of thousands of lines of source code, all worked correctly on the PDP-11 and in fact would work on many other machines, because the assumptions they made were generally, though not universally, satisfied.
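An invented fragment, written in the old style being described, shows the kind of construction involved; it is offered as an example of the problem, not as code to imitate. The parameter of hand is silently taken to be an int, yet every caller hands it a character pointer; on machines where pointers and integers have the same size this works, and elsewhere it may not.

```c
#include <stdio.h>

int length();                   /* old-style declaration: argument types unchecked */

/* hand() only passes its argument along, so the parameter was left to
 * default to int, even though callers always supply a pointer. */
int hand(arg)                   /* arg is silently taken to be an int */
{
    return length(arg);
}

int length(s)
char *s;
{
    int n = 0;
    while (*s++)
        n++;
    return n;
}

int main()
{
    /* Passing a pointer where an int is expected: harmless on many
     * machines, but not type-correct and not, in general, portable. */
    printf("%d\n", hand("portability"));
    return 0;
}
```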
It was not feasible simply to declare all the suspect constructions illegal. Instead, a separate program was written to detect as many dubious coding practices as possible. This program, called lint , picks bits of fluff from programs in much the same way as the PFORT verifier mentioned above.
C programs acceptable to lint are guaranteed to be free from most common type errors; lint also checks syntax and detects some logical errors, such as uninitialized variables, unused variables, and unreachable code. There are definite advantages in separating program-checking from compilation. First, lint was easy to produce, because it is based on the portable compiler and thus shares the machine-independent code of the first pass with the other versions of the compiler.
More important, the compilers, large programs anyway, are not burdened with a great deal of checking code which does not necessarily apply to the machine for which they are generating code. A good example of extra capability feasible in lint but probably not in the compilers themselves is checking for inter-program consistency. The C compilers all permit separate compilation of programs in several files, followed by linking together of the results. Because each file is compiled by itself, the compiler cannot check that the way a function is called in one file is consistent with its definition in another; lint, which examines all the files together, can.
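A sketch of the kind of inconsistency involved; the file names and functions are invented. Each file below is legal by itself, so separate compilation accepts both, but lint, reading the two together, reports that sum is called with the wrong number and type of arguments.

```c
/* file1.c */
int sum(a, b)
int a, b;
{
    return a + b;
}

/* file2.c */
extern int sum();

int total()
{
    double d = 1.0;
    return sum(d);              /* one argument, and a double at that */
}
```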
Finally, lint itself is a portable program, identical on all machines. Although care was taken to make it easy to propagate changes in the machine-independent parts of the compilers with a minimum of fuss, it has proved useful for the sometimes complicated logic of lint to be totally decoupled from the compilers. This kind of separation of function is characteristic of UNIX programs in general. The compiler's one important job is to generate code; it is left to other programs to print listings, generate cross-reference tables, and enforce style rules.
The UNIX operating system kernel, or briefly the operating system, is the permanently resident program that provides the basic software environment for all other programs running on the machine. It implements the "system calls" by which users' programs interact with the file system and request other services, and arranges for several programs to share the machine without interference. The structure of the UNIX operating system kernel is discussed elsewhere in this issue [18, 19].
To many people, an operating system may seem the very model of a nonportable program, but in fact a major portion of UNIX and other well-written operating systems consists of machine-independent algorithms: how to create, read, write, and delete files, how to decide who to run and who to swap, and so forth.
If the operating system is viewed as a large C program, then it is reasonable to hope to apply the same techniques and tools to it that we apply to move more modest programs. The UNIX kernel can be roughly divided into three sections according to their degree of portability. At the lowest level, and least portable, is a set of basic hardware interface routines written in assembly language. Most of them are machine-independent in specification, although not in implementation.
Other assembly-language routines are not called explicitly but instead intercept interrupts, traps, and system calls and turn them into C-style calls on the routines in the rest of the operating system.
Each time UNIX is moved to a new machine, the assembly language portion of the system must be rewritten. Not only is the assembly code itself machine-specific, but the particular features provided for memory mapping, protection, and interrupt handling and masking differ greatly from machine to machine.
The assembly-language portion of the system was the source of many of the problems encountered in the port. One reason for this is certainly the usual sorts of difficulties found in assembly-language programming: we wrote loops that did not loop or looped forever, garbled critical constants, and wrote plausible-looking but utterly incorrect address constructions.
Lack of familiarity with the machine led us to incorrect assumptions about how the hardware worked, and to inefficient use of available status information when things went wrong. Finally, the most basic routines for multi-programming, those that pass control from one process to another, turned out, after causing months of nagging problems, to be incorrectly specified and, in fact, impossible to implement correctly on the Interdata, because they depended improperly on details of the register-saving mechanism of the calling sequence generated by the compiler.
These primitives had to be redesigned; they are of special interest not only because of the problems they caused, but because they represent the only part of the system that had to be significantly changed, as distinct from merely expressed properly, to achieve portability. The second section of the kernel consists of the device drivers. These programs are, of course, machine-dependent, since the devices are. Even so, the drivers caused far fewer problems than did the assembly-language programs. They already had working models on the PDP-11, and we had faced the need to write new drivers several times in the past (there are half a dozen disk drivers for the various kinds of hardware attached to the PDP-11). In adapting to the Interdata, the interface to the rest of the system survived unchanged, and the drivers themselves shared their general structure, and even much code, with their PDP-11 counterparts.
The problems that occurred seem more related to the general difficulty of dealing with the particular devices than to the difficulty of expressing what had to be done.
The third and remaining section of the kernel is the largest. This is the operating system proper, and clearly represents the bulk of the code. We hoped that it would be largely portable, and as it turned out our hopes were justified. A certain amount of work had to be done to achieve portability. Most of it was concerned with making sure that everything was declared properly, so as to satisfy lint, and with replacing constants by parameters. For example, macros were written to perform various unit conversions previously written out explicitly: byte counts to memory segmentation units and to disk blocks, etc.
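The macros were roughly of the following kind; this is a hedged sketch, and the names btod and btos as well as the particular block and segmentation-unit sizes are invented rather than quoted from the system.

```c
/* Sizes for one hypothetical machine; only these lines change for another. */
#define BSIZE   512     /* bytes per disk block */
#define SGSIZE  64      /* bytes per memory segmentation unit */

/* Unit conversions, rounding up, in place of arithmetic written out
 * explicitly at every point where it is needed. */
#define btod(n) (((n) + BSIZE - 1) / BSIZE)     /* byte count to disk blocks */
#define btos(n) (((n) + SGSIZE - 1) / SGSIZE)   /* byte count to segmentation units */
```

A use such as btod(size) then reads the same on every machine.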
The important data types used within the system were identified and specified using typedef: disk offsets, absolute times, internal device names, and the like.
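In style, the declarations were along the following lines; the names echo the ones UNIX eventually settled on, but the underlying widths shown here are only an example.

```c
typedef long  daddr_t;          /* a disk offset (block number) */
typedef long  time_t;           /* an absolute time */
typedef short dev_t;            /* an internal device name */

/* Objects throughout the system are declared with these names, so the
 * representation can be changed in one place if a new machine requires it. */
daddr_t blkno;
time_t  boottime;
dev_t   rootdev;
```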
Treating the Portable C Compiler as a Product Family (Johnson and Ritchie): What are the Commonalities?

BCPL, B, and C are similar in several ways. All three are system-programming languages; they are compactly described and relatively small, and they can be translated by relatively simple compilers. They use library routines for input and output and for other interactions with the operating system, and they rely on library procedures to supply control structures such as procedure closures and coroutines. Because their abstractions lie at a high level, they achieve portability between machines.

The languages also differ. B and C have no nested procedures; in BCPL, procedures can be nested, but they cannot refer to non-static objects defined in containing procedures. BCPL and C both provide a means through which text from a named file can be included, but B does not. BCPL's lexical and syntactic mechanisms are more regular and elegant than those of B and C. B and C end their statements with semicolons, whereas in BCPL most semicolons after statements ending at a line boundary may be elided, which is a convenience.
Works Cited: S. C. Johnson and D. M. Ritchie, "UNIX Time-Sharing System: Portability of C Programs and the UNIX System," Bell System Technical Journal, vol. 57, 1978.

The direct benefit of portability is that it is normal for Unix software to outlive its original hardware platform, so tools and applications don't have to be reinvented every few years. Today, applications originally written for Version 7 Unix are routinely used not merely on Unixes genetically descended from V7, but on variants like Linux in which the operating system API was written from a Unix specification and shares no code with the Bell Labs source tree.
The indirect benefits are less obvious but may be more important. The discipline of portability tends to exert a simplifying influence on architectures, interfaces, and implementations. This both increases the odds of project success and reduces life-cycle maintenance costs.
In this chapter, we'll survey the scope and history of Unix standards.