Arrays for C
Early draft proposal

John Nagle
Animats


Safer arrays for C - a proposal
(DRAFT 0.4)

 

Abstract

For many years, C programs have suffered from pointer problems. We offer a way out which is backwards compatible.The goal here is to extend C in a way which allows efficient subscript checking without losing backwards compatibility and without impacting performance significantly.

Introduction

Every day, as they have for decades, millions of computers around the world crash. Most of these crashes occur due to flaws in code written in the C programming language. The errors which cause the crashes usually involve unsafe pointer operations.

This problem is not inherent in programming at the level of C. Specific flaws in the pointer model of the C language are the source of the difficulty.. Ada, for example, is not afflicted with this problem. Yet because of the enormous installed base of C code, this problem has seemed insoluble in practice. It is not. There is a way out of this mess.

We offer a way out which is backwards compatible.The goal here is to extend C in a way which allows efficient subscript checking without losing backwards compatibility and without impacting performance significantly.

Previous work

We've chosen to address this problem for C, rather than C++. This is for practical reasons. A sizable portion of the open source community has rejected C++ to stay with C. The Linux kernel remains a C program, not a C++ program.

C++ attempts to address the memory safety issue, but the mechanism chosen is to paper over the problem by encapsulating C arrays inside collection classes defined by templates. Unfortunately, the STL collection classes "leak"; they allow access to C pointers which can be and which are misused. The prevalance of libraries and system calls with C interfaces also requires that direct access to C arrays be provided in C++.

Efforts have been made to develop useful static analysis tools to detect subscript errors in C (not C++). ***MORE***

The problem: Array == Pointer

In C, the declarations

float tab[ ];
and
float* tab;

are equivalent. This has its uses, but since no size information, or even whether a pointer refers to an array or a single element, is carried along with an array, it is a source of error.

C does, of course, allow

float tab[100];

but the size must be a compile-time constant. To pass that array to a function, one would declare a function

void fn(size_t n, float tab[ ])
{
...
}

with the size of tab unknown to the language, although hopefully known to the programmer. This construct is the source of most of the buffer overflow problems in the world today.

C99 extensions

C99 introduced "variable length arrays", using this syntax:

void fn(size_t n)
{ float tab[n];
...
}

This feature was provided to allow variable length arrays of storage class auto, or "on the stack" arrays, which made simple numerical subroutines much more straightforward to write. (See the original edition of "Numerical Recipes in C" for the contortions required for temporary arrays prior to C99.) However, the only context in which this construct is allowed is for "auto" variables. This is limiting.

C99 also introduced "static parameters":

void fn(static float tab[10])
{
...
}

At last, it's possible to talk about the size of an array parameter in C. But C99 limits the size to a constant.

Our proposal

We propose to allow the C99 variable length array syntax in more contexts, including formal function arguments:

void fn(size_t n, static float tab[n])
{
...
}

Here, we've provided a way to express in C what is really going on - an array of size n is being passed. Note that the information about the size of the array is not bundled with the array; the expected implementation remains the same. But the compiler now knows the size of the array.

So what is this good for? Subscript checking, call checking, and simplifying array manipulation.

As with C99 variable length arrays, sizeof (and, in most implementations, lengthof) become variable-valued functions which report the size in bytes and the length in elements of the array.

As a convenience, we suggest that the scope of formal argument declarations within a function prototype cover the entire function prototype, including the portion of the prototype preceding the declaration. This allows the common C idiom

void fn(float tab[n], size_t n)
{
...
}

and, of course, the common declaration

int read(int fd, char buf[n], size_t n);

None of this involves changing the run-time representation of arrays. This is entirely a compile time construct.

Subarrays

It's common in C to talk about a portion of an array using pointer operations. The usual syntax is

float tab[100];
float* tab2 = tab + 20

tab2 then represents an array beginning at tab[20], of undefined length.

We propose the syntax

float tab[100];
float tab2[80] = tab[20:100];

to express such things. The number before the colon is the index of the start of the subarray, and the second number is the index of the end + 1. (This follows Python's notational convention.) Again, the run-time representation is the same, and this is still a pointer assignment, not an array copy. We're now associating a length with "tab2", one which can be retrieved with lengthof or used for checking. In the example above, lengthof(tab) is 100 and lengthof(tab2) is 80.

Initialization

The common idiom

char s[ ] = "Hello, world.";

needs to be dealt with. This is actually a special case of array initialization:

float tab[ ] = {1.0, 1.5, 2.0} ;

This defines a fixed length array, so that the above is equivalent to

float tab[3] = {1.0, 1.5, 2.0};

and the first statement above is equivalent to

char s[14] = "Hello, world.";

These are now considered fixed-length arrays.

Left-side values

Assignment semantics present a problem. "Array assignment" in C usually means copying a pointer, not copying the array. The following is not allowed:

float tab1[10];
float tab2[10];
float tab3[ ];
float* p = 0;

...
tab2 = tab3; /* not allowed, left side not L-value */
tab3 = tab2; /* not allowed, size mismatch */
p = tab2; /* allowed, but array information lost */
typedef float float10[10]; /* array of size 10 */
float10* q;
q = &tab2; /* must explicitly take "address" of array */

This does not extend well to sized arrays . In C, arrays have an implicit conversion such that tab converts to &tab[0]. This builds the "array=pointer" convention into the language. We need to get past that.

References (which C does not currently have) can be used to work around this problem. One can write

typedef float float3[3];
float3& tab2a
= tab2; // create ref to array

References would help, although they can't be NULL.

C++ style references to arrays have some useful properties. It is permitted to assign a pointer of an element type from a reference to an array. So this common C idiom works:

void fbcopy(float (&dst)[n], float (&dst)[n], size_t n)
{
float* srcp = src;
float* dstp = dst;
while (n--) *dstp++ = *srcp++; /* the classic obscure way to copy an array */
}

Note, though, that while a pointer can be taken from a reference, references themselves cannot be incremented.

For checking purposes, we should view such pointers as having the restrictions of iterators - the pointer must point to an element within the array, one element beyond the array, or be null.

Auto

A proposed addition to C++ allows the "auto" keyword to simplify declarations. The newly declared variable takes the type of the initialization expression. Adding this feature to C makes initialization of sized objects simpler:.

auto s = "Hello, world.";

This eliminates the need to count characters.

Casts

Conversions will, of course, be necessary. As above, we extend the existing syntax to allow variables where constants were previously required. We can then write

float tab1[10];
size_t n = 5;
typedef float floatarrayn[n]; /* creates array type with n float elements */
floatarrayn tab2 = (floatarrayn) tab1; /* shallow copy, downcasts array */

Existing C syntax forces us to use the usual unwieldy "array typedef" form. It would be tempting to extend type declaration syntax to allow

float[n] tab; /* array of length */

so that we could then write casts like

(float[n])

but this is not essential.

This needs further discussion.

Void

To reduce the need for "void *", which bypasses most type checking, we would add two new types:

void_t

which, like void, represents data of unknown type. sizeof(void_byte_t) == 1, and arrays of void_t are permitted.

void_zero_t

which is a void_t whose value must be zero. (These are merely names for discussion; in practice, "void_t" is seen in existing code, and a different keyword may be preferable.)

Conversion rules are as follows:

  1. void_zero_t and arrays thereof can be converted to any type of equivalent size.
  2. void_t and arrays thereof can be converted to any type of equivalent size other than a struct which contains one or more pointer types.

The standard malloc function would then be declared:

void_t[size] malloc(size_t size);

and the standard "bzero" function would be defined as

void_zero_t[n] bzero(void_t s[n], size_t n);

Checking

Given this information, we can perform subscript checking. The following checks should be made:

  1. For subscript expressions, if lengthof is defined for the array being subscripted, the subscript must be within the range 0 .. lengthof(arr) -1.
  2. For function calls, the length of the formal argument must be less than or equal to the length of the actual argument.
  3. For assignments, the length of the LHS must be less than or equal to the length of the RHS.

This is backwards-compatible with correct existing code, since checking is only enabled for arrays with either constant length or for which the new syntax is being used.

Extension to structures

A classic idiom in C is:

struct buf {
size_t len;
char buffer[1];
};

This usually indicates a buffer of variable length. But C provides no way to talk about a variable length item within a structure, so the programmer is obliged to lie to the language about it. We would propose

struct buf {
size_t len;
char buffer[len];
};

As with the classic idiom, the variable length array can appear only as the last item of the structure.

The type declaration becomes a bit unusual. We need a type like this:

size_t n = 100;
struct buf ourbuffer(n);

Allocation

To allocate a variable length array, the syntax would be

size_t n = 100;
float tab[n] = (float[n]) malloc(sizeof(float)*n);

It would be tempting to introduce the "new" syntax from C++, so that one could simply write

float tab[n] = new float[n];

but that might be controversial. Certainly one could have

#define NEW(TYPE,LENGTH) (TYPE[(LENGTH)])malloc(sizeof(TYPE)*(LENGTH))
...
float tab[n] = NEW(float,n);

which avoids conflicts with C++.

Discussion

This is a limited extension to C, not an attempt to make C into C++ or arrays into first-class objects. It's still not possible to copy a variable length array with the assignment operator, or pass an array by value. The goal here is to provide the minimal backwards-compatible extension which allows the programmer to say within C the size of an array.

Strict mode

The next step is "strict mode", an optional mode in which the checking is stricter, and sufficient to prevent buffer overflows. In strict mode, we disallow conversions between pointers and arrays. In this mode, a pointer means a pointer to a single object, never an array. Some new restrictions apply:

  1. Subscripting a pointer is prohibited.
  2. Casts from pointers to arrays are normally disallowed.
  3. Pointer arithmetic is prohibited except when all the following hold:
    1. The pointer has storage class "auto" and local scope.
    2. All assignments to the pointer are from the same array.

These restrictions allow subscript checking for pointer arithmetic. The compiler can associate the pointer with a single array, and thus perform subscript checking on it.

Rule 3 allows the classic C idiom

void bcopy(char dest[n], char src[n], size_t n)
{ while (n-- > 0) *dest++ = *src++; }

The compiler can interpret this unambiguously as the equivalent form

void bcopy(char dest[n], char src[n], size_t n)
{ destix = 0; srcix = 0; while (n-- > 0) dest[destix++] = src[srcix++]; }

This form is suitable for subscript checking. A clever compiler could hoist the subscript checks out of the loop.

It seems desirable to allow pointer arithmetic in the cases where it can be easily checked, to support legacy code in strict mode to the greatest extent possible. Where pointer arithmetic cannot be checked, though, it must be disallowed for safety.

Problems

Variable-length null-terminated strings

It's not clear what to do about the common idiom of null-terminated constant strings. This is serious enough that we may want to allow unchecked subscripting for "const" objects. Errors from such operations can't corrupt memory, although they may read junk.

 

November 15, 2008