Strict mode for C++ - early draft proposal

C++ 200x - a proposal for a "strict mode"

This proposal is for the next generation of C++, targeted for the next major revision cycle of the language. This is a very early draft for comment. After a go-round with the online C++ community, the proposal has been simplified considerably. This is round 2.

The proposed approach is a small number of changes to the C++ language which enable the safe encapsulation of pointer operations. The suggested encapsulation involves "smart pointers" and a version of the STL with subscript and iterator checking.

Smart pointer safety

Reference counting has a good track record with C++, in the form of "smart pointers". If better integrated with the language, it can be made safe. That is the essence of this proposal.

Many C++ smart pointer implementations exist. They share a common weakness. Using a smart pointer requires obtaining a raw pointer from the smart pointer. Once a raw pointer has been obtained, it can be used in ways that break the smart pointer system. This is a language-level problem and cannot be fixed effectively through class libraries alone.

The minimal change required to the language is the addition of a new data attribute that provides the necessary protection. The underutilized auto keyword seems appropriate. The basic concept is that pointers and references explicitly declared as auto can't be used in ways that would let the data they contain outlive the scope of the auto variable. Specifically,

auto data items cannot be assigned values of scope less than the auto data item. This is the key restriction.
auto data items must be initialized to some valid value.
The built-in arithmetic operations on pointers which return pointer results (+, -, ++, --, +=, -=), along with subscripting ([]), are to be defined as taking non-auto arguments, thus disallowing pointer arithmetic on auto pointers.
Non-auto pointers can be assigned to auto pointers, but auto pointers cannot be assigned to non-auto pointers.
auto is permitted as an attribute on function arguments.

This allows smart pointers implemented via templates to protect themselves. Whenever a smart pointer implementation needs to return a raw pointer, it should return an auto value. The rules for auto prevent the returned value from outliving the scope of the expression from which it was obtained, preventing dangling pointers.

Some sample code:

void fn1(auto someType* p);     // takes auto arg
void fn2(someType* p);          // takes non-auto arg

void fn3(auto someType* p)
{   auto someType* q = p;       // OK
    fn1(p);                     // OK
    fn2(p);                     // ERROR - auto passed to non-auto
    //    Use of some smart pointer implementation.  This is illustrative only
    smart_ptr<someType> r = smart_new<someType>();  // create new obj and smart ptr to it
    q = r;                      // OK - smart pointer converts to "auto" raw ptr.
    someType* bad1 = r;         // ERROR - auto passed to non-auto
    smart_ptr<someType> t = r;  // OK - smart pointer assignment bumps ref count
    fn1(r);                     // OK - smart pointer converts to "auto" raw ptr.
    {
        someType innerobj;      // a local instance
        auto someType* innerq = &innerobj; // OK - passed to lesser scope
        q = &innerobj;          // ERROR - assigned pointer to inner object to outer scope.
        smart_ptr<someType> s = smart_new<someType>(); // smart pointer in inner scope
        q = s;                  // ERROR - assigned pointer to inner object to outer scope.
    }           
}

The implications of auto are subtle, but powerful. Programs can use both raw pointers and smart pointers without risk of breaking the smart pointer system. Smart pointers and auto scoped pointers play well together. Smart pointer implementations can be safe, provided they return only auto scoped pointers when needed, because the lifetime of the contents of an auto scoped pointer has been limited.

Note especially that last q = s;. This is the auto scope mechanism protecting a smart pointer. At the end of the inner block, s will be deallocated, and the heap object it points to will go away because its reference count goes to 0. "q" would have been a dangling pointer. That error gets caught at compile time. There's no additional run time overhead for auto scoped objects; it's entirely a compile time check, like const.
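The lifetime argument above can be observed in today's C++. The following is a minimal sketch, assuming C++11 and using std::shared_ptr and std::weak_ptr as stand-ins for the illustrative smart_ptr; the function name object_outlives_inner_block is hypothetical. It shows only that the heap object is gone once the inner block ends, which is why a raw pointer stored in an outer scope would dangle.

```cpp
#include <cassert>
#include <memory>

struct someType { int value = 42; };

// Returns true if the heap object is still alive once the inner block ends.
bool object_outlives_inner_block()
{
    std::weak_ptr<someType> observer;   // observes without adding to the reference count
    {
        std::shared_ptr<someType> s = std::make_shared<someType>();
        observer = s;
        assert(s.use_count() == 1);     // s is the only owner
        // A raw pointer taken from s here and stored in an outer variable
        // would be the "q = s" case that the proposal rejects at compile time.
    }                                   // s destroyed, count reaches 0, object freed
    return !observer.expired();
}
```

The function returns false: nothing keeps the object alive past the inner block, so any outer raw pointer to it would be invalid.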

The built-in arithmetic operations remain defined for non-auto pointers, but don't accept "auto" arguments. And conversion from non-auto to auto is defined, but auto to non-auto conversion is prohibited. Pointer arithmetic on auto scoped pointers is thus prohibited. "auto" scope allows intermixing auto and raw pointers in the same program, allowing compatibility.
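The one-way conversion and the absence of arithmetic can be approximated with a library wrapper, which may help make the rules concrete. This is a sketch under stated assumptions, not the proposed language feature: scoped_raw and supports_plus are hypothetical names, C++11 is assumed, and a wrapper cannot enforce the key scope restriction (that check needs the compiler).

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical library stand-in for "auto T*".
template <typename T>
class scoped_raw
{
public:
    scoped_raw(T* p) : p_(p) { assert(p != nullptr); }  // non-auto to auto: allowed
    T& operator*() const { return *p_; }
    T* operator->() const { return p_; }
    // Deliberately no conversion back to T* and no arithmetic operators.
private:
    T* p_;
};

// Detects whether the expression "p + 1" compiles for a type P.
template <typename P, typename = void>
struct supports_plus : std::false_type {};
template <typename P>
struct supports_plus<P, decltype(void(std::declval<P>() + 1))> : std::true_type {};

static_assert(std::is_convertible<int*, scoped_raw<int> >::value,
              "non-auto converts to auto");
static_assert(!std::is_convertible<scoped_raw<int>, int*>::value,
              "auto does not convert to non-auto");
static_assert(!std::is_default_constructible<scoped_raw<int> >::value,
              "must be initialized");
static_assert(supports_plus<int*>::value, "raw pointers keep arithmetic");
static_assert(!supports_plus<scoped_raw<int> >::value, "no arithmetic on the wrapper");
```

The wrapper still leaks: &*p recovers a plain pointer, which is one reason the proposal argues that the fix has to be made in the language rather than in a class library.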

This interpretation of auto is simple to implement in compilers and useful in its own right, as a way to tighten up existing smart pointer libraries. auto should have these semantics all the time. The keyword is used so seldom that this won't break much, if any, code, and if it does, a compile time error is generated.

Strict mode

Almost all the programming languages which postdate C++ are "memory-safe". In such languages, data objects are protected from being overwritten by code which should not be able to write to them. LISP was the first language to have this property. Java, Perl, Python, and C# all have it. C and C++ do not. The usual observation is that programming is easier in memory-safe languages, primarily because debugging is much easier. But there is usually a penalty in run-time performance.

Perl, like the C/C++ family, started out as a non-object oriented language but acquired objects later in life. Perl has a "strict mode", which turns off certain language features considered undesirable or obsolete. This idea is worth borrowing for C++. A "strict mode" for C++ offers a way to tighten up the language for new work without breaking existing code. The specific goal of "strict mode" is to eliminate, as much as possible, "undefined behavior" of programs. The goal is not stylistic. Features disabled in strict mode should be limited to those which, under the existing C++ definition, result in crash-type undefined behavior.

auto, as defined above, is an "always-on" feature. Once we have auto, we need very few additional restrictions to achieve memory safety for pointer operations:

new and delete are unavailable. (They would normally be encapsulated in a smart pointer implementation).
Definition of non-const built-in arrays ("C arrays") is not permitted.
Declaration of unions containing pointers is not permitted.
The unary "&" operator returns an auto pointer.
Pointer values must be initialized.
Dereferencing NULL must result in the throwing of a C++ exception, or, during debugging, halting of the program for debug purposes. Other undefined behavior is not permitted.

These restrictions lock out the creation of raw pointers in strict mode. They don't lock out the use of raw pointers obtained from non-strict portions of the program. This allows interoperability of strict and non-strict code. Such mixed programs are, of course, not safe. Only programs where all compilation units are compiled in strict mode are safe. This provides a migration path to safety while allowing the reuse of existing code.
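The null-dereference rule in the list above can be sketched in library form. This is illustrative only and assumes C++11; checked_ptr and null_dereference are hypothetical names, and under the proposal the check would be generated by the compiler rather than written by hand.

```cpp
#include <stdexcept>

// Hypothetical exception type thrown on a null dereference.
struct null_dereference : std::runtime_error
{
    null_dereference() : std::runtime_error("dereference of null pointer") {}
};

// Hypothetical pointer wrapper: always initialized, checked on every dereference.
template <typename T>
class checked_ptr
{
public:
    checked_ptr(T* p) : p_(p) {}            // no default constructor: must be initialized
    T& operator*() const { return *checked(); }
    T* operator->() const { return checked(); }
private:
    T* checked() const
    {
        if (p_ == nullptr) throw null_dereference();  // defined behavior instead of a crash
        return p_;
    }
    T* p_;
};
```

A caller can then catch null_dereference and recover, instead of hitting undefined behavior.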

Built-in arrays ("C arrays") in strict mode

From a safety perspective, there are three kinds of C arrays: fixed-size, constant null-terminated, and "other". The first two kinds are in principle checkable at run time. The third has to be viewed as a legacy feature not used in strict mode except to interface with existing code.

Fixed-size arrays are identifiable at compile time, and thus are in principle checkable. Such checking requires support in the compiler, but is unambiguous.

class vec3
{
private:
    double n[3];
public:
    double sum(vec3& vec)       // returns the sum of the elements of vec
    {   double total = 0.0;
        for (int i=0; i<3; i++)
        {   total += vec.n[i]; }  // compiler must generate subscript check
        return(total);
    }
// ...
};

Note that for most loops, such checks can be optimized out.
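The same effect is available today from the library, without compiler support. The following is a minimal sketch, assuming C++11, that rewrites the vec3 example with std::array, whose at() member throws std::out_of_range on a bad subscript; the constructor and the at() accessor on vec3 are additions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

class vec3
{
public:
    vec3(double x, double y, double z) : n{{x, y, z}} {}

    // Checked element access: throws std::out_of_range if i >= 3.
    double at(std::size_t i) const { return n.at(i); }

    double sum() const
    {
        double total = 0.0;
        for (std::size_t i = 0; i < n.size(); i++)
        {   total += n.at(i); }   // explicit subscript check on every access
        return total;
    }
private:
    std::array<double, 3> n;      // size is known at compile time
};
```

Because the loop bound and the array size are the same compile-time constant, an optimizer has what it needs to remove the check inside sum().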

String constants, and arrays of unknown size initialized with aggregates, present problems. The syntax of those two constructs is built into C and C++ at a low level, and both are widely used in existing code. Fortunately, most of the valid uses of those constructs involve const data items. So the following compromise is proposed.

const C arrays initialized with string constants or aggregate constants are allowed in strict mode.
Accesses beyond the limits of the data of const C arrays must result in the return of undefined values, the throwing of a C++ exception, or, during debugging, halting of the program for debug purposes. Other undefined behavior is not permitted.

Thus, it's possible to read junk, but not write it, and reading off the end of an array is recoverable within the program.
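One of the permitted outcomes, throwing an exception on an out-of-range read, can be sketched as a function template that deduces the size of a const C array. This is an illustration assuming C++11; checked_read, greeting, and primes are hypothetical names, and it works only where the array type (and so its size) is still visible, not after decay to a pointer.

```cpp
#include <cstddef>
#include <stdexcept>

// Reads element i of a const C array whose size N is deduced at compile time.
// Throws std::out_of_range instead of reading past the end.
template <typename T, std::size_t N>
const T& checked_read(const T (&arr)[N], std::size_t i)
{
    if (i >= N) throw std::out_of_range("read past end of const array");
    return arr[i];
}

const char greeting[] = "hello";     // string constant: 6 elements, including the terminator
const int primes[] = {2, 3, 5, 7};   // aggregate of unknown declared size: 4 elements
```

Both initialization forms named in the compromise are covered: the compiler knows N in each case, so the read is either valid or recoverable.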

#include <stdio.h>

int main(int argc, const char* argv[])
{
    for (int i=0; i<argc; i++)
    {   const char* arg = argv[i];
        printf("Arg %i: %s\n",i,arg);
    }
}

This is classic C. Because the arrays involved are const, trouble can be contained. Non-const built-in arrays cannot be declared in strict mode. Thus, printf and fprintf are available, but sprintf and scanf, which store into strings and historically cause trouble, are not. String storage must be done through collection classes in strict mode.

This is a compromise between safety and backwards compatibility.
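As an example of string storage through a collection class, the sprintf case can be handled with std::ostringstream and std::string, which grow as needed and need no caller-supplied buffer. This is a minimal sketch; format_arg is a hypothetical helper name.

```cpp
#include <sstream>
#include <string>

// Builds "Arg <index>: <arg>" without writing into a fixed-size char buffer.
std::string format_arg(int index, const char* arg)
{
    std::ostringstream out;
    out << "Arg " << index << ": " << arg;
    return out.str();
}
```

The input stays a const char*, matching the const arrays allowed in strict mode, while the output lives in a std::string that manages its own storage.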

Conversion of existing programs

Almost all existing C++ programs should compile in non-strict mode. Modern C++ programs written using the STL and some smart pointer library will be convertible to strict mode without much effort. Converting older programs will consist mostly of converting them to use the STL and smart pointers, which is non-trivial but well understood.

When all the compiler errors have been eliminated, the program should be memory-safe, provided that the STL and smart pointer implementations perform appropriate checks. Requirements for template library safety are discussed separately.

Conclusion

This set of easily implemented restrictions makes C++ memory-safe. It retains as much of standard C/C++ semantics as can be retained consistent with safety. The overhead increase is modest provided that code is written to use auto scope pointers in speed-critical sections. Overhead can be reduced further with compiler optimization of checking.

Details

See the sections below. This is an early draft; more will be added. Comments are welcomed, either by mail or in "comp.std.c++".

Rationale

Pointers and safe collections

Code snippets

Related work

June 2, 2010