Strict mode for C++
This proposal is for the next generation of C++, targeted for the next major revision cycle of the language. This is a very early draft for comment. After a go-round with the online C++ community, the proposal has been simplified considerably. This is round 2.
The proposed approach is a small number of changes to the C++ language which enable the safe encapsulation of pointer operations. The suggested encapsulation involves "smart pointers" and a version of the STL with subscript and iterator checking.
Reference counting has a good track record with C++, in the form of "smart pointers". If better integrated with the language, it can be made safe. That is the essence of this proposal.
Many C++ smart pointer implementations exist. They share a common weakness. Using a smart pointer requires obtaining a raw pointer from the smart pointer. Once a raw pointer has been obtained, it can be used in ways that break the smart pointer system. This is a language-level problem and cannot be fixed effectively through class libraries alone.
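The weakness can be reproduced with any reference-counted pointer. The sketch below uses std::shared_ptr and std::weak_ptr purely as stand-ins for "some smart pointer implementation"; the names EscapeResult and raw_pointer_outlives_owner are invented for illustration. It compiles without complaint, which is the problem: nothing in the language stops the raw pointer from outliving the object.

```cpp
#include <memory>

// Outcome of the experiment below (hypothetical helper type).
struct EscapeResult {
    bool raw_was_set;       // a raw pointer was obtained from the smart pointer
    bool object_destroyed;  // the object it referred to no longer exists
};

// Obtains a raw pointer from a smart pointer, then lets the smart pointer
// release the object. The raw pointer is never dereferenced or otherwise
// read afterwards, because using it would not be safe.
EscapeResult raw_pointer_outlives_owner()
{
    int* raw = nullptr;
    bool raw_was_set = false;
    std::weak_ptr<int> watcher;              // observes the object without owning it
    {
        std::shared_ptr<int> s = std::make_shared<int>(42);
        watcher = s;
        raw = s.get();                       // raw pointer escapes the smart pointer
        raw_was_set = (raw != nullptr);
    }                                        // reference count reaches 0; object destroyed
    // 'raw' is still in scope here and still holds the old address: it dangles.
    (void)raw;
    return EscapeResult{ raw_was_set, watcher.expired() };
}
```

After the inner block ends, the variable raw is still usable by the programmer even though the object is gone, and no diagnostic is issued.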
The minimal change required to the language is the addition of a new data attribute that provides the necessary protection. The underutilized auto keyword seems appropriate. The basic concept is that pointers and references explicitly declared as auto can't be used in ways that would let the data they contain outlive the scope of the auto variable. Specifically,
This allows smart pointers implemented via templates to protect themselves. Whenever a smart pointer implementation needs to return a raw pointer, it should return an auto value. The rules for auto prevent the returned value from outliving the scope of the expression from which it was obtained, preventing dangling pointers.
Some sample code:
    void fn1(auto someType* p);     // takes auto arg
    void fn2(someType* p);          // takes non-auto arg
    void fn3(auto someType* p)
    {
        auto someType* q = p;       // OK
        fn1(p);                     // OK
        fn2(p);                     // ERROR - auto passed to non-auto
        // Use of some smart pointer implementation. This is illustrative only
        smart_ptr
The implications of auto are subtle, but powerful. Programs can use both raw pointers and smart pointers without risk of breaking the smart pointer system. Smart pointers and auto scoped pointers play well together. Smart pointer implementations can be safe, provided they return only auto scoped pointers when needed, because the lifetime of the contents of an auto scoped pointer has been limited.
Note especially that last q = s;. This is the auto scope mechanism protecting a smart pointer. At the end of the inner block, s will be deallocated, and the heap object it points to will go away because its reference count goes to 0. "q" would have been a dangling pointer. That error gets caught at compile time. There's no additional run time overhead for auto scoped objects; it's entirely a compile time check, like const.
The built-in arithmetic operations remain defined for non-auto pointers, but don't accept "auto" arguments. And conversion from non-auto to auto is defined, but auto to non-auto conversion is prohibited. Pointer arithmetic on auto scoped pointers is thus prohibited. "auto" scope allows intermixing auto and raw pointers in the same program, allowing compatibility.
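Part of these conversion rules can be approximated today with a library type, which helps show both what the rules mean and where a library falls short. The following is only a sketch: scoped_view, supports_plus_one and read_through are hypothetical names, not part of the proposal or of any existing library. The wrapper accepts a raw pointer implicitly, offers no conversion back, and defines no arithmetic.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical wrapper approximating an "auto" scoped pointer in a library.
template <typename T>
class scoped_view {
public:
    scoped_view(T* p) : p_(p) {}            // non-auto to auto: implicit, allowed
    T& operator*() const { return *p_; }    // dereference is allowed
    // Deliberately absent: operator T*(), get(), operator+, operator++, operator[].
private:
    T* p_;
};

// Detects whether the expression 'u + 1' is well-formed for a type U.
template <typename U, typename = void>
struct supports_plus_one : std::false_type {};

template <typename U>
struct supports_plus_one<U, decltype(void(std::declval<U>() + 1))> : std::true_type {};

static_assert(std::is_convertible<int*, scoped_view<int>>::value,
              "raw pointer converts to the scoped wrapper");
static_assert(!std::is_convertible<scoped_view<int>, int*>::value,
              "scoped wrapper does not convert back to a raw pointer");
static_assert(supports_plus_one<int*>::value,
              "arithmetic remains available on raw pointers");
static_assert(!supports_plus_one<scoped_view<int>>::value,
              "arithmetic is not available on the scoped wrapper");

// A function taking the wrapper can be called with a raw pointer.
int read_through(scoped_view<int> v) { return *v; }
```

The wrapper cannot enforce the lifetime rule itself, and an expression such as &*v still yields a raw pointer. That gap is the part this proposal argues must be closed in the language rather than in a class library.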
This interpretation of auto is simple to implement in compilers and useful in its own right, as a way to tighten up existing smart pointer libraries. auto should have these semantics all the time. The keyword is used so seldom that this won't break much, if any, code, and if it does, a compile time error is generated.
Almost all the programming languages which postdate C++ are "memory-safe". In such languages, data objects are protected from being overwritten by code which should not be able to write to them. LISP was the first language to have this property. Java, Perl, Python, and C# all have it. C and C++ do not. The usual observation is that programming is easier in memory-safe languages, primarily because debugging is much easier. But there is usually a penalty in run-time performance.
Perl, like the C/C++ family, started out as a non-object-oriented language but acquired objects later in life. Perl has a "strict mode", which turns off certain language features considered undesirable or obsolete. This idea is worth borrowing for C++. A "strict mode" for C++ offers a way to tighten up the language for new work without breaking existing code. The specific goal of "strict mode" is to eliminate, as much as possible, "undefined behavior" of programs. The goal is not stylistic. Features disabled in strict mode should be limited to those which, under the existing C++ definition, result in crash-type undefined behavior.
auto, as defined above, is an "always-on" feature. Once we have auto, we need very few additional restrictions to achieve memory safety for pointer operations:
These restrictions lock out the creation of raw pointers in strict mode. They don't lock out the use of raw pointers obtained from non-strict portions of the program. This allows interoperability of strict and non-strict code. Such mixed programs are, of course, not safe. Only programs where all compilation units are compiled in strict mode are safe. This provides a migration path to safety while allowing the reuse of existing code.
From a safety perspective, there are three kinds of C arrays: fixed-size, constant null-terminated, and "other". The first two kinds are in principle checkable at run time. The third has to be viewed as a legacy feature not used in strict mode except to interface with existing code.
Fixed-size arrays are identifiable at compile time, and thus are in principle checkable. Such checking requires support in the compiler, but is unambiguous.
    class vec3 {
    private:
        double n[3];
    public:
        double sum(vec3& vec)
        {
            double total = 0.0;
            for (int i=0; i<3; i++)
            {   total += vec.n[i]; }    // compiler must generate subscript check
            return(total);
        }
        // ...
    };
Note that for most loops, such checks can be optimized out.
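The check the compiler would have to generate can be written out by hand, since the array bound is part of the type. This is a minimal sketch, not the proposed mechanism: checked_at and sum3 are hypothetical names, and throwing std::out_of_range is an assumed failure behavior that the proposal does not specify.

```cpp
#include <cstddef>
#include <stdexcept>

// Subscript with an explicit bounds check. N is deduced from the array type,
// so the bound is known at compile time, as it is for any fixed-size array.
template <typename T, std::size_t N>
T& checked_at(T (&arr)[N], std::size_t i)
{
    if (i >= N) {
        throw std::out_of_range("subscript out of range");
    }
    return arr[i];
}

// Sums a fixed-size array using the checked subscript.
// Here the loop bound equals the array bound, so an optimizer can
// prove the check never fails and remove it.
double sum3(const double (&v)[3])
{
    double total = 0.0;
    for (std::size_t i = 0; i < 3; i++) {
        total += checked_at(v, i);
    }
    return total;
}
```

In sum3 the index range is visible in the loop header, which is the common case where the check costs nothing after optimization.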
String constants, and arrays of unknown size initialized with aggregates, present problems. The syntax of those two constructs is built into C and C++ at a low level, and both are widely used in existing code. Fortunately, most of the valid uses of those constructs involve const data items. So the following compromise is proposed.
Thus, it's possible to read junk, but not write it, and reading off the end of an array is recoverable within the program.
    #include <cstdio>

    int main(int argc, const char* argv[])
    {
        for (int i=0; i<argc; i++)
        {   const char* arg = argv[i];
            printf("Arg %i: %s\n",i,arg);
        }
    }
This is classic C. Because the arrays involved are const, trouble can be contained. Non-const built-in arrays cannot be declared in strict mode. Thus, printf and fprintf are available, but sprintf and scanf, which store into strings and historically cause trouble, are not. String storage must be done through collection classes in strict mode.
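As an illustration of string storage through a collection class, the formatting that sprintf would do into a caller-supplied buffer can be done into a std::string instead. This is a sketch in standard C++ as it exists today; format_arg is a hypothetical name, and the proposal does not prescribe a particular string class.

```cpp
#include <sstream>
#include <string>

// Builds the same text as printf("Arg %i: %s", index, arg), but stores it
// in a std::string that grows as needed, so there is no fixed-size
// destination buffer to overrun.
std::string format_arg(int index, const char* arg)
{
    std::ostringstream out;
    out << "Arg " << index << ": " << arg;
    return out.str();
}
```

The input is still read through a const char*, which the compromise above permits; only the writing side has moved into a class that manages its own storage.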
This is a compromise between safety and backwards compatibility.
Almost all existing C++ programs should compile in non-strict mode. Modern C++ programs written using the STL and some smart pointer library will be convertible to strict mode without much effort. Converting older programs will consist mostly of converting them to use the STL and smart pointers, which is non-trivial but well understood.
When all the compiler errors have been eliminated, the program should be memory-safe, provided that the STL and smart pointer implementations perform appropriate checks. Requirements for template library safety are discussed separately.
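For a sense of what an "appropriate check" looks like in a container, the existing std::vector::at member already performs one: it throws std::out_of_range for a bad index, where operator[] performs no check. The sketch below wraps it in a hypothetical helper, try_get. Whether a strict-mode library would make operator[] behave this way is an assumption here, since those requirements are discussed separately.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true and stores the element in 'out' if 'i' is a valid index;
// returns false, leaving 'out' unchanged, if it is not.
bool try_get(const std::vector<int>& v, std::size_t i, int& out)
{
    try {
        out = v.at(i);          // checked access: throws on a bad index
        return true;
    } catch (const std::out_of_range&) {
        return false;           // the error is recoverable within the program
    }
}
```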
This set of easily implemented restrictions makes C++ memory-safe. It retains as much of standard C/C++ semantics as can be retained consistent with safety. The overhead increase is modest provided that code is written to use auto scope pointers in speed-critical sections. Overhead can be reduced further with compiler optimization of checking.
See the sections below. This is an early draft; more will be added. Comments are welcomed, either by mail or in "comp.std.c++".
June 2, 2010