©Conrad Weisert, January, 2010
Experienced programmers learning their first C-family language (C, C++, Java, C#)
express shock and dismay when they learn that constant-length character strings
within a record (struct or object) are not
contiguous with the record. They're advised
that only a pointer to the characters is a member of the record. The actual
character-string data will be in an unnamed area of heap memory accessible only through that
pointer. It follows that the sizeof operator doesn't report
the actual size of the whole data record.
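The difference can be made concrete with a small sketch. The two record types and the helper function below are hypothetical, invented only for this illustration: one record holds a pointer to its characters, the other holds the characters themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical record types, invented for this illustration.
struct PointerRecord {
    const char* name;   // only the address lives in the record
    int quantity;
};

struct EmbeddedRecord {
    char name[48];      // the characters themselves live in the record
    int quantity;
};

// sizeof is fixed at compile time, so it cannot account for the
// characters that a PointerRecord refers to at run time.
// This helper reports how many bytes sizeof leaves out.
std::size_t bytesMissedBySizeof(const PointerRecord& r) {
    assert(r.name != nullptr);
    return std::strlen(r.name) + 1;   // characters plus C's terminator
}
```

Two PointerRecord objects have the same sizeof no matter how long their names are, while sizeof(EmbeddedRecord) always includes all 48 characters.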
The programmers may point out that the languages they've used before (Cobol, PL/1, etc.) imposed no such complications. A character-string data item was simply a member of the record.
The origin of this complication lies in the original design of the C programming language in the 1970s. The language was intended mainly for systems programming as a substitute for machine-dependent assembly languages. Its focus was on machine words: integers, addresses, floating-point numbers, and sometimes individual bits. The designers believed they were keeping the language simple by not providing a character-string data type. Instead, on those occasions when the programmer had to deal with a character string, he or she could simulate it through an array of single characters (actually 8-bit integers). That was what an assembly-language programmer would do.
That expedient was further complicated by another expedient. In order to make array access efficient, subscripting was just a notational convenience for address arithmetic. The origin of an array was a pointer, which the programmer could increment. A program's access to array elements, including characters within a string, had to be through pointers.
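A short sketch of what that means in practice. The function names are hypothetical; the language rule they rely on is that a[i] is defined as *(a + i).

```cpp
#include <cassert>
#include <cstddef>

// Counts the characters before the terminator by advancing a pointer,
// the way C code traditionally walks through a string.
std::size_t lengthByPointer(const char* p) {
    assert(p != nullptr);
    const char* start = p;
    while (*p != '\0') {
        ++p;                          // increment the pointer itself
    }
    return static_cast<std::size_t>(p - start);
}

// By definition a[i] means *(a + i), so both forms name the same element.
bool subscriptMatchesArithmetic(const char* a, std::size_t i) {
    return a[i] == *(a + i) && &a[i] == a + i;
}
```

Nothing in either function knows how long the string is supposed to be; the only bound is the terminating null character.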
That led to the notion that a character string is not an elementary data item but a container! Even now, four decades later, many C++ programmers will tell you that character strings are containers and have to be handled that way.
Fortunately, C++ provides a handy way to handle strings as contiguous elementary
data. When we first learn about class templates we see examples in which the
classes are containers and the template parameters are the names of types
or classes:
    template<typename T> class Thing { . . .
Courses and textbooks often overlook another form of class template, where the template parameter is
an integer:
    template<int size> class Cstring {
        char data[size];
        . . .
    };
That template parameter will be specified as a constant whenever a client programmer
declares a data item:
    class Product {
        Cstring<8> identifierCode;
        Cstring<48> description;
        Money price;
        int onHandQ;
        int onOrderQ;
    };
How big is a Product object, assuming that a
Money object occupies 8 bytes,
an int is 4 bytes, and a
char is a single byte? What will
sizeof(Product) return? What does a programmer have
to do to store a Product object in a database and retrieve it later? How would all that change if we used the STL's
string class instead of
Cstring?
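One way to explore those questions is the sketch below. The Cstring and Money definitions are bare stand-ins (assumptions, not the real classes), and store and retrieve are hypothetical helpers that copy the raw bytes of an object, as a simple file or database write might.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <type_traits>

// Minimal stand-ins so the example compiles on its own.
// Both are assumptions, not the article's real classes.
template<int size> struct Cstring {
    char data[size];
};
struct Money {
    long long cents;   // assumed 8-byte representation
};

struct Product {
    Cstring<8>  identifierCode;
    Cstring<48> description;
    Money       price;
    int         onHandQ;
    int         onOrderQ;
};

// 8 + 48 + 8 + 4 + 4 = 72 under the stated size assumptions.
// A compiler may add padding, so only a lower bound is checked.
static_assert(sizeof(Product) >= sizeof(Cstring<8>) + sizeof(Cstring<48>)
                                 + sizeof(Money) + 2 * sizeof(int),
              "sizeof covers every member");

// Every member lives inside the object, so a byte-for-byte copy is complete.
static_assert(std::is_trivially_copyable<Product>::value,
              "Product refers to no outside data");

// A std::string may keep its characters elsewhere and is not trivially
// copyable, so the same byte copy would not be valid for it.
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string needs real serialization");

// Hypothetical helpers: write and read the object as raw bytes.
void store(const Product& p, unsigned char* buffer) {
    assert(buffer != nullptr);
    std::memcpy(buffer, &p, sizeof(Product));
}

Product retrieve(const unsigned char* buffer) {
    assert(buffer != nullptr);
    Product p;
    std::memcpy(&p, buffer, sizeof(Product));
    return p;
}
```

On common compilers, where int is 4 bytes, sizeof(Product) comes out as 72, though padding rules can differ between platforms. A byte image like this is only safe to read back with the same compiler settings and on the same kind of machine. With a std::string member, each string would have to be written out separately.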
Obviously, we need to provide functionality to go with the data. We won't show the details in this article. In addition to the usual constructors, operators, and substring methods, we'll need methods to convert between this class and other string classes. To keep the representation pure and compatible with other languages let's not append that null-termination character that characterizes pseudo-strings in C.
Note that there is zero overhead in this class. The string object contains no pointer, no length field, no terminator, no reference count—just the value.
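To make the zero-overhead claim checkable, here is a minimal sketch of such a class. It is not the author's implementation: the blank padding, the silent truncation, and the member name str are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// A minimal sketch of the idea, not the author's implementation.
// Blank padding, silent truncation, and the name str() are assumptions.
template<int size>
class Cstring {
    char data[size];                       // the value itself, no terminator
public:
    // Copies at most `size` characters and pads the remainder with blanks.
    Cstring(const char* source = "") {
        assert(source != nullptr);
        const std::size_t capacity = static_cast<std::size_t>(size);
        std::size_t n = std::strlen(source);
        if (n > capacity) n = capacity;    // truncate silently
        std::memcpy(data, source, n);
        std::memset(data + n, ' ', capacity - n);
    }

    // Converts to the standard string class, keeping the padding.
    std::string str() const {
        return std::string(data, static_cast<std::size_t>(size));
    }

    // Two values are equal when all `size` characters match.
    bool operator==(const Cstring& other) const {
        return std::memcmp(data, other.data, static_cast<std::size_t>(size)) == 0;
    }
};

// Holds on mainstream compilers: the object is exactly its characters.
static_assert(sizeof(Cstring<8>) == 8, "no pointer, length field, or terminator");
```

For example, Cstring<8>("AB12").str() yields the eight characters "AB12" followed by four blanks.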
Unfortunately the compiler will produce a separate class, with its own copy of
every method the program uses, for each value of size that a client
programmer uses to instantiate objects. To avoid burdening the program with dozens
of Cstring classes, organizations can
establish conventions to limit the options, e.g. only multiples of 8.
Most uses of Cstring will themselves be
members of classes, as in the Product example
above, rather than in inline application code. That helps to control the proliferation
of separate Cstring classes.
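The following sketch shows why each size counts as its own class, and one possible way to apply a multiples-of-8 convention. The stand-in Cstring and the roundedUp helper are hypothetical, introduced only for this example.

```cpp
#include <type_traits>

// Bare stand-in for the Cstring class template (an assumption).
template<int size> struct Cstring {
    char data[size];
};

// Different sizes are unrelated types, each with its own generated code.
static_assert(!std::is_same<Cstring<8>, Cstring<16>>::value,
              "each size is a distinct class");

// Equal sizes always name the same class, however the number is written.
static_assert(std::is_same<Cstring<8>, Cstring<4 + 4>>::value,
              "same size, same class");

// Hypothetical helper for a multiples-of-8 convention:
// rounds a requested length up to the next multiple of 8.
constexpr int roundedUp(int n) {
    return (n + 7) / 8 * 8;
}

// Requests for 13, 14, 15, or 16 characters all share Cstring<16>.
static_assert(std::is_same<Cstring<roundedUp(13)>, Cstring<16>>::value,
              "the convention collapses nearby sizes into one class");
```

The cost of such a convention is a few unused bytes per field in exchange for fewer distinct classes.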
For text manipulation, the STL's string
class provides far more power and flexibility than
our little contiguous string class. Cstring
is recommended mainly for fields within a data record that:
If readers are interested, we'll show the code in a later article.