Conrad Weisert, November 1, 2003
©2003 Information Disciplines, Inc.
This article may be freely circulated, as long as the copyright credit is included.
Andrew Koenig & Barbara Moo offered helpful advice about character-string manipulation in the August C/C++ Users Journal. They were aiming especially at experienced programmers who had gotten into the habit of using C's crude array of char rather than more modern techniques.
In two earlier articles, I offered some comments and clarifications, among them a discussion of the char way of representing character strings. (August Issue of the Month)
We come now to confronting the string class itself. Everyone surely agrees
that applications programming, both business and scientific, demands the
capabilities provided by one or more character-string classes. We also
know that the C++ standard library now contains a string class,
std::string. Does that library
class take care of all the needs of new and existing programs?
Unfortunately, recognition of the need and availability of
std::string
didn't occur at the same time. From
the first day C++ was unveiled, programmers recognized that the language's
class definition capability held the key to solving the long-standing
C character-string problem.
Many of those programmers went to work designing and implementing character-string classes. Those classes spanned a huge range in quality and usability. Some of them were distributed by vendors of compilers or class libraries. Others were established as standard within developer organizations. By the mid-1990s the best of them supported character-string handling comparable to that of PL/I or extended BASIC.
Of course we all agree with Koenig & Moo that robust, maintainable
programs must avoid C's array of char string
representations. But we may or may not be able to embrace
std::string
easily, if we already have an investment in
other character string classes. For some, that's just an irritating and
possibly costly conversion issue. Others, however, are finding that
std::string
doesn't support everything they need.
The std::string class supports
varying length strings with no length limit. (There is presumably some
implementation-defined maximum size, but given today's huge memory sizes
it's likely to exceed reasonable applications' requirements.)
You declare a string and then assign data to it ranging from the null
(0 length) string to an entire book.
That's equivalent to what extended BASIC supports.
Furthermore, the internal representation is not contiguous with the object. The string object contains only a pointer, which may point either to the actual character data or, more likely, to a second pointer (the so-called "reference counting" technique). That's not an implementation choice, but is dictated by limitations of the underlying C language.
But applications, especially business applications, also need:
Many data fields need to fit in a confined space, e.g. a mailing label or a screen form. It would make no sense, for example, to allow a 150-character string to be assigned to a cityName field, and it would complicate program logic to have to test for and adjust maximum field sizes throughout procedural code.
Other data may need to fill up a predefined space, such as a column on a report. Modern report writers and variable-space fonts have reduced but not eliminated such needs.
Many text data fields are components of a record, which is moved as a unit, especially between internal memory and a file. Having to go through cumbersome and error-prone extra steps to serialize a record on output and (worse) to reconstruct it on input is an unnatural burden. A fixed-length text field, like a numeric field, ought to be embedded in the record.
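The record-as-a-unit requirement can be illustrated with a small sketch. The record types and field names below are hypothetical, not taken from any real application. The sketch only shows that a record with an embedded character array is a flat block of bytes, while the same record built on std::string is not.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical record whose text field is embedded as a fixed-size char array.
struct CustomerRecord {
    char cityName[18];
    long accountNumber;
};

// The same hypothetical record with a std::string field.
struct CustomerRecordDynamic {
    std::string cityName;
    long accountNumber;
};

// The first record is a flat block of bytes and can be moved as a unit.
static_assert(std::is_trivially_copyable<CustomerRecord>::value,
              "embedded field keeps the record trivially copyable");
// The second refers to character data stored elsewhere, so it has to be
// serialized field by field.
static_assert(!std::is_trivially_copyable<CustomerRecordDynamic>::value,
              "a std::string member rules out a byte-for-byte copy");

// Move a whole record through a byte buffer, standing in for a file write
// followed by a file read.
inline CustomerRecord roundTrip(const CustomerRecord& in) {
    unsigned char buffer[sizeof(CustomerRecord)];
    std::memcpy(buffer, &in, sizeof in);
    CustomerRecord out;
    std::memcpy(&out, buffer, sizeof out);
    return out;
}

inline void demoRoundTrip() {
    CustomerRecord rec = {};
    std::memset(rec.cityName, ' ', sizeof rec.cityName);  // blank-fill the field
    std::memcpy(rec.cityName, "Springfield", 11);         // then store the text
    rec.accountNumber = 42;

    const CustomerRecord copy = roundTrip(rec);
    assert(std::memcmp(copy.cityName, rec.cityName, sizeof rec.cityName) == 0);
    assert(copy.accountNumber == 42);
}
```

Raw char arrays are used here only to make the storage layout visible; they are exactly what a proper fixed-length string class is meant to hide.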
Note that many older programming languages, including PL/I and COBOL, supported just those capabilities. When I teach C++ or Java to an audience of former COBOL programmers, they're appalled by the trouble you have to go to in order to handle what they consider the simplest and most straightforward kind of everyday data manipulation.
One explanation is that many applications view character strings as elementary data fields, while C/C++/Java programmers have come to view character strings as containers.
Now, if you don't need those capabilities and you determine that
std::string meets all your needs,
then that's the only string class you should use, and you can stop
reading here.
I'm going to describe (but not recommend for you) the character-string capabilities
we've been using for internal and client applications since the early 1990s.
We've gotten used to them over more than a decade, we like them, and we
continue to use them, even in the face of
std::string.
| Class | Description |
| Dstring | Dynamic (like std::string): Assignment that changes the size will cause memory reallocation. |
| Fstring | Fixed-length: Once a string is constructed it stays the same size. Assignment can truncate or pad with blanks. |
| Vstring | Varying string: Like Dstring except that the maximum size is specified (and allocated) upon construction, like a PL/I varying string. |
| Cstring | Constant-length: Data embedded within the object. Size must be known at compile time. |
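As a rough illustration of the fixed-length behavior, here is a minimal sketch of a class whose size is set at construction and never changes afterward. The class name FixedText and its use of std::string for storage are assumptions made for this example; it is not the Fstring implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a fixed-length string: the length is set once,
// at construction, and every later assignment is truncated or blank-padded.
class FixedText {
public:
    explicit FixedText(std::size_t length) : data_(length, ' ') {}

    FixedText& operator=(const std::string& text) {
        const std::size_t length = data_.size();
        data_.assign(text, 0, length);  // copies at most `length` characters
        data_.resize(length, ' ');      // pads with blanks if text was shorter
        return *this;
    }

    std::size_t size() const { return data_.size(); }
    const std::string& str() const { return data_; }

private:
    std::string data_;  // storage choice is an assumption of this sketch
};

inline void demoFixedText() {
    FixedText city(8);
    city = "Oslo";
    assert(city.str() == "Oslo    ");  // padded with four blanks
    city = "Albuquerque";
    assert(city.str() == "Albuquer");  // truncated to eight characters
    assert(city.size() == 8);          // the size never changes
}
```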
Objects of those classes interact with each other in the expected ways.
Mixed expressions may cause implicit conversions to the most general class,
Dstring, and may slow performance.
Vstring is provided for
efficiency in situations where a program is building up a long string
by successive concatenations.
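The efficiency argument for allocating the maximum size up front has a rough analogue in std::string::reserve. The sketch below uses only the standard library; the function name joinWithReserve is made up for the example. Unlike a Vstring, the reserved capacity is a hint rather than a hard maximum.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build one long string from many pieces, requesting the full capacity
// before the first concatenation.
inline std::string joinWithReserve(const std::vector<std::string>& pieces) {
    std::size_t total = 0;
    for (const std::string& piece : pieces) total += piece.size();

    std::string result;
    result.reserve(total);               // ask for room for the whole result
    assert(result.capacity() >= total);  // guaranteed after reserve()

    // The appends below stay within the reserved capacity, so common
    // implementations do not need to reallocate while the string grows.
    for (const std::string& piece : pieces) result += piece;
    return result;
}
```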
Cstring uses class templates, e.g.
    Cstring<18> cityName;
We advise users to keep the number of such classes reasonable, and to limit Cstring data to fields within records and to internal tables.
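To make the idea of a template-based constant-length string concrete, here is a minimal sketch. The name ConstText, its members, and the Address record are all hypothetical; this is not IDI's Cstring, only an illustration of what "data embedded within the object" implies for object size and record layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical constant-length string whose size is a template argument.
template <std::size_t N>
class ConstText {
public:
    ConstText() { std::memset(chars_, ' ', N); }  // starts out all blanks

    ConstText& operator=(const std::string& text) {
        const std::size_t n = text.size() < N ? text.size() : N;
        std::memcpy(chars_, text.data(), n);  // truncate if too long
        std::memset(chars_ + n, ' ', N - n);  // pad with blanks if too short
        return *this;
    }

    std::string str() const { return std::string(chars_, N); }
    static constexpr std::size_t size() { return N; }

private:
    char chars_[N];  // exactly N characters, no pointer and no terminator
};

// The characters live inside the object, so it occupies exactly N bytes.
static_assert(sizeof(ConstText<18>) == 18, "data is embedded in the object");

// Hypothetical record mixing text fields with a numeric field.
struct Address {
    ConstText<30> street;
    ConstText<18> cityName;
    long postalCode;
};
static_assert(std::is_trivially_copyable<Address>::value,
              "the record can be moved as a unit");

inline void demoConstText() {
    ConstText<18> cityName;
    cityName = "Chicago";
    assert(cityName.str() == "Chicago" + std::string(11, ' '));
    assert(cityName.size() == 18);
}
```

Each distinct size instantiates a separate class, which is one reason to keep the number of sizes in a program small.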
We were guided by long years of experience with string handling in PL/I
and later in extended BASIC. The result was somewhat simpler than
std::string, which offers too
many functions with overlapping functionality.
We chose the + operator for concatenation. (PL/I's || has the wrong C precedence.) That requires the more efficient += operator (which, of course, does nothing to a fixed-length left operand).
We also support the output stream insertion operator << and, subject to options described below, the input stream extraction operator >>.
We made heavy use of the macro preprocessor to allow an organization or a project to choose between alternative standards. For example:
| BASIC function names | PL/I function names |
| mid | substr |
| instr | index |
| len | length |
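One way such a naming option could be arranged is sketched below. The macro names (STRING_NAMES_PLI, STR_SUBSTRING, and so on), the namespace textlib, and the use of std::string as the argument type are all assumptions for illustration. The 1-origin positions follow PL/I and BASIC practice and are likewise an assumption about the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical installation option: define STRING_NAMES_PLI to get the PL/I
// names; otherwise the BASIC names are used.
#ifdef STRING_NAMES_PLI
#  define STR_SUBSTRING substr
#  define STR_SEARCH    index
#  define STR_LENGTH    length
#else
#  define STR_SUBSTRING mid
#  define STR_SEARCH    instr
#  define STR_LENGTH    len
#endif

namespace textlib {  // hypothetical namespace

// Each function is written once; the preprocessor supplies its public name.
inline std::size_t STR_LENGTH(const std::string& s) { return s.size(); }

// Substring of up to n characters starting at 1-origin position p.
inline std::string STR_SUBSTRING(const std::string& s, std::size_t p, std::size_t n) {
    if (p == 0 || p > s.size()) return std::string();
    return s.substr(p - 1, n);
}

// 1-origin position of the first occurrence of pattern in s, or 0 if absent.
inline std::size_t STR_SEARCH(const std::string& s, const std::string& pattern) {
    const std::string::size_type at = s.find(pattern);
    return at == std::string::npos ? 0 : at + 1;
}

inline void demoNames() {
    // Written with the macros so the demo compiles under either option.
    assert(STR_LENGTH("character") == 9);
    assert(STR_SUBSTRING("character", 4, 3) == "rac");
    assert(STR_SEARCH("character", "act") == 5);
}

}  // namespace textlib
```

With the default setting a caller writes textlib::mid(s, 4, 3); with STRING_NAMES_PLI defined the same function is spelled textlib::substr(s, 4, 3).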
From BASIC we also took left
and right, from the C library
toUpper and
toLower. We added
trim and
reverse. To avoid confusion with
arrays, we did not implement an overloaded subscript operator.
Some programmers prefer the functional notation substr(s,p,n), while those who are accustomed to C++ or Java objects prefer s.substr(p,n).
Recognizing that programmers are hopelessly divided, we provided an installation option macro to enable one or the other notation (or even both).
In view of std::string and Java's convention of 0-origin position counting, we should have made that choice customizable, and may still do so.
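The notation option could be arranged along the following lines. This is only a sketch: the class Text, the macro names, and the 1-origin position convention are assumptions, not the library's actual design.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical option macros; unless one notation is switched off,
// this sketch enables both.
#if !defined(TEXT_NO_MEMBER_NOTATION)
#  define TEXT_MEMBER_NOTATION 1
#endif
#if !defined(TEXT_NO_FUNCTION_NOTATION)
#  define TEXT_FUNCTION_NOTATION 1
#endif

class Text {  // hypothetical minimal string class
public:
    Text(const char* s) : data_(s) {}
    Text(const std::string& s) : data_(s) {}
    const std::string& str() const { return data_; }

#ifdef TEXT_MEMBER_NOTATION
    // Object notation: s.substr(p, n)
    Text substr(std::size_t p, std::size_t n) const { return slice(p, n); }
#endif

#ifdef TEXT_FUNCTION_NOTATION
    // Functional notation: substr(s, p, n)
    friend Text substr(const Text& s, std::size_t p, std::size_t n) {
        return s.slice(p, n);
    }
#endif

private:
    // Single implementation shared by both notations.
    // Positions are 1-origin here (an assumption of this sketch).
    Text slice(std::size_t p, std::size_t n) const {
        if (p == 0 || p > data_.size()) return Text("");
        return Text(data_.substr(p - 1, n));
    }

    std::string data_;
};

inline void demoNotations() {
    const Text s("character");
    assert(s.substr(4, 3).str() == "rac");   // object notation
    assert(substr(s, 4, 3).str() == "rac");  // functional notation
}
```

Both spellings forward to one private function, so enabling both costs nothing beyond the extra name.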
As an expedient we began Dstring
with the traditional C null-terminated array. That allowed us to use the C
library routines internally. For Cstring,
however, we omit the null terminator, which would be an unwelcome
intrusion in an embedded data field.
From time to time, we consider switching to a reference counting implementation in order to gain some efficiency. We put it off, however, until such time as we encounter serious performance degradation caused by the copy constructors.
The above is not intended as a sales pitch for IDI's library string-classes, but just to show some of the issues in string class design and usage. If you already have one or more string classes that you like, you should continue to use them.
Because of the complexity of the customization options, we're not posting these classes as freeware on this web site. If you think you want them, let's discuss your needs.
Last modified November 11, 2003