by Conrad Weisert
November 2, 2013
©2013 Information Disciplines, Inc.
I have to confront the character-string ordering issue twice this month:
I have to offer some useful advice to both groups about matching and ordering commonly-occurring data items such as names of people. We have to sort lists of names in order to produce directories. Let's review some of the subtleties we have to cater to.
I sometimes provide to students a list of people's names and ask them to write a program to produce a sorted list. If the list includes de la Renta,Oscar and De Gaulle,Charles, they may be surprised to discover Oscar de la Renta at the end of their sorted results. Once they understand character encoding, they know that lower-case characters come after upper-case, and they start thinking of ways to do a case-independent comparison.
At worst, they may have to invoke case conversion methods:
if (s1.toUpper() > s2.toUpper()) swap(s1,s2);
or they may be using a string class that provides a case-independent comparison option.
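The surprise and its cure can be reproduced in a few lines. This is only a sketch in Python, which is my choice of language for illustration; the two extra names in the list are made up so that the effect is visible.

```python
# Sample data; only the first two names come from the discussion above.
names = ["de la Renta,Oscar", "De Gaulle,Charles", "Adams,John", "Zola,Emile"]

# Plain code-point ordering: every upper-case letter precedes every
# lower-case one, so "de la Renta" sinks below "Zola".
plain = sorted(names)

# Case-independent ordering: compare case-folded copies of the strings
# while keeping the original spelling in the result.
folded = sorted(names, key=str.casefold)

print(plain)
print(folded)
```

Sorting on a folded key, rather than converting the data itself, leaves the names exactly as their owners spell them.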
A similar character-conversion strategy can be used to compare some strings containing characters that carry accent marks, such as é and ü. In most European languages those characters collate with their unaccented equivalents, but there are exceptions. In Swedish, for example, ö is a distinct character at the end of the alphabetic sequence, while in German the same graphic just collates with the letter o. Norwegians and Danes avoid that confusion by using the graphic ø instead of the Swedish version, but then disputes may arise in designing general pan-Scandinavian lists.
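One way to get the "collates with its unaccented equivalent" behavior is to decompose each character and discard the accent marks before comparing. The sketch below assumes Python and its standard `unicodedata` module; the function names and the word list are mine, chosen only for illustration.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining marks so that 'é' compares like 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_blind_key(text: str) -> str:
    """Sort key that ignores both accents and case."""
    return strip_accents(text).casefold()

words = ["Muller", "Müller", "Zeta", "élan", "Eton"]
ordered = sorted(words, key=accent_blind_key)
print(ordered)
```

Note that this key gives the German answer for ö, not the Swedish one, and that it leaves ø untouched because ø is not a letter plus a separable accent mark. A Swedish or pan-Scandinavian list would need its own rules.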
When we learn German, we accept that one character position can be equivalent to two. When should an application observe that equivalence, and when is it all right to ignore it? Similar choices exist in other languages; for example, some Norwegians have catered to international alphabets by changing å to aa, but a Norwegian telephone directory may list them together.
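If an application does decide to observe such an equivalence, one simple approach is to expand the single character into its two-letter spelling before comparing. The sketch below assumes that the German equivalence in question is the conventional two-letter spelling of umlauted vowels and ß; the tables and the function name are hypothetical, and a real table would have to come from whoever owns the rules.

```python
# Hypothetical equivalence tables, one per culture (assumptions for illustration).
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
NORWEGIAN_AA = {"å": "aa"}

def expanded_key(text: str, table: dict) -> str:
    """Lower-case the text and replace each listed character by its expansion."""
    return "".join(table.get(ch, ch) for ch in text.lower())

names = ["Mueller", "Müller", "Muller"]
german_order = sorted(names, key=lambda n: expanded_key(n, GERMAN))
print(german_order)

# The two Norwegian spellings now compare equal, so a directory can list them together.
same = expanded_key("Håland", NORWEGIAN_AA) == expanded_key("Haaland", NORWEGIAN_AA)
print(same)
```

This only makes the two spellings compare equal. Where they should then fall in the alphabet is a separate decision that the key does not settle.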
If you're designing an application or a data base that's intended to serve customers, members, vendors, or others in multiple cultures, you'll need to document thoroughly the support your application will give to various alphabets and you'll have to obtain agreement in advance from stakeholders.
On the other hand, if you're developing an application to be based in a single culture and expect it to draw its customers, members, or other users mainly from that culture, you just need to support the kinds of information your users will use and expect. At the very minimum it should support lower-case letters in their dictionary sequence and those single-character accented letters that we expect to collate with their unaccented equivalents. You may have to code custom comparison operators or you may find what you need in a standard component library.
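A custom comparison operator meeting that minimum might look like the following sketch. It assumes Python, where `functools.cmp_to_key` adapts a classic three-way comparison for use in sorting; `compare_names` and the sample directory are hypothetical.

```python
import unicodedata
from functools import cmp_to_key

def _base(text: str) -> str:
    """Reduce a name to unaccented, case-folded letters for comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def compare_names(a: str, b: str) -> int:
    """Return a negative, zero, or positive value, like a classic comparison."""
    ka, kb = _base(a), _base(b)
    if ka != kb:
        return -1 if ka < kb else 1
    # Names that differ only in case or accents: fall back to the raw
    # strings so that the resulting order is deterministic.
    return (a > b) - (a < b)

entries = ["Zola,Emile", "de la Renta,Oscar", "Édouard,Jean",
           "De Gaulle,Charles", "Eaton,Ann"]
directory = sorted(entries, key=cmp_to_key(compare_names))
print(directory)
```

The tie-break on the raw strings is a design choice, not a requirement. Without it, two names that differ only in case or accents would keep whatever relative order they had in the input.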
Last modified November 2, 2013