Extremely common data type rarely defined right . . .

The troublesome Person class

Conrad Weisert, July 1, 2012

Background:

Nearly every textbook or introductory article about objects offers Person as an example of a class that many applications need. Unfortunately, many of them get it wrong.

Among the common failings of sample Person classes are:

These issues are difficult and subtle, but it's worth extra effort to avoid burdening applications with inconsistent and inflexible interpretations of such an important data type.

Is a Person class possible? practical?

Recently one young colleague reacted to my request for help on this problem by asserting that a general, reusable Person class is impossible:

"I cannot imagine a single Person class that covers any one, let alone two of these [application areas]."

If his judgment were valid, then the object paradigm would fail to deliver one of its presumed major advantages: inheritance or the so-called is-a hierarchy. The Person base class needs to specify only what's common to all Person objects. We can then derive whatever specialized subclasses we need for particular application areas.

If that's impossible, then what's the use of class hierarchies?

Confusion with PersonName

A respected textbook contains this Lisp macro example:

  (defstruct person :first-name :last-name)
while a web-site advice forum proposes this Java definition:
    public class Person {
      String firstName;
      String lastName;
     
      public Person(String firstName, String lastName)
       {this.firstName = firstName;
        this.lastName  = lastName;
       }
      // . . .
    }
Those examples tell us nothing about the person represented by an object of the class, except for his or her name in a rather specialized, error-prone, and inflexible form. We might try to correct the problem by renaming this class PersonName and incorporating it into a new Person class:
    public class Person {
      PersonName name;
      // other attributes

      public Person(PersonName name /* , . . . */)   // Constructor
       {this.name = name;
        // . . .
       }

      public PersonName getName()                    // Accessor
       {return name;}
      // . . .
    }
but our PersonName class remains badly flawed. It can't handle either a person with more than two names (George Herbert Walker Bush) or a person with a non-European name (Chiang Kai-Shek). It will accept character strings that are much longer than will fit on a standard mailing label or a reasonable report. It will also accept names containing illegal characters, even non-printing control characters that could confuse subsequent processing or printing.

These problems are easily corrected, but they should be taken care of in the PersonName class, so we can proceed with defining and implementing our Person class.
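As a minimal sketch of the kind of checking that belongs inside PersonName, the class below stores a name as an ordered list of parts instead of a first/last pair, rejects control characters, and enforces a length limit. The 40-character limit and the method names display() and partCount() are assumptions made for illustration only. The sketch does not record which part is the family name, which a real design would need in order to sort or abbreviate names like Chiang Kai-Shek correctly.

```java
import java.util.List;

// Hypothetical sketch: a name is an ordered list of parts, not first/last.
final class PersonName {
    static final int MAX_LENGTH = 40;   // assumed limit, e.g. one mailing-label line
    private final List<String> parts;   // in the order the person writes them

    PersonName(String... parts) {
        if (parts.length == 0) {
            throw new IllegalArgumentException("a name needs at least one part");
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("empty name part");
            }
            for (int i = 0; i < part.length(); i++) {
                if (Character.isISOControl(part.charAt(i))) {
                    throw new IllegalArgumentException("control character in name");
                }
            }
        }
        this.parts = List.of(parts);    // immutable copy
        if (display().length() > MAX_LENGTH) {
            throw new IllegalArgumentException("name longer than " + MAX_LENGTH);
        }
    }

    // The parts joined by single spaces, in their original order.
    String display() { return String.join(" ", parts); }

    int partCount() { return parts.size(); }
}
```

With this shape, new PersonName("George", "Herbert", "Walker", "Bush") and new PersonName("Chiang", "Kai-Shek") are both accepted, while a part containing a tab character is rejected at construction time.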

Confusion with role

We often see role specifications implemented as an is-a hierarchy with Person at the root. Common English usage makes this seem plausible, since we know that:

and so on. What's wrong with that? Quite a lot from an OOP point of view.

For one thing, we can't say which Person attributes should be public: A customer's age is none of a vendor's business, while a patient's age is essential information for a hospital. Attempts to override public accessors with private declarations violate the Liskov substitution principle and impose complications on class hierarchies.

After careful thought we realize that what we informally call an is-a relationship is better expressed as is performed by or is assigned to. Thus the Employee role is assigned to Mary Ferguson or the Customer role is performed by Joe Miller. To express this in OOP, we use a has a relationship, even though it's not exactly idiomatic English. An Employee object has a Person member. When a program instantiates an Employee object it assigns the specified Person to the Employee role.

The designer/implementer of the Employee class is free to determine whether the whole Person object should be accessible and, if not, which Person public attributes should be available through accessor functions.
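The has-a arrangement can be sketched as follows. The Person class here is a deliberately tiny stand-in with a String name, not the class this article is working toward, and the accessor names and the employeeNumber field are illustrative assumptions. The point is only that each role decides for itself how much of its Person to expose.

```java
import java.time.LocalDate;

// Hypothetical minimal Person holding only what this sketch needs.
final class Person {
    private final String name;
    private final LocalDate dateOfBirth;

    Person(String name, LocalDate dateOfBirth) {
        this.name = name;
        this.dateOfBirth = dateOfBirth;
    }

    String getName() { return name; }
    LocalDate getDateOfBirth() { return dateOfBirth; }
}

// A role has a Person; it does not extend Person.
final class Employee {
    private final Person person;          // deliberately not exposed as a whole
    private final String employeeNumber;  // role-specific identifier

    Employee(Person person, String employeeNumber) {
        this.person = person;
        this.employeeNumber = employeeNumber;
    }

    String getEmployeeNumber() { return employeeNumber; }
    String getName() { return person.getName(); }                   // forwarded
    LocalDate getDateOfBirth() { return person.getDateOfBirth(); }  // forwarded
}

// A different role may expose less of the same Person.
final class Customer {
    private final Person person;

    Customer(Person person) { this.person = person; }

    String getName() { return person.getName(); }
    // No date-of-birth accessor: this role's designer chose to hide it.
}
```

The same Person object can be assigned to both roles at once, and no accessor is ever overridden with a narrower visibility, so the substitution problem described above does not arise.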

Confusion between permanent and volatile attributes

A surprising number of Person class examples in textbooks and on the Internet specify age as an int member data item and a corresponding int constructor parameter, as well as an age() accessor function (see note 1) returning int! Of course any programmer beyond beginner level knows that perishable data items don't belong in objects that may well be stored in a database. He or she would almost automatically substitute dateOfBirth.
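A short sketch of the substitution: store the date of birth and compute the age on request. It uses java.time.LocalDate and java.time.Period from the standard library in place of the article's unspecified Date type, and the class name BirthRecord and the method ageOn are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.Period;

// Hypothetical holder for the permanent attribute; age is always derived.
final class BirthRecord {
    private final LocalDate dateOfBirth;

    BirthRecord(LocalDate dateOfBirth) {
        this.dateOfBirth = dateOfBirth;
    }

    // Age in completed years as of the given date.
    int ageOn(LocalDate asOf) {
        if (asOf.isBefore(dateOfBirth)) {
            throw new IllegalArgumentException("date precedes birth");
        }
        return Period.between(dateOfBirth, asOf).getYears();
    }

    // Convenience accessor: age as of today, never stored.
    int age() {
        return ageOn(LocalDate.now());
    }
}
```

A record created with a birth date of 15 June 2000 reports 19 on 14 June 2020 and 20 on 15 June 2020, without the stored object ever changing.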

But most issues of data volatility are more subtle. Certain attributes of a Person never change (or are extremely unlikely to be changed, except to correct an error):

while others may be changed, more or less easily:

Some experts prefer to separate those categories into separate classes, one of which is immutable, while others prefer to integrate them in a single Person class. In either case it must be possible for a user to make appropriate changes, even if only by invoking the constructor to create a fresh object. Some applications will require automatic audit-trail logging.

Note that attributes specific to a particular role do not belong to the general Person class. If we find a Person class with a salary or gradePointAverage attribute, we can infer that its designer didn't understand sound OOP concepts.

Three inheritance hierarchies

It's obvious that natural hierarchies exist among roles. There are different kinds of Employee (manager, part-time, etc.), different kinds of Student (undergraduate, non-credit, etc.), and so on.

There's also a hierarchy among PersonName classes. Chinese names are constructed differently from European names, and many applications have to deal sensibly with both.

But we must also recognize the natural and very simple hierarchy among Person objects themselves:

  1. RealPerson
    1. LivingPerson
    2. DeceasedPerson
  2. FictionalPerson
    1. LiteraryCharacter
    2. MythologicalFigure

We may prefer to represent some of those distinctions either by an inheritance hierarchy or, at the cost of wasted space for irrelevant attributes, by a type code in a Person object.

In particular, it seems sensible to combine LivingPerson and DeceasedPerson into a common RealPerson, using a null or zero value for the dateOfDeath member to make the distinction. It must be possible for an object to make the transition from LivingPerson to DeceasedPerson without requiring changes to every user program (but perhaps with a facility for notifying user programs).
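One way to sketch the combined class follows, taking the null option for dateOfDeath and adding a simple listener list as the notification facility. The method names isLiving, recordDeath, and onDeathRecorded are assumptions for illustration, and java.time.LocalDate stands in for the article's Date type.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Consumer;

// Hypothetical sketch: one class covers both living and deceased persons.
final class RealPerson {
    private final LocalDate dateOfBirth;
    private LocalDate dateOfDeath;   // null while the person is living
    private final List<Consumer<RealPerson>> deathListeners = new ArrayList<>();

    RealPerson(LocalDate dateOfBirth) {
        this.dateOfBirth = Objects.requireNonNull(dateOfBirth);
    }

    boolean isLiving() { return dateOfDeath == null; }

    LocalDate getDateOfDeath() { return dateOfDeath; }

    // User programs that care about the transition register here.
    void onDeathRecorded(Consumer<RealPerson> listener) {
        deathListeners.add(listener);
    }

    // The living-to-deceased transition happens in place; the object's
    // type and identity do not change, so existing references stay valid.
    void recordDeath(LocalDate date) {
        if (!isLiving()) {
            throw new IllegalStateException("death already recorded");
        }
        if (date.isBefore(dateOfBirth)) {
            throw new IllegalArgumentException("death precedes birth");
        }
        dateOfDeath = date;
        for (Consumer<RealPerson> listener : deathListeners) {
            listener.accept(this);
        }
    }
}
```

Programs that never ask isLiving() are unaffected by the transition, and those that registered a listener are told exactly once.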

The unique identifier problem

This has been a major obstacle to defining a widely reusable Person class.

What can we do about that?

We mustn't give up or consider this problem insoluble. We often read in our newspapers about people who have been pursued by collection agencies or even arrested and imprisoned because of identity confusion. The problem will not cure itself.

Three identifiers

To identify an individual we may need three identifier fields. The first two are members of the Person object:

  1. An original (possibly interim) identifier. This lets us store the Person record in a database even when the permanent identifier is unknown or hasn't been assigned yet. It needs to be unique only within the application that owns the database.

  2. A permanent (or current) identifier, such as a U.S. Social Security number. It must be globally unique.

Application logic could determine whether and when to discard the original interim identifier.

In addition the various Role classes would have their own identifiers, such as employeeNumber. They would serve as primary keys in application-specific databases.

A base class

Almost any application can use and possibly extend this base class:

  class RealPerson {
      PersonID   interimID;
      PersonID   currentID;
      PersonName name;
      Sex        gender;
      Date       dateOfBirth;  //  January 1, 0001 if unspecified
      Date       dateOfDeath;  //  less than dateOfBirth, if living
      // . . .
  }
We leave for representation in databases . . . We'll provide the usual methods, including (at least) read accessors for all the member data and write (set) accessors for currentID and dateOfDeath. An accessor for age could eliminate the need for user programs to perform Date calculations.

Some programmers regard a Sex class as overkill, and prefer a simple boolean or single character representation of gender, but even then it should be a standard within an organization.

Following a top-down design approach, we now must design the PersonName and PersonID classes.

Unsolved problem:   How do we implement the equality relational operator? What if the currentIDs are unspecified and everything else is identical except the interimIDs? Can a relational operator return maybe?
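To make the question concrete, here is one possible policy sketched in Java; it is an exploration, not a resolution of the problem. Java's equals must return boolean, so the sketch uses a separate method returning a three-valued enum. The names Match, PersonKey, and sameIndividualAs are hypothetical, the fields are plain strings to keep the sketch short, and the rules themselves are assumptions: equal interimIDs are treated as conclusive only because both records are assumed to come from the same application database.

```java
// Hypothetical three-valued answer to "are these the same individual?"
enum Match { SAME, DIFFERENT, MAYBE }

// Hypothetical reduced view of a person record, used only for matching.
final class PersonKey {
    private final String interimID;    // never null
    private final String currentID;    // null if not yet assigned
    private final String name;
    private final String dateOfBirth;  // text form, for brevity only

    PersonKey(String interimID, String currentID, String name, String dateOfBirth) {
        this.interimID = interimID;
        this.currentID = currentID;
        this.name = name;
        this.dateOfBirth = dateOfBirth;
    }

    Match sameIndividualAs(PersonKey other) {
        // Two permanent identifiers settle the question either way.
        if (currentID != null && other.currentID != null) {
            return currentID.equals(other.currentID) ? Match.SAME : Match.DIFFERENT;
        }
        // Assumed: both records live in the same application database.
        if (interimID.equals(other.interimID)) {
            return Match.SAME;
        }
        // Any descriptive difference is treated as conclusive (an assumption).
        if (!name.equals(other.name) || !dateOfBirth.equals(other.dateOfBirth)) {
            return Match.DIFFERENT;
        }
        // Everything matches except the interim identifiers.
        return Match.MAYBE;
    }
}
```

Under this policy the case described above, with unspecified currentIDs and everything identical except the interimIDs, yields MAYBE, and it is left to the calling application to decide what to do with that answer.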

Note that the default (unspecified) date lies outside the Gregorian calendar range. That's all right as long as a user program doesn't have to handle exact dates in antiquity.

To be continued . . .

In this article we've looked at some of the difficulties in designing classes to support the concept of Person. Later we'll propose one or more specific solutions. If you'd like to share ideas, clarifications, disagreements, or actual code, please do so.

Conrad Weisert
cweisert@acm.org


1. Or getAge(); see "You can get some of the data all of the time".

Last modified 20 April 2013
