by Conrad Weisert
© 2004 Information Disciplines, Inc., Chicago
IDI's Issue of the Month, August, 2004
NOTE: This document may be circulated or quoted from freely, as long as the copyright notice is included.
In the past month conspicuous and extremely expensive system outages brought down the operation of two major businesses:
The explanation given for those major failures was the same: Unexpected problems were encountered during a "software upgrade"!
Astonishing as that may be, the reactions of the popular press and some corporate spokesmen were even more appalling:
That's just the nature of I.T. Applying a major update to a large system is always fraught with risk. Just as the media viewed the Y2K crisis as an act of God, no one is to blame. Bad luck!
Well, someone is indeed to blame for such fiascos. Updating a major system, especially while it's running, may be complicated, difficult, and even expensive, but it is certainly not unmanageable. It may also be risky, but the risks should be limited to not being able to complete the update on schedule. An unscheduled system "outage" should never be a result of a routine software version upgrade.
Update, April 21, 2007
It happened again this week! Users of BlackBerry® mobile devices were left without communication services for a half-day, reportedly as a result of inadequate testing of a software upgrade by the service provider. Given that many users worldwide have come to depend on their BlackBerry devices for routine business activities, the impact was considered extremely serious.
I don't know any details of the specific causes of the Chicago Tribune and American Airlines failures. Perhaps some will come to light in the trade press. Meanwhile, it's clear that, barring deliberate sabotage, there were serious failures of management discipline.
Everyone agrees that changes to a "mission critical" system have to be subjected to the most thorough testing, and no doubt the decision makers believed that such testing had been satisfactorily completed. One wonders, however:
Unfortunately, some current operating system and database management platforms present obstacles to thorough testing. While the structured and object revolutions have been making individual programs more modular, shared dynamic libraries and global databases have been making it easier for updates to one application system to interfere with other applications.
A system upgrade or conversion must be planned as a project. Unlike most other projects, the tasks on the plan span hours or even minutes rather than days. In addition, the plan must rigorously specify:
Unexpected obstacles sometimes arise during the late stages of software installation. The team members are typically under pressure to continue and to complete a successful installation. In order to salvage the effort, which may have involved costly and highly visible preparation, the team may be strongly tempted to improvise a solution -- a workaround or even a bit of new or modified software.
That temptation must be firmly resisted. Experience clearly shows that hasty improvisation all too often leads to worse problems.
The best way to deal with such situations is to anticipate them on the project plan and to specify well in advance what we will do if a particular situation should arise. For example:
The project plan must specify those contingencies; the project team and all levels of management must have the discipline to stick to it. It may be extremely embarrassing to abort an installation or upgrade, but once the plan is made and approved, you have no choice but to blow the whistle and take the agreed-upon action.
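One way to picture that discipline is a plan in which every step carries its response to failure, decided before the installation begins. The following Python fragment is only a minimal sketch under assumed names (`Step`, `Action`, and `execute_plan` are hypothetical, not part of any real tool): the point is that the code running the plan never improvises; it only looks up what was agreed upon.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    """Responses agreed upon in advance (a hypothetical, minimal set)."""
    PROCEED = "proceed"      # tolerable failure: note it and carry on
    BACK_OUT = "back out"    # stop and restore the prior working state


@dataclass
class Step:
    name: str
    run: Callable[[], bool]  # returns True if the step succeeded
    on_failure: Action       # written into the plan, never decided on the spot


def execute_plan(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run the steps in order and apply the pre-agreed action on failure.

    Returns (completed, log). completed is False as soon as a failed
    step calls for backing out; later steps are not attempted.
    """
    log: List[str] = []
    for step in steps:
        if step.run():
            log.append(f"{step.name}: ok")
            continue
        log.append(f"{step.name}: FAILED -> {step.on_failure.value}")
        if step.on_failure is Action.BACK_OUT:
            return False, log
    return True, log
```

Calling `execute_plan` with a list of steps yields a flag and a log that management can review afterward. A real plan would also attach time limits and named decision makers to each step; those are omitted here for brevity.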
An essential property of the contingency plans is a well-defined back-out procedure. We need to be able to restore the system to its working state prior to the start of the update or conversion.
For a mission-critical system, of course, the back-out procedures themselves must be thoroughly tested in advance.
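A rehearsal of that kind can be sketched in a few lines. The Python below is an illustration only: the system is modeled as a plain dictionary, and `rehearse_back_out`, `upgrade`, `good_back_out`, and `faulty_back_out` are hypothetical names chosen for the example. The idea is simply to apply the upgrade to a test copy, back it out, and confirm that the result is identical to the state we started from.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]


def rehearse_back_out(
    system: State,
    upgrade: Callable[[State], None],
    back_out: Callable[[State, State], None],
) -> bool:
    """Apply the upgrade, back it out, and report whether the state matches."""
    before = copy.deepcopy(system)    # reference copy used for the comparison
    snapshot = copy.deepcopy(system)  # what the back-out procedure may use
    upgrade(system)
    back_out(system, snapshot)
    return system == before


def upgrade(system: State) -> None:
    """Stand-in for an upgrade: bump the version and add a table."""
    system["version"] = 2
    system["tables"].append("audit")


def good_back_out(system: State, snapshot: State) -> None:
    """Restore everything from the snapshot."""
    system.clear()
    system.update(copy.deepcopy(snapshot))


def faulty_back_out(system: State, snapshot: State) -> None:
    """Restores the version but forgets the table the upgrade added."""
    system["version"] = snapshot["version"]
```

Rehearsing with `good_back_out` reports a match, while `faulty_back_out` does not, which is exactly the kind of defect we want to find on a test system rather than during the real cut-over.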
Wherever feasible, the ability to back out an upgrade should extend beyond the instant of cut-over. We may well discover a serious problem after the new version has been in operation for several hours or days. The ability to return to the previous working version while preserving updates the new version made to the database can be well worth the extra effort in planning.
Last updated April 22, 2007