After crash investigators consider the weather as a
factor in a plane crash, they look at the airplane
itself. Was there something in the plane's design that
caused the crash? Was it carrying too much weight?
In IT project failures, similar questions invariably
come up regarding the project's technical components:
the hardware and software used to develop the system and
the development practices themselves. Organizations are
often seduced by the siren song of the technological
imperative—the uncontrollable urge to use the latest
technology in hopes of gaining a competitive edge. With
technology changing fast and promising fantastic new
capabilities, it is easy to succumb. But using immature
or untested technology is a sure route to failure.
In 1997, after spending $40 million, the state of
Washington shut down an IT project that would have
processed driver's licenses and vehicle registrations.
Motor vehicle officials admitted that they got caught up
in chasing technology instead of concentrating on
implementing a system that met their requirements. The
IT debacle that brought down FoxMeyer Drug a year
earlier also stemmed from adopting a state-of-the-art
resource-planning system and then pushing it beyond what
it could feasibly do.
A project's sheer size is a fountainhead of failure.
Studies indicate that large-scale projects fail three to
five times more often than small ones. The larger the
project, the more complexity there is in both its static
elements (the discrete pieces of software, hardware, and
so on) and its dynamic elements (the couplings and
interactions among hardware, software, and users;
connections to other systems; and so on). Greater
complexity increases the possibility of errors, because
no one really understands all the interacting parts of
the whole or has the ability to test them.
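One rough way to see why the dynamic elements grow faster than the static ones is to count possible pairwise couplings. The sketch below is only an illustration: it assumes that any two components might interact, which overstates some systems and understates others (interactions can involve three or more parts), and the function name is hypothetical.

```python
from math import comb


def pairwise_couplings(n_components: int) -> int:
    """Count distinct pairs among n components: n * (n - 1) / 2."""
    return comb(n_components, 2)


# Static elements grow linearly; potential couplings grow quadratically.
for n in (10, 100, 1000):
    print(f"{n:>5} components -> {pairwise_couplings(n):>7} possible couplings")
```

Under this assumption, a tenfold increase in components produces roughly a hundredfold increase in couplings that someone would need to understand and test.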
Sobering but true: it's impossible to thoroughly test
an IT system of any real size. Roger S. Pressman pointed
out in his book Software Engineering, one of the classic
texts in the field, that "exhaustive testing presents
certain logistical problems....Even a small 100-line
program with some nested paths and a single loop
executing less than twenty times may require 10 to the
power of 14 possible paths to be executed." To test all
of those 100 trillion paths, he noted, assuming each
could be evaluated in a millisecond, would take 3170
years.
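The arithmetic behind that figure is easy to check. The sketch below takes the path count and the one-millisecond evaluation time from the quotation; the 365-day year and the function name are assumptions made for illustration.

```python
# Back-of-the-envelope check of the exhaustive-testing figure quoted above.
PATHS = 10**14                          # possible execution paths (from the quotation)
MS_PER_PATH = 1                         # one millisecond per path (from the quotation)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumption: a 365-day year


def years_to_test(paths: int, ms_per_path: int) -> float:
    """Return the years needed to run every path once, back to back."""
    total_seconds = paths * ms_per_path / 1000
    return total_seconds / SECONDS_PER_YEAR


years = years_to_test(PATHS, MS_PER_PATH)
print(f"about {years:,.0f} years")
```

The result lands within a year of the quoted figure, with the small difference depending on how a year is defined.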
All IT systems are intrinsically fragile. In a large
brick building, you'd have to remove hundreds of
strategically placed bricks to make a wall collapse. But
in a 100,000-line software program, it takes only one or
two bad lines to produce major problems. In 1991, a
portion of AT&T's telephone network went out,
leaving 12 million subscribers without service, all
because of a single mistyped character in one line of
code.
Sloppy development practices are a rich source of
failure, and they can cause errors at any stage of an IT
project. To help organizations assess their
software-development practices, the U.S. Software
Engineering Institute, in Pittsburgh, created the
Capability Maturity Model, or CMM. It rates a company's
practices against five levels of increasing maturity.
Level 1 means the organization is using ad hoc and
possibly chaotic development practices. Level 3 means
the company has characterized its practices and now
understands them. Level 5 means the organization
quantitatively understands the variations in the
processes and practices it applies.
As of January, nearly 2000 government and commercial
organizations had voluntarily reported CMM levels. Over
half acknowledged being at either level 1 or 2, 30
percent were at level 3, and only 17 percent had reached
level 4 or 5. The percentages are even more dismal when
you realize that this is a self-selected group;
obviously, companies with the worst IT practices won't
subject themselves to a CMM evaluation. (The CMM is
being superseded by the CMM-Integration, which aims for
a broader assessment of an organization's ability to
create software-intensive systems.)
Immature IT practices doomed the U.S. Internal
Revenue Service's $4 billion modernization effort in
1997, and they have continued to plague the IRS's
current $8 billion modernization. It may just be
intrinsically impossible to translate the tax code into
software code—tax law is complex and based on
often-vague legislation, and it changes all the time.
From an IT developer's standpoint, it's a requirements
nightmare. But the IRS hasn't been helped by open
hostility between in-house and outside programmers, a
laughable underestimation of the work involved, and many
other bad practices.
The pilot's actions just before a plane crashes are always of great
interest to investigators. That's because the pilot is
the ultimate decision-maker, responsible for the safe
operation of the craft. Similarly, project managers play
a crucial role in software projects and can be a major
source of errors that lead to failure.
Back in 1986, the London Stock Exchange decided to
automate its system for settling stock transactions.
Seven years later, after spending $600 million, it
scrapped the Taurus system's development, not only
because the design was excessively complex and
cumbersome but also because the management of the
project was, to use the word of one of its own senior
managers, "delusional." As investigations revealed, no
one seemed to want to know the true status of the
project, even as more and more problems appeared,
deadlines were missed, and costs soared [see box,
""].
The most important function of the IT project manager
is to allocate resources to various activities. Beyond
that, the project manager is responsible for project
planning and estimation, control, organization, contract
management, quality management, risk management,
communications, and human resource management.
Bad decisions by project managers are probably the
single greatest cause of software failures today. Poor
technical management, by contrast, can lead to technical
errors, but those can generally be isolated and fixed.
However, a bad project management decision—such as
hiring too few programmers or picking the wrong type of
contract—can wreak havoc. For example, the developers
of the doomed travel reservation system claim that they
were hobbled in part by the use of a fixed-price
contract. Such a contract assumes that the work will be
routine; the reservation system turned out to be
anything but.
Project management decisions are often tricky
precisely because they involve tradeoffs based on fuzzy
or incomplete knowledge. Estimating how much an IT
project will cost and how long it will take is as much
art as science. The larger or more novel the project,
the less accurate the estimates. It's a running joke in
the industry that IT project estimates are at best
within 25 percent of their true value 75 percent of the
time.
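A claim of the form "within 25 percent, 75 percent of the time" can be checked against a project history. The sketch below is one way to do that, not a standard tool: the function name and the sample figures are hypothetical, and the error is measured relative to the actual value.

```python
def fraction_within(estimates, actuals, tolerance=0.25):
    """Fraction of estimates whose error is within `tolerance` of the actual value."""
    if len(estimates) != len(actuals) or not actuals:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        abs(estimate - actual) <= tolerance * actual
        for estimate, actual in zip(estimates, actuals)
    )
    return hits / len(actuals)


# Hypothetical project durations in months (illustrative data only).
estimated_months = [10, 12, 8, 20]
actual_months = [12, 13, 16, 22]

share = fraction_within(estimated_months, actual_months)
print(f"{share:.0%} of estimates were within 25 percent of the actual value")
```

In this made-up history, three of the four estimates fall inside the 25 percent band; the third project, which took twice as long as planned, does not.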
There are other ways that poor project management can
hasten a software project's demise. A study by the
Project Management Institute, in Newtown Square, Pa.,
showed that risk management is the least practiced of
all project management disciplines across all industry
sectors, and nowhere is it more infrequently applied
than in the IT industry. Without effective risk
management, software developers have little insight into
what may go wrong, why it may go wrong, and what can be
done to eliminate or mitigate the risks. Nor is there a
way to determine what risks are acceptable, in turn
making project decisions regarding tradeoffs almost
impossible.
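A minimal illustration of what even basic risk management provides is a risk register ranked by exposure. The sketch below assumes the common convention that exposure equals probability times cost of impact; that formula, the class and function names, the threshold, and all figures are assumptions for illustration and do not come from the study cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float   # chance the risk occurs, between 0 and 1
    impact_cost: float   # estimated cost in dollars if it occurs

    @property
    def exposure(self) -> float:
        """Expected loss: probability multiplied by impact."""
        return self.probability * self.impact_cost


def rank_by_exposure(risks):
    """Return the risks ordered from highest to lowest exposure."""
    return sorted(risks, key=lambda risk: risk.exposure, reverse=True)


def is_acceptable(risk: Risk, threshold: float) -> bool:
    """A risk is treated as acceptable if its exposure does not exceed the threshold."""
    return risk.exposure <= threshold


# Hypothetical register (illustrative figures only).
register = [
    Risk("key developer leaves", 0.25, 200_000),
    Risk("requirements change late", 0.5, 400_000),
    Risk("vendor library is immature", 0.125, 800_000),
]

for risk in rank_by_exposure(register):
    verdict = "accept" if is_acceptable(risk, threshold=75_000) else "mitigate"
    print(f"{risk.name}: exposure ${risk.exposure:,.0f} -> {verdict}")
```

Even a list this crude makes the tradeoff explicit: it says what may go wrong, how much it could cost, and which risks the project has chosen to live with.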
Poor project management takes many other forms,
including bad communication, which creates an
inhospitable atmosphere that increases turnover; not
investing in staff training; and not reviewing the
project's progress at regular intervals. Any of these
can help derail a software project.
The last area that investigators look into after a plane crash
is the organizational environment. Does the airline have
a strong safety culture, or does it emphasize meeting
the flight schedule above all? In IT projects, an
organization that values openness, honesty,
communication, and collaboration is more apt to find and
resolve mistakes early enough that rework doesn't become
overwhelming.
If there's a theme that runs through the tortured
history of bad software, it's a failure to confront
reality. On numerous occasions, the U.S. Department of
Justice's inspector general, an outside panel of
experts, and others told the head of the FBI that the
VCF system was impossible as defined, and yet the
project continued anyway. The same attitudes existed
among those responsible for the travel reservation
system, the London Stock Exchange's Taurus system, and
the FAA's air-traffic-control project—all indicative of
organizational cultures driven by fear and arrogance.
A recent report by the National Audit Office in the
UK found numerous cases in which government IT projects
were advised not to go forward yet continued anyway. The
UK even has a government department charged with
preventing IT failures, but as the report noted, more
than half of the agencies the department oversees
routinely ignore its advice. I call this type of
behavior irrational project escalation—the inability to
stop a project even after it's obvious that the
likelihood of success is rapidly approaching zero.
Sadly, such behavior is in no way unique.
In the final analysis, big software failures tend to
resemble the worst conceivable airplane crash, where the
pilot was inexperienced but exceedingly rash, flew into
an ice storm in an untested aircraft, and worked for an
airline that gave lip service to safety while cutting
back on training and maintenance. If you read the
investigator's report afterward, you'd be shaking your
head and asking, "Wasn't such a crash inevitable?"
So, too, the reasons that software projects fail are
well known and have been amply documented in countless
articles, reports, and books [see sidebar, ]. And
yet, failures, near-failures, and plain old bad software
continue to plague us, while practices known to avert
mistakes are shunned. It would appear that getting
quality software on time and within budget is not an
urgent priority at most organizations.
It didn't seem to be at Oxford Health Plans Inc., in
Trumbull, Conn., in 1997. The company's automated
billing system was vital to its bottom line, and yet
senior managers there were more interested in expanding
Oxford's business than in ensuring that its billing
system could meet its current needs [see box,
""]. Even as problems arose, such as
invoices' being sent out months late, managers paid
little attention. When the billing system effectively
collapsed, the company lost tens of millions of dollars,
and its stock dropped from $68 to $26 per share in one
day, wiping out $3.4 billion in corporate value.
Shareholders brought lawsuits, and several government
agencies investigated the company, which was eventually
fined $3 million for regulatory violations.
Even organizations that get burned by bad software
experiences seem unable or unwilling to learn from their
mistakes. In a 2000 report, the U.S. Defense Science
Board, an advisory body to the Department of Defense,
noted that various studies commissioned by the DOD had
made 134 recommendations for improving its software
development, but only 21 of those recommendations had
been acted on. The other 113 were still valid, the board
noted, but were being ignored, even as the DOD
complained about the poor state of defense software
development!
Some organizations do care about software quality, as
the experience of the software development firm Praxis
High Integrity Systems, in Bath, England, proves. Praxis
demands that its customers be committed to the project,
not only financially, but as active participants in the
IT system's creation. The company also spends a
tremendous amount of time understanding and defining the
customer's requirements, and it challenges customers to
explain what they want and why. Before a single line of
code is written, both the customer and Praxis agree on
what is desired, what is feasible, and what risks are
involved, given the available resources.
After that, Praxis applies a rigorous development
approach that limits the number of errors. One of the
great advantages of this model is that it filters out
the many would-be clients unwilling to accept the
responsibility of articulating their IT requirements and
spending the time and money to implement them properly.
[See "The Exterminators," in this issue.]
Some level of software failure will always be with us. Indeed, we
need true failures—as opposed to avoidable blunders—to
keep making technical and economic progress. But too
many of the failures that occur today are avoidable. And
as our society comes to rely on IT systems that are ever
larger, more integrated, and more expensive, the cost of
failure may become disastrously high.
Even now, it's possible to take bets on where the
next great software debacle will occur. One of my
leading candidates is the IT systems that will result
from the U.S. government's American Health Information
Community, a public-private collaboration that seeks to
define data standards for electronic medical records.
The idea is that once standards are defined, IT systems
will be built to let medical professionals across the
country enter patient records digitally, giving doctors,
hospitals, insurers, and other health-care specialists
instant access to a patient's complete medical history.
Health-care experts believe such a system of systems
will improve patient care, cut costs by an estimated $78
billion per year, and reduce medical errors, saving tens
of thousands of lives.
But this approach is a mere pipe dream if software
practices and failure rates remain as they are today.
Even by the most optimistic estimates, to create an
electronic medical record system will require 10 years
of effort, $320 billion in development costs, and $20
billion per year in operating expenses—assuming that
there are no failures, overruns, schedule slips,
security issues, or shoddy software. This is hardly a
realistic scenario, especially because most IT experts
consider the medical community to be the least
computer-savvy of all professional enterprises.
Patients and taxpayers will ultimately pay the price
for the development, or the failure, of boondoggles like
this. Given today's IT practices, failure is a distinct
possibility, and it would be a loss of unprecedented
magnitude. But then, countries throughout the world are
contemplating or already at work on many initiatives of
similar size and impact—in aviation, national security,
and the military, among other arenas.
Like electricity, water, transportation, and other
critical parts of our infrastructure, IT is fast
becoming intrinsic to our daily existence. In a few
decades, a large-scale IT failure will become more than
just an expensive inconvenience: it will put our way of
life at risk. In the absence of the kind of industrywide
changes that will mitigate software failures, how much
of our future are we willing to gamble on these
enormously costly and complex systems?
We already know how to do software well. It may
finally be time to act on what we know.