4 October 2007—Aboard the International Space
Station, the three Russian computers that control the
station's orientation have been happily humming away now
for several weeks. That is proof that the crisis in
June, which crippled the ISS and bloodied the U.S.-Russian
partnership that supports it, has been solved.
But the technological—and diplomatic—lessons of that
crisis need to be fully understood and appreciated.
Because if the failure had occurred on the way to Mars,
say, it probably would have been fatal, and it will
likely be the same international partnership that builds
the hardware for a future Mars mission.
The critical computer systems, it turned out, had been
designed, built, and operated incorrectly—and the
failure was inevitable. Only being so relatively close
to Earth, in range of resupply and support missions,
saved the spacecraft from catastrophe.
During the first days of the computer failure in June,
the station's atmosphere control system seized up. The
failure also knocked out the autopilot's ability to fire
maneuvering thrusters to hold the station steady during
the undocking of the space shuttle, which had arrived on
10 June. The terse description in the NASA internal
technical report on the crisis, obtained by IEEE
Spectrum, put it this way: “On 13 June, a complete
shutdown of secondary power to all [three] central
computer and terminal computer channels occurred,
resulting in the loss of capability to control ISS
Russian segment systems.”
Russian officials were quick to blame NASA for
“zapping their computers” with “dirty” 28-volt power
from a newly installed solar power wing. Another Russian
explanation was that the expanded station structure (the
main purpose of the shuttle visit) might be excessively
charging up due to its orbital speed through Earth's
magnetic field. These were the first of many bad guesses
by top Russian program managers that would distract
engineers trying to get at the real problem.
The initial assumption was that some external
interference, such as noise on the power supply, was
responsible for generating false commands inside the
computer system. On the assumption that the bad commands
were coming from inside a power-monitoring device, the
crew bypassed it on two of the three downed computers,
using jumper cables. By the time the shuttle undocked on
19 June, the computers had begun to function normally, or so
it seemed. Replacement parts were quickly manifested on
a robot supply ship, while ground engineers wrestled
with the fundamental question of cause and effect.
Analysis teams still had to determine why the
computers failed, and why the jumper cables seemed to
fix the problem. More important, they needed to know
whether the problem really was fixed, or whether
something could again trigger the systemwide crash of
the supposedly triply redundant architecture.
In the weeks that followed the crisis and apparent
recovery, station commander Fyodor Yurchikhin and his
fellow cosmonaut Oleg Kotov disassembled the boxes and
cabling and inspected every angle of the hardware,
occasionally assisted by their American crewmate,
Clayton Anderson. Multiple scopes and probes had failed
to find the flaw, but their eyes and fingers eventually did.
The connection pins from the power-monitoring device
they'd bypassed earlier, they found, were wet—and
corroded. The final report described the “change in
appearance” of fasteners on one box's connectors and
noted “the presence of deposits and residue on the
housings, and residue and spots on the contact surfaces.”
Continuity checks found that specific wires, called
command lines, in the cable coming out of the device had
failed. And one of those lines had short-circuited.
Also, in a shocking design flaw, there was a “power off”
command leading to all three of the supposedly redundant
processing units. The line was designed to protect the
main computers, which are downstream of the power
monitor, from power glitches too great for normal power
filters to protect against. It does so by turning the
computers off when it senses trouble. But in a failure
unanticipated by its designers, this one command path
itself was able to kill all three processing units due
to a single corrosion-induced short.
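The flaw can be illustrated with a minimal sketch. It is not the station's actual wiring or software: the unit and line names are hypothetical, and the "independent" layout is only an assumed alternative for comparison. The sketch shows why three processing units that share one "power off" command path are not truly redundant against a short on that path.

```python
def surviving_units(wiring, shorted_lines):
    """Return the units still powered after the given command lines short.

    wiring maps each unit name to the name of the "power off" line feeding it.
    A shorted line is treated as asserting a false "power off" on every unit
    wired to it (a simplifying assumption for illustration).
    """
    return [unit for unit, line in wiring.items() if line not in shorted_lines]


# Shared command path: one line reaches all three units, as the article describes.
shared = {"unit_a": "line_1", "unit_b": "line_1", "unit_c": "line_1"}

# Hypothetical alternative: each unit has its own independent line.
independent = {"unit_a": "line_1", "unit_b": "line_2", "unit_c": "line_3"}

print(surviving_units(shared, {"line_1"}))       # []
print(surviving_units(independent, {"line_1"}))  # ['unit_b', 'unit_c']
```

With the shared path, a single short leaves no unit running. With separate paths, the same short costs only one unit.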
That discovery was a great relief to spacecraft
controllers in Houston and Moscow. The bypass jumper
cables were exactly what was needed to circumvent
the false “power off” command, because they forced that
command line to remain dormant. Using the cables did
expose the computers to damage from real power surges,
but by then the power system had settled into a benign
and steady state.
But what caused the corrosion? The source was quickly
identified: water condensation, one of the most frequent
culprits in avionics problems. The NASA report says the
damage “presumably” was “the result of repeated
emissions of condensate from the air separation lines”
of a nearby dehumidifier. Air flow and power usage were
supposed to keep the computer cables warm enough to
prevent water from condensing on them, but the
dehumidifier had been malfunctioning, and its frequent
on-off cycles led to surges of water vapor. Also, a
stream of cold air from another location on the
dehumidifier helped drive the cable temperatures
occasionally below the dew point.
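To make the dew-point condition concrete, here is a rough sketch using the Magnus approximation, a standard empirical formula for dew point. The coefficients are one commonly used set, and the cabin temperature, humidity, and cable temperatures below are hypothetical values chosen only for illustration; the NASA report quoted here gives no such figures.

```python
import math

# Magnus approximation coefficients (one commonly used set, temperatures in deg C).
A = 17.62
B = 243.12


def dew_point_c(air_temp_c, relative_humidity_pct):
    """Approximate dew point in deg C from air temperature and relative humidity."""
    gamma = math.log(relative_humidity_pct / 100.0) + A * air_temp_c / (B + air_temp_c)
    return B * gamma / (A - gamma)


def condensation_expected(surface_temp_c, air_temp_c, relative_humidity_pct):
    """True if a surface is colder than the dew point of the surrounding air."""
    return surface_temp_c < dew_point_c(air_temp_c, relative_humidity_pct)


# Hypothetical cabin conditions: 22 deg C air at 50 percent relative humidity.
print(round(dew_point_c(22.0, 50.0), 1))
print(condensation_expected(18.0, 22.0, 50.0))  # cable kept warm
print(condensation_expected(8.0, 22.0, 50.0))   # cable chilled by a cold air stream
```

Under these assumed conditions the dew point is around 11 degrees C, so a cable held at 18 degrees stays dry while one chilled to 8 degrees collects condensate.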
During the August shuttle visit, the Russians were
able to turn stabilization control over to the American
spaceship and tear down their old computer network. The
boxes and cables were replaced with fresh units, built
and supplied by the European Space Agency and sent up
inside a recently launched robot supply ship.
“Upon removal of the old unit, the crew reported that
there was cold condensate behind it,” notes an internal
NASA ISS status report for 12 August obtained by IEEE
Spectrum. “Drops of humidity and mold were discovered.
The unit itself is humid.”
To add to their headaches, the cosmonauts discovered
that one of the new cables was about 40 centimeters
shorter than the one it was supposed to replace—and it
wouldn't reach. After careful visual inspection of the
original cable, the cosmonauts decided there were no
signs of corrosion, so there was no need to replace it.
They also decided to rig a thermal barrier out of a
surplus reference book and all-purpose gray tape. As a
last step, they removed the jumper cables, verified the
system was functional, and closed the access panels.
It is dismaying that after decades of experience with
manned space stations, Russian space engineers still
couldn't keep unwanted condensation at bay. But what's
worse is that they designed circuitry that would allow
one spot of corrosion to fell a supposedly triply
redundant control computer complex. Another cause for
dismay is that when trouble did develop, the Russians'
first instinct was to blame their American partners.
Such deficiencies need to be worked out in the years
ahead, on the space station, before both the technology
and the diplomacy can be thought reliable enough for
far-ranging missions that replacement shipments wouldn't
be able to reach.