That Cell has more than
one processor core on a single chip is more a
sign of the times than a revolution. All the
microprocessor stalwarts are moving to multicore design.
The principal reason is that the old way of doing
things (increasing the number of calculations per second
by shrinking the processors into a tighter knot of
tinier transistors and then dialing up the clock
speed) has essentially crashed headlong into the brick
wall of heat generation.
Because transistors using today's technology are so
small, even when they are supposed to be in the "off"
state, infinitesimal currents still leak through them.
That leakage warms them constantly, and with the extra
heat generated when transistors switch "on" or "off," it
produces a microfurnace on a chip. If chip makers had
continued on their old path, by the year 2015,
microprocessors would be throwing off more watts per
square millimeter than the surface of the sun.
As a result, the industry has shifted from maximizing
performance to maximizing performance per watt, mainly
by putting more than one microprocessor on a single chip
and running them all well below their top speed. Because
the transistors are switching less frequently, the
processors generate less heat. And because there are at
least two hot spots on each chip, the heat is spread
more evenly over it, so it's less damaging to the
circuitry and easier to get rid of with fans and heat
sinks.
IMAGE: IBM CORP.
CELL CITY MAP: The Cell microprocessor that will power Sony's
PlayStation 3 game console has nine processor
cores. The core making up the left quarter of
the chip is similar to the processors in Apple
computers. The other eight cores, notable by
their columns of memory [brown], are designed to
do multimedia tasks.
Multicore processors on the market today are
generally symmetrical—that is, they have two copies of
essentially the same core on one chip. Cell, on the
other hand, has an asymmetric architecture that contains
two different kinds of cores [see photo, "Cell City
Map"]. One, the Power processing element, is similar to
the CPU in a Mac; it runs the Linux operating system and
divides up work for the other eight processors to do.
Those eight—called Synergistic processing elements—are
designed specifically to juggle multimedia applications:
video compression and decompression, encryption and
decryption of copyrighted content, and, especially,
rendering and modifying graphics.
The Synergistic elements were built from the ground
up to do what are called single-precision floating-point
calculations—the kind of operations needed for dazzling
three-dimensional graphics and a host of other
multimedia tasks. The design traded flexibility—a
Synergistic element is not versatile enough to run the
Linux operating system on its own—for eye-popping
speed. When pushed to its 5.6-gigahertz limits, a single
unit can do 44.8 billion single-precision floating-point
calculations per second. Not wanting to cut Cell off
from a role in scientific computing, its designers
included circuitry in each Synergistic element that can
do the more exacting calculations, called
double-precision, that scientists demand, but its
performance is only about one-tenth that of the
single-precision unit.
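The peak figure quoted above can be checked with simple arithmetic. The sketch below assumes, which the article does not state, that each Synergistic element completes one four-wide single-precision multiply-add per clock cycle, counted as eight floating-point operations. The names `SIMD_LANES`, `OPS_PER_LANE_PER_CYCLE`, and `peak_gflops` are illustrative only.

```python
# Back-of-envelope check of the peak rate quoted in the text.
# ASSUMPTION (not stated in the article): one 4-wide SIMD multiply-add
# per cycle, counted as 2 operations per lane = 8 operations per cycle.
SIMD_LANES = 4
OPS_PER_LANE_PER_CYCLE = 2  # multiply + add, assumed


def peak_gflops(clock_mhz: int) -> float:
    """Peak single-precision rate in billions of operations per second."""
    ops_per_cycle = SIMD_LANES * OPS_PER_LANE_PER_CYCLE
    return clock_mhz * ops_per_cycle / 1000


single = peak_gflops(5600)     # 5.6 GHz, the limit quoted in the text
double_estimate = single / 10  # "about one-tenth", per the text
print(single, double_estimate)
```

Under that assumption the single-precision result matches the 44.8 billion calculations per second in the text, and the double-precision estimate comes out near 4.5 billion.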
In fact, the Synergistic elements are so fast that a
single one could easily consume the entire bandwidth on
the interconnects to the off-chip memory, leaving its
siblings starved for data and stalled out. IBM and its
partners had to design a special chunk of circuitry into
Cell just to prevent that problem.
Apart from its raw power, Cell has content-protection
tricks that should make it attractive to multimedia
applications makers. For instance, the Synergistic
element's architecture prevents any application or
external device from accessing the element's local
memory, so that, for instance, a program cannot steal a
music file that is being decrypted by the processor.
"Once you bring your code in and decrypt it, it can
execute in a virtually trusted environment," says IBM's
Cell architect Charles R. Johns. "All the data it
calculates on, sends out, and brings in is fully
protected."
The isolation function can be used in several ways,
says Kahle. "We knew we couldn't anticipate all the
different security needs in the future, but we wanted to
know we had the right hardware to support a very robust
security system."
Barry Minor's Mount Saint Helens simulator is a good
example of how Cell's different processors work together. His
program takes a satellite photo of the volcano, lines it
up with an elevation map, and then turns it into a
detailed 3-D terrain on the fly. The Mount Saint Helens
data has a resolution of 2.4 meters. The city of Austin,
where the Cell design center is, once gave Minor access
to its 15.4-centimeter-resolution satellite map. "You
could land in Michael Dell's backyard and check out his
view," Minor says with a grin.
What's happening inside the processor is a finely
choreographed dance. The Power processing element starts
by figuring out where the joystick is pointing the
simulator in the stored 2-D maps. Then it divides that
scene into 32 portions, four for each Synergistic
element. Though perfectly capable of it, the Power
processing element does no calculations on the actual
data. Instead, it plays to its strength as a controller,
figuring out which chunk of work should go to each of
the other cores according to how complex the scene is
and which cores have more or less time on their hands.
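The article does not say how the Power processing element decides which core gets which portion. As one plausible illustration, the sketch below uses a generic greedy heuristic: hand each portion, costliest first, to whichever worker has the least work queued. The function `assign_portions` and the cost estimates are hypothetical, not taken from Minor's program.

```python
import heapq


def assign_portions(costs, n_workers=8):
    """Greedy load balancing: hand each portion, costliest first, to the
    worker with the least work queued so far.

    costs     -- estimated cost of each portion (hypothetical units)
    n_workers -- number of worker cores (eight Synergistic elements)
    Returns a list of lists of portion indices, one list per worker.
    """
    heap = [(0, w) for w in range(n_workers)]  # (queued cost, worker id)
    heapq.heapify(heap)
    plan = [[] for _ in range(n_workers)]
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, w = heapq.heappop(heap)
        plan[w].append(idx)
        heapq.heappush(heap, (load + costs[idx], w))
    return plan


# Usage: 32 portions with made-up complexity estimates between 1 and 5.
costs = [1 + (i * 7) % 5 for i in range(32)]
plan = assign_portions(costs)
loads = [sum(costs[i] for i in worker) for worker in plan]
print(loads)
```

When every portion costs the same, this reduces to the even split described above, four portions per Synergistic element. When costs differ, busier cores simply receive fewer portions.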
The Synergistic elements then go to work. They pull
their portion of the data into their local memories,
which they can access at great speed. Then each runs a
rendering algorithm on the data and stores it off the
chip in the system memory. When the processors are done,
they signal the Power element, which instructs one of
the Synergistic units to run a video compression
algorithm. That processor compresses its sister units'
finished products and then pushes them out to be
displayed on the screen or streamed to a PDA or some
other device.
Because the compression takes less time than
rendering the graphics, the compressing processor
automatically switches gears when it's finished and runs
the rendering algorithm on a portion of data until it's
needed for compression again. With each frame, the
process starts over.
This dance works so well for two reasons. The first
has to do with the way Cell handles memory. Rather than
waste several clock cycles waiting for the right data to
arrive from memory, a Synergistic element works only on
data stored in its own 256 kilobytes of memory, to which
it has a high-bandwidth connection. More important,
Cell's memory-handling engines can be programmed to keep
data streaming through the processor. "We can get over
128 memory transactions going in flight at once," boasts
Michael N. Day, a distinguished engineer at IBM.
The memory-access engine takes in new data and sends
out the old just in time for the Synergistic unit to
perform the necessary calculations. When Cell runs
Minor's volcano simulator, it waits for data to arrive
from memory for only 1 percent of the time; the G5, in
contrast, stands idle for about 40 percent of the time.
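Why overlapping memory traffic with computation cuts idle time so sharply can be seen in a toy timing model. The sketch below is not a measurement of Cell or the G5: the function `stall_fraction` and the timings passed to it are hypothetical, chosen only to show the shape of the effect when a separate engine fetches the next chunk while the core works on the current one.

```python
def stall_fraction(n_chunks, t_fetch, t_compute, overlap):
    """Fraction of total time the core spends waiting for data.

    overlap=False: fetch a chunk, then compute on it, one after another.
    overlap=True:  a separate engine fetches chunk i+1 while the core
                   computes on chunk i (double buffering).
    Times are in arbitrary, hypothetical units.
    """
    busy = n_chunks * t_compute
    if not overlap:
        total = n_chunks * (t_fetch + t_compute)
    else:
        # Only the first fetch is fully exposed; afterwards the core waits
        # only when a fetch takes longer than the computation it hides behind.
        total = t_fetch + (n_chunks - 1) * max(t_fetch, t_compute) + t_compute
    return (total - busy) / total


# Made-up timings: each fetch takes 2 units, each computation 3 units.
print(stall_fraction(100, 2, 3, overlap=False))
print(stall_fraction(100, 2, 3, overlap=True))
```

With these invented numbers the serial version waits 40 percent of the time, while the overlapped version waits less than 1 percent, because only the very first fetch is left exposed.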
Cell's other key to speed has to do with breaking
problems into parts that can be done in parallel. In
Minor's simulation, it probably seems obvious that an
image can be divided up into eight strips and these
worked on independently. What wasn't so obvious was that
the 3-D rendering could be done four pieces of data at a
time within each Synergistic processor. Such four-way
parallel computing is called single instruction multiple
data, or SIMD, and it is particularly well suited to the
manipulation of graphics and other multimedia.
In these problems, you typically want to perform the
same operation on each of the elements in a large chunk
of data. For example, to increase the brightness of an
image, you'd want to add the same number to every pixel
in it. Since around the mid-1990s, general-purpose
processors such as the Intel x86 architectures have been
doing SIMD computing using a set of multimedia-specific
instructions, explains Princeton's Lee, a multimedia
instructions pioneer.
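The brightness example can be sketched in a few lines. Python does not issue SIMD instructions, so the code below only models the idea: `add_sat_4` stands in for a single instruction that applies the same add to four values at once. The function names are hypothetical, and clamping results to the 0 to 255 range is an assumption about 8-bit pixels that the article does not discuss.

```python
LANES = 4  # four values per instruction, as in the text


def add_sat_4(vec, amount):
    """Model one SIMD instruction: the same saturating add on all four lanes."""
    return [min(255, max(0, v + amount)) for v in vec]


def brighten(pixels, amount):
    """Brighten 8-bit pixel values, four at a time."""
    out = []
    for i in range(0, len(pixels), LANES):
        group = pixels[i:i + LANES]
        pad = LANES - len(group)  # pad a short final group to four lanes
        result = add_sat_4(group + [0] * pad, amount)
        out.extend(result[:len(group)])  # drop the padding lanes
    return out


print(brighten([10, 100, 200, 250, 30], 20))  # -> [30, 120, 220, 255, 50]
```

The loop runs once per group of four pixels rather than once per pixel, which is where the speedup comes from on hardware that executes each group as a single instruction.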
But SIMD instructions run far faster on Cell's
Synergistic processors, because the Cell processors were
designed from the start to handle them. And don't
forget: there are eight such processors on each chip.
Cell programmers spend most of their time turning
complex algorithms into efficient SIMD algorithms, says
Minor. "Once you've done that, you're 80 percent done."