
Winner: Multimedia Monster Continued By Samuel K. Moore

First Published January 2006

That Cell has more than one processor core on a single chip is more a sign of the times than a revolution. All the microprocessor stalwarts are moving to multicore design. The principal reason is that the old way of doing things, increasing the number of calculations per second by shrinking the processors into a tighter knot of tinier transistors and then dialing up the clock speed, has essentially crashed headlong into the brick wall of heat generation.

Because transistors using today's technology are so small, even when they are supposed to be in the "off" state, infinitesimal currents still leak through them. That leakage warms them constantly, and with the extra heat generated when transistors switch "on" or "off," it produces a microfurnace on a chip. If chip makers had continued on their old path, by the year 2015, microprocessors would be throwing off more watts per square millimeter than the surface of the sun.

As a result, the industry has shifted from maximizing performance to maximizing performance per watt, mainly by putting more than one microprocessor on a single chip and running them all well below their top speed. Because the transistors are switching less frequently, the processors generate less heat. And because there are at least two hot spots on each chip, the heat is spread more evenly over it, so it's less damaging to the circuitry and easier to get rid of with fans and heat sinks.

IMAGE: IBM CORP.

CELL CITY MAP: The Cell microprocessor that will power Sony's PlayStation 3 game console has nine processor cores. The core making up the left quarter of the chip is similar to the processors in Apple computers. The other eight cores, notable by their columns of memory [brown], are designed to do multimedia tasks.

Multicore processors on the market today are generally symmetrical—that is, they have two copies of essentially the same core on one chip. Cell, on the other hand, has an asymmetric architecture that contains two different kinds of cores [see photo, "Cell City Map"]. One, the Power processing element, is similar to the CPU in a Mac; it runs the Linux operating system and divides up work for the other eight processors to do. Those eight—called Synergistic processing elements—are designed specifically to juggle multimedia applications: video compression and decompression, encryption and decryption of copyrighted content, and, especially, rendering and modifying graphics.

The Synergistic elements were built from the ground up to do what are called single-precision floating-point calculations—the kind of operations needed for dazzling three-dimensional graphics and a host of other multimedia tasks. The design traded flexibility—a Synergistic element is not versatile enough to run the Linux operating system on its own—for eye-popping speed. When pushed to its 5.6-gigahertz limits, a single unit can do 44.8 billion single-precision floating-point calculations per second. Not wanting to cut Cell off from a role in scientific computing, its designers included circuitry in each Synergistic element that can do the more exacting calculations, called double-precision, that scientists demand, but its performance is only about one-tenth that of the single-precision unit.
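The peak figure quoted above can be checked with a little arithmetic. The sketch below assumes, beyond what the article states, that each Synergistic element works on four single-precision values at once and that a fused multiply-add counts as two operations; the constant names are illustrative only.

```python
# Assumptions not stated in the article: four single-precision lanes per
# Synergistic element, and a fused multiply-add counted as two operations.
CLOCK_HZ = 5.6e9             # clock rate quoted in the article
LANES = 4                    # assumed SIMD width
OPS_PER_LANE_PER_CYCLE = 2   # assumed: one multiply plus one add


def peak_flops(clock_hz, lanes, ops_per_lane):
    """Theoretical peak operations per second for one core."""
    return clock_hz * lanes * ops_per_lane


single = peak_flops(CLOCK_HZ, LANES, OPS_PER_LANE_PER_CYCLE)
double_estimate = single / 10  # "about one-tenth", per the article

print(f"single precision: {single / 1e9:.1f} GFLOPS")
print(f"double precision (rough): {double_estimate / 1e9:.2f} GFLOPS")
```

Under those assumptions the product comes out to the 44.8 billion operations per second cited in the text, and the double-precision estimate is simply a tenth of that.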

In fact, the Synergistic elements are so fast that a single one could easily consume the entire bandwidth on the interconnects to the off-chip memory, leaving its siblings starved for data and stalled out. IBM and its partners had to design a special chunk of circuitry into Cell just to prevent that problem.

Apart from its raw power, Cell has content-protection tricks that should make it attractive to makers of multimedia applications. For instance, the Synergistic element's architecture prevents any application or external device from accessing the element's local memory, so a program cannot steal a music file that is being decrypted by the processor. "Once you bring your code in and decrypt it, it can execute in a virtually trusted environment," says IBM's Cell architect Charles R. Johns. "All the data it calculates on, sends out, and brings in is fully protected."

The isolation function can be used in several ways, says Kahle. "We knew we couldn't anticipate all the different security needs in the future, but we wanted to know we had the right hardware to support a very robust security system."

Barry Minor's Mount Saint Helens simulator is a good example of how Cell's different processors work together. His program takes a satellite photo of the volcano, lines it up with an elevation map, and then turns it into a detailed 3-D terrain on the fly. The Mount Saint Helens data has a resolution of 2.4 meters. The city of Austin, where the Cell design center is, once gave Minor access to its 15.4-centimeter-resolution satellite map. "You could land in Michael Dell's backyard and check out his view," Minor says with a grin.

What's happening inside the processor is a finely choreographed dance. The Power processing element starts by figuring out where the joystick is pointing the simulator in the stored 2-D maps. Then it divides that scene into 32 portions, four for each Synergistic element. Though perfectly capable of it, the Power processing element does no calculations on the actual data. Instead, it plays to its strength as a controller, figuring out which chunk of work should go to each of the other cores according to how complex the scene is and which cores have more or less time on their hands.
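The controller's job described above, handing out portions of the scene according to their complexity and each core's current load, can be sketched as a simple greedy scheduler. This is an illustration, not Minor's actual code: the function name `assign_tiles`, the per-tile cost estimates, and the least-loaded-first policy are all assumptions.

```python
import heapq


def assign_tiles(tile_costs, n_workers=8):
    """Greedy least-loaded assignment (a hypothetical policy).

    Returns one list of tile indices per worker.
    """
    heap = [(0.0, w) for w in range(n_workers)]  # (accumulated load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    # Hand out the most expensive tiles first so the loads even out.
    order = sorted(range(len(tile_costs)), key=lambda t: tile_costs[t], reverse=True)
    for tile in order:
        load, worker = heapq.heappop(heap)       # worker with the least work so far
        assignment[worker].append(tile)
        heapq.heappush(heap, (load + tile_costs[tile], worker))
    return assignment


# 32 portions with made-up complexity estimates, spread over eight workers.
costs = [1.0 + (i % 5) for i in range(32)]
plan = assign_tiles(costs)
for worker, tiles in enumerate(plan):
    print(worker, tiles, sum(costs[t] for t in tiles))
```

When every portion costs the same, this policy degenerates to the even split the article mentions, four portions per core; uneven costs shift work toward the cores with more time on their hands.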

The Synergistic elements then go to work. They pull their portion of the data into their local memories, which they can access at great speed. Then each runs a rendering algorithm on the data and stores it off the chip in the system memory. When the processors are done, they signal the Power element, which instructs one of the synergistic units to run a video compression algorithm. That processor compresses its sister units' finished products and then pushes them out to be displayed on the screen or streamed to a PDA or some other device.

Because the compression takes less time than rendering the graphics, the compressing processor automatically switches gears when it's finished and runs the rendering algorithm on a portion of data until it's needed for compression again. With each frame, the process starts over.

This dance works so well for two reasons. The first has to do with the way Cell handles memory. Rather than waste several clock cycles waiting for the right data to arrive from memory, a Synergistic element works only on data stored in its own 256 kilobytes of memory, to which it has a high-bandwidth connection. More important, Cell's memory-handling engines can be programmed to keep data streaming through the processor. "We can get over 128 memory transactions going in flight at once," boasts Michael N. Day, a distinguished engineer at IBM.

The memory-access engine takes in new data and sends out the old just in time for the synergistic unit to perform the necessary calculations. When Cell runs Minor's volcano simulator, it waits for data to arrive from memory for only 1 percent of the time; the G5, in contrast, stands idle for about 40 percent of the time.
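Why streaming data in just ahead of the computation cuts waiting so sharply can be seen in a toy timing model. The sketch below is not a measurement of Cell or the G5; the chunk count and the transfer and compute times are made-up units, and `stall_fraction` is a hypothetical helper.

```python
def stall_fraction(n_chunks, transfer, compute, overlapped):
    """Fraction of total time a core spends waiting for data (toy model)."""
    if overlapped:
        # Only the first transfer is fully exposed; later transfers hide
        # behind computation unless a transfer outlasts a compute step.
        stall = transfer + (n_chunks - 1) * max(0.0, transfer - compute)
    else:
        # Fetch, then compute, then fetch again: every transfer is a stall.
        stall = n_chunks * transfer
    return stall / (stall + n_chunks * compute)


# Made-up numbers: each chunk takes 1 unit to fetch and 4 units to process.
blocking = stall_fraction(10, transfer=1.0, compute=4.0, overlapped=False)
streaming = stall_fraction(10, transfer=1.0, compute=4.0, overlapped=True)
print(f"fetch-then-compute: {blocking:.1%} stalled")
print(f"overlapped streaming: {streaming:.1%} stalled")
```

In this model the overlapped case stalls only for the very first fetch, so its share of idle time shrinks as more chunks flow through.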

Cell's other key to speed has to do with breaking problems into parts that can be done in parallel. In Minor's simulation, it probably seems obvious that an image can be divided up into eight strips and these worked on independently. What wasn't so obvious was that the 3-D rendering could be done four pieces of data at a time within each synergistic processor. Such four-way parallel computing is called single instruction multiple data, or SIMD, and it is particularly well suited to the manipulation of graphics and other multimedia.

In these problems, you typically want to perform the same operation on each of the elements in a large chunk of data. For example, to increase the brightness of an image, you'd want to add the same number to every pixel in it. Since around the mid-1990s, general-purpose processors such as the Intel x86 architectures have been doing SIMD computing using a set of multimedia-specific instructions, explains Princeton's Lee, a multimedia instructions pioneer.
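The brightness example can be written out to show the shape of a four-wide SIMD loop. Plain Python does not execute real SIMD instructions, so this only models the idea: the group of four stands in for one vector register, and clamping results to the 0-255 range is an assumption about 8-bit pixels that the article does not mention.

```python
LANES = 4  # four pieces of data per instruction, as in the article


def brighten(pixels, delta):
    """Add the same number to every pixel, four pixels per step."""
    out = []
    for i in range(0, len(pixels), LANES):
        group = pixels[i:i + LANES]  # stands in for one vector register
        # Conceptually a single instruction applied to all lanes at once;
        # clamping to 0..255 is an assumption about 8-bit pixel values.
        out.extend(min(255, max(0, p + delta)) for p in group)
    return out


print(brighten([10, 100, 200, 250, 30], 20))  # prints [30, 120, 220, 255, 50]
```

On real hardware the inner step would be one instruction rather than a loop, which is where the speedup comes from.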

But SIMD instructions run far faster on Cell's Synergistic processors, because the Cell processors were designed from the start to handle them. And don't forget: there are eight such processors on each chip. Cell programmers spend most of their time turning complex algorithms into efficient SIMD algorithms, says Minor. "Once you've done that, you're 80 percent done."

