The RapidMind platform started as a language, called “Sh,”
that McCool developed for graphics processors. The
language grew out of an insight McCool had years
ago—that the massively parallel computing provided by a
graphics processor’s multiple cores can be used for
things other than rendering pixels.
Recent results bear this out. Researchers at
Hewlett-Packard, in Palo Alto, Calif., reported in
November that a graphics processor programmed with the
RapidMind platform executed an options-pricing program
called the “Black-Scholes benchmark” 32.2 times as fast
as a general-purpose CPU.
McCool, who exudes an endearingly geeky bravado and
who still teaches computer science at the University of
Waterloo, in Ontario, says putting the RapidMind
platform in the hands of the people who need it the most
was the best way to realize the full potential of his
research. “It wasn’t really about us making lots of
money—although that’s nice,” he says. “For me it was
about cool technology and using it in the real world
with real customers.”
So three years ago, he asked his research assistant
Stefanus Du Toit to use the Sh language to create a
programming platform for multicore processors. Together,
McCool and Du Toit founded the company that would become
known as RapidMind.
It took Du Toit about a year, but in the end, he and
McCool had something good enough to show Matthew
Monteyne, a former senior product manager with
Waterloo’s most famous technology company, Research in
Motion (RIM), maker of the BlackBerry wireless e-mail
device. Monteyne, now vice president of sales and
marketing at RapidMind, recruited his former boss, Ray
DePaul, director of BlackBerry product management, to
come on board as president and CEO. In McCool’s
prototype, they both sensed an unusual opportunity.
“There hasn’t been a revolution in processors and how
you program them since maybe object-oriented programming
in the early ’90s,” DePaul says.
The introduction of a disruptive technology like
multicore CPUs provides a great chance for small
companies to pounce. “You don’t come into mature
markets,” DePaul says. “You come in when there’s this
whirlwind of activity, and the big guys are too focused
on the current business that they can’t go after the new opportunity.”
McCool’s goal for a commercial product was simple enough:
“I wanted to build something that I could teach in about
10 minutes, that you could use without mental overhead
so you can focus on the algorithms, not the details of
the particular processor,” he explains.
Programmers need to focus on devising parallel
algorithms because RapidMind can’t write parallel
algorithms for them. No software can. While there has
been a lot of research into automatically parallelizing
applications for programmers, no such system has been
commercially viable. “People have been working on this
for 20 or 30 years, and it doesn’t look like it’s a
solvable problem,” McCool says.
That means programmers accustomed to writing serial
algorithms must learn how to think about parallel
algorithms. One of the benefits of working with the
RapidMind platform is that users become familiar with a
conceptual model of a parallel machine. “It’s similar
enough to a real parallel machine that you can reason
about what is an efficient way to implement an
algorithm,” McCool says.
To write an application using RapidMind, the
programmer first identifies the components to
accelerate. These tend to be the numerically intensive
operations. For instance, a chip running a game might
spend a lot of time computing physical interactions
between hundreds of thousands of objects, computations
that would speed up tremendously if done in parallel.
That’s in contrast to trivial operations such as
tabulating the player’s score or processing input from a
game-controller button or joystick.
The RapidMind platform is designed to be incorporated
into any program written in C++, one of the most widely
used programming languages in the world. Programmers
write their programs in C++, using their favorite C++
editing and debugging programs, of which there are
hundreds. Next they select the portion of the program to
be accelerated and formulate the necessary parallel
algorithms. Then they write code that expresses those algorithms.
Several features make the task easier. Like any modern
high-level programming language, C++ has a library of
commonly needed subroutines and functions, simplifying
life for programmers. When they need one of those
functions, such as sorting a set of numbers, they merely
insert a call to it by name in their program.
However, while working with the RapidMind platform,
instead of writing code using ordinary C++ terms that
refer to subroutines and functions in a C++ library, the
programmer uses words from RapidMind’s vocabulary that
refer to subroutines and functions stored in the
RapidMind library. These words call up subroutines and
functions that execute in parallel. The programmer must
specify the data sets that will be operated on in
parallel, but the subroutines take it from there.
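To make the idea concrete, here is a minimal sketch in standard C++ of what such a library routine might do behind the scenes. It is not RapidMind's API: the name `parallel_apply` and its parameters are hypothetical, and ordinary CPU threads stand in for whatever cores the real platform would target. The caller supplies only the data set and the operation; the routine decides how to split the work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (an assumption for illustration, not RapidMind's API):
// applies `op` to every element of `data`, giving each worker thread one
// contiguous slice of the array.
std::vector<float> parallel_apply(const std::vector<float>& data,
                                  const std::function<float(float)>& op,
                                  unsigned workers) {
    assert(workers > 0);
    std::vector<float> out(data.size());
    std::vector<std::thread> pool;
    const std::size_t n = data.size();
    const std::size_t slice = (n + workers - 1) / workers;  // ceiling division
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * slice);
        const std::size_t end = std::min(n, begin + slice);
        if (begin == end) break;  // no elements left for this worker
        // Each thread writes only to its own slice of `out`, so no locking is needed.
        pool.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i) out[i] = op(data[i]);
        });
    }
    for (auto& t : pool) t.join();
    return out;
}
```

A caller would write something like `parallel_apply(prices, discount, 4)`, where `prices` is a `std::vector<float>` and `discount` is any function from `float` to `float`. Nothing in the call mentions threads or slices.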
Programmers don’t need to know any of the specifics of
the chip their software will run on. When the program
starts up, the RapidMind platform determines whether it
is running on a graphics processor, a Cell, or
something else and translates the code that the
programmer has written into code that the particular
chip understands.
At the same time, the platform breaks up arrays of
data into chunks that get doled out to however many
cores are available on the target chip. The more cores,
the more finely the chunks are chopped. To ensure that
each core is working on something all the time, the
system assigns data and tasks to cores on the fly,
depending on which ones signal that they are free for
the next piece of work. So, for example, while one core
is churning through an especially complicated operation
for a long time, its fellow cores can be kept busy with
lots of simpler operations.
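One common way to get this kind of on-the-fly assignment is a shared counter from which each idle worker claims the next chunk. The sketch below shows that technique in standard C++. It is an assumption about how such a scheduler could work, not a description of RapidMind's internals, and `run_dynamic` and `process_chunk` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of dynamic load balancing: every worker repeatedly
// claims the next unclaimed chunk index until none remain.
// Returns how many chunks each worker ended up processing.
std::vector<std::size_t> run_dynamic(
        std::size_t num_chunks, unsigned workers,
        const std::function<void(std::size_t)>& process_chunk) {
    assert(workers > 0);
    std::atomic<std::size_t> next{0};           // index of the next unclaimed chunk
    std::vector<std::size_t> done(workers, 0);  // per-worker tally
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            for (;;) {
                // fetch_add hands each chunk index to exactly one worker.
                const std::size_t chunk = next.fetch_add(1);
                if (chunk >= num_chunks) return;
                process_chunk(chunk);
                ++done[w];  // each worker touches only its own slot
            }
        });
    }
    for (auto& t : pool) t.join();
    return done;
}
```

A worker stuck on a slow chunk simply claims fewer chunks, while the others keep drawing from the counter. The per-worker tallies therefore vary from run to run, but they always sum to the number of chunks.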
Without such dynamic load balancing, computationally
intensive applications, including real-time ray tracing,
are extraordinarily difficult to pull off. Real-time ray
tracing is a technique that models the paths and effects
of light as it interacts with various surfaces.
Typically, millions of rays hit dozens or hundreds of
objects, where the rays can be absorbed, reflected, or
refracted. Of course, most of the rays miss the objects
and keep going—events that McCool calls cheap operations
because the path they trace remains the same. The
expensive calculations, the ones that must be performed
to determine the trajectory of a ray when it hits a drop
of water, say, can require 100 times as much work on the
processor’s part.
Because the RapidMind platform can dynamically
allocate both cheap and expensive tasks, the ray-tracing
application can take full advantage of the power of
parallel processing to execute in real time. That’s
because there are so many pixels whose color and shading
need to be determined at any one time that all the
processors can be occupied with computational tasks.
Compare that to, say, an Intel Xeon dual-core chip
running an operating system, a Web browser, and some
desktop applications. Its two cores might sit idle
half the time waiting for something to do. The RapidMind
platform strives to ensure that no core—or clock cycle,
for that matter—goes to waste.
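The payoff from dynamic allocation can be shown with a small worked comparison. The sketch below uses no real threads. It just adds up costs to find when the last core would finish under two policies: splitting the rays into equal blocks ahead of time, or handing each ray to whichever core frees up first. The function names and the sample workload are assumptions for illustration; only the 100-to-1 cost ratio comes from the ray-tracing description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Finish time when tasks are pre-split into equal contiguous blocks, one per core.
long static_makespan(const std::vector<long>& cost, std::size_t cores) {
    assert(cores > 0);
    const std::size_t block = (cost.size() + cores - 1) / cores;  // ceiling division
    long worst = 0;
    for (std::size_t begin = 0; begin < cost.size(); begin += block) {
        const std::size_t end = std::min(cost.size(), begin + block);
        long sum = 0;
        for (std::size_t i = begin; i < end; ++i) sum += cost[i];
        worst = std::max(worst, sum);  // the slowest core sets the finish time
    }
    return worst;
}

// Finish time when each task, taken in order, goes to the core that frees up first.
long dynamic_makespan(const std::vector<long>& cost, std::size_t cores) {
    assert(cores > 0);
    std::vector<long> busy_until(cores, 0);
    for (long c : cost) {
        auto first_free = std::min_element(busy_until.begin(), busy_until.end());
        *first_free += c;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

Take eight rays on four cores, where the first two rays hit an object (cost 100 each) and the other six miss (cost 1 each). `static_makespan({100, 100, 1, 1, 1, 1, 1, 1}, 4)` returns 200, because one core is handed both expensive rays. `dynamic_makespan` on the same input returns 100, because the expensive rays land on separate cores while the remaining two cores clear the cheap ones.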