Astronomers are drowning in data. To take one example, the
ambitious Sloan Digital Sky Survey is using ground-based
telescopes equipped with digital cameras to record a
quarter of the sky with unprecedented accuracy and
depth; its latest release of images and related data
totaled six terabytes. The newest catalog of star
positions, published by the U.S. Naval Observatory in
Washington, D.C., contains more than a billion stars,
while a single observation from the Hubble Space
Telescope can easily swallow several gigabytes. In the
next few years, as the archives from these and other
instruments continue to swell and a number of new
large-scale projects come online, the total amount of
data is expected to double every year or two. Not
surprisingly, astronomers are already worried about how
to find what they need. Similarly crushing tides of data
await lots of other people in other fields, whether they
work in multinational corporations or in large,
geographically dispersed research projects.
This abundance of astronomy data is actually fairly
new. Decades ago, observations were made for a project
and then thrown away—if you wanted to ask another
question about the same star or galaxy, you went back to
the telescope. The launch of the first space telescopes
in the 1970s, such as the Einstein Observatory and the
International Ultraviolet Explorer, changed that. Their
high cost convinced researchers, and funders, that the
data were too precious to lose.
In the process, archival astronomy was born.
Researchers quickly learned that data gathered for one
star could be reused to study other stars; hundreds of
papers were generated in this way. Nowadays,
astronomers, more than other scientists, tend to share
their data. Those who observe on a NASA space facility,
for example, get exclusive access to their data for only
one year—after that, anyone can download it and look
for discoveries that the original observer may have
missed. [For a glimpse of how modern astronomy is done,
and how the VO might help, see box.]
To take the next step and make any data set available
to any astronomer anywhere in the world will mean
solving a number of major challenges. These include
resolving differences in data format, defining a query
language for accessing the VO, creating the
computational infrastructure, figuring out how to keep
the VO up to date as new data sets are created, and, of
course, getting all the players to agree on the many
software standards and protocols.
Astronomical data take many forms, depending on the instrument
that collected them and the format and medium they were
stored in. Some records aren't digitized; a lot of radio
astronomy data, for example, are still on analog
nine-track tape. Some smaller observatories don't even
archive their data; instead, researchers take home
whatever raw data they collect. VO collaborators hope
that as the virtual observatory comes online and begins
to prove its worth, the data laggards will devote the
resources necessary to create or upgrade their databases.
Further complicating things is that different archives
can refer to the same object by different names. The
International Astronomical Union, based in Paris,
oversees the naming of celestial objects (and no, you
can't pay to have a star named after you), but that
doesn't prevent other unsanctioned designations from
popping up. So when comparing astronomical catalogs
covering two different wave bands, researchers typically
must also check an object's position. Such
double-checking fails, however, if the object in
question is visible (and therefore recorded) only at one
of the wavelengths.
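
The positional check described above can be sketched in a few lines of Python. This is a minimal illustration, not the method any particular archive uses: the catalog names, coordinates, and the 2-arcsecond tolerance are all made up for the example, and a real cross-match would also account for each catalog's positional uncertainty.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_match(catalog_a, catalog_b, tolerance_arcsec=2.0):
    """Pair each entry of catalog_a with the nearest catalog_b entry inside the tolerance."""
    matches, unmatched = [], []
    for name_a, ra_a, dec_a in catalog_a:
        best = None
        for name_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b) * 3600.0
            if sep <= tolerance_arcsec and (best is None or sep < best[1]):
                best = (name_b, sep)
        if best is not None:
            matches.append((name_a, best[0]))
        else:
            unmatched.append(name_a)
    return matches, unmatched

# Hypothetical entries: (designation, RA in degrees, Dec in degrees).
optical = [("OPT-1", 150.0000, 2.0000), ("OPT-2", 10.5, -30.0)]
xray = [("XR-A", 150.0002, 2.0001)]

matches, unmatched = cross_match(optical, xray)
print(matches)    # objects found in both catalogs under different names
print(unmatched)  # objects seen in only one wave band
```

The `unmatched` list is the failure case the text mentions: an object recorded in only one wave band has no counterpart to check against.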
Tracking down data sets, which can take weeks or even
months, became somewhat more straightforward in 1996,
with the creation of the online SIMBAD service. Run by
the Stellar Data Center, SIMBAD lets researchers call up
a list of papers that cite a celestial object, plus its
other names, its position, and a few other numbers. What
SIMBAD doesn't tell you is where the data are archived;
nor does it return actual data with which you can do
actual science.
For that, you need the VO. With it, a user in Chicago
will be able to sit at her computer, type in a data
request—say, all the brightness information at all
different colors of the spectrum for quasar
PG1407+265—and then wait for the data to come in.
Behind the scenes, her query may be processed by a Web
portal at Caltech, which in turn searches several
archives, including one based in Strasbourg that lists
star locations and another in Cambridge, Mass., that
knows the stars' X-ray intensities. After the searches,
the Web portal gathers up all the results and replies to
the user.
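
The fan-out-and-gather pattern in that scenario might look roughly like the following sketch. Everything here is hypothetical: the two archive functions stand in for remote services that a real portal would reach over the network, and the object name and values are placeholders rather than real measurements.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote archives; a real portal would make network calls.
def position_archive(name):
    table = {"DEMO-QSO": {"ra_deg": 100.0, "dec_deg": 20.0}}  # placeholder values
    return table.get(name, {})

def xray_archive(name):
    table = {"DEMO-QSO": {"xray_flux": 1.0e-12}}  # placeholder value
    return table.get(name, {})

ARCHIVES = {"positions": position_archive, "xray": xray_archive}

def portal_query(name):
    """Send the same request to every archive in parallel and merge the replies."""
    with ThreadPoolExecutor() as pool:
        futures = {label: pool.submit(fn, name) for label, fn in ARCHIVES.items()}
        return {label: future.result() for label, future in futures.items()}

print(portal_query("DEMO-QSO"))
```

The user sees only the merged reply; which archives were consulted, and where they are, stays behind the portal.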
As this scenario suggests, the Virtual Observatory is
a distributed system, much like the Internet itself. To
link its disparate parts—to "federate" them, as
astronomers say—the VO is being built around
registries. A registry is basically an online catalog of
what is in each archive, indexing the virtual sky by
position and wave band; it is continually updated to
incorporate new data and new archives. Functionally, a
VO registry is like the domain name servers that point
to things on the Internet. Prototype VO registries are
already running at Johns Hopkins, the University of
Illinois, and Caltech. The Data Inventory Service,
created by Thomas McGlynn and colleagues at NASA Goddard
Space Flight Center, calls on these registries (and
eventually others) to locate data based on an object's
position or name (see http://heasarc.gsfc.nasa.gov/vo/).
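
As a rough picture of what such a registry does, here is a toy version that indexes archives by sky coverage and wave band. The class names, archive names, and URLs are invented for illustration, and the sketch assumes each archive covers a simple rectangle in right ascension and declination that does not wrap past 360 degrees; the actual VO registries are not described in this much detail here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchiveEntry:
    name: str
    url: str        # placeholder endpoint, not a real service
    band: str
    ra_min: float   # coverage limits in degrees
    ra_max: float
    dec_min: float
    dec_max: float

    def covers(self, ra, dec):
        return self.ra_min <= ra <= self.ra_max and self.dec_min <= dec <= self.dec_max

class Registry:
    """Toy catalog of archives, searchable by position and wave band."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        # New archives can be added at any time, keeping the registry current.
        self._entries.append(entry)

    def find(self, ra, dec, band=None):
        return [e.name for e in self._entries
                if e.covers(ra, dec) and (band is None or e.band == band)]

registry = Registry()
registry.register(ArchiveEntry("optical-survey", "https://archive-a.example.org",
                               "optical", 120.0, 250.0, -5.0, 60.0))
registry.register(ArchiveEntry("xray-pointings", "https://archive-b.example.org",
                               "x-ray", 0.0, 360.0, -90.0, 90.0))

print(registry.find(212.0, 26.0))                # every archive covering that position
print(registry.find(212.0, 26.0, band="x-ray"))  # restricted to one wave band
```

Like a domain name server, the registry holds no data itself; it only answers the question of where to look.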
The VO registries will also point to Web-based
programs, known as Web services, which will allow data
from those archives to be processed. Astronomers already
use various software tools for analyzing and filtering
their data, but such programs are designed to run on
local workstations, using locally stored data. A Web
service, by contrast, is accessed through the Internet,
and the user may not even know it is running. The VOStat
service, for example, lets users run many types of
statistical routines on their data; the user doesn't
need to worry about having the latest statistical software.
Sifting through these disparate databases is eased by
past attempts at data standardization, such as the
Flexible Image Transport System (FITS) format. Invented
by radio astronomers back in the 1970s to exchange data
on magnetic tape, FITS has since been widely adopted by
other astronomers, and FITS files can now be read by
almost all astronomical software.
But FITS typically can't be read by mainstream
software. The VO team therefore plans to supplement FITS
files with eXtensible Markup Language descriptions of
the data. Although XML is fast becoming the common text
format for exchanging a wide variety of data on the Web
and elsewhere, astronomers have been relatively late to
embrace it. The VO's first use of XML is the VOTable
format, developed by groups at the Stellar Data Center
and Caltech, for exchanging tables and star catalogs.
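
To show why an XML table is easy for mainstream software to read, the sketch below parses a small VOTable-style document with nothing but Python's standard library. The element names follow the general VOTable layout (FIELD definitions followed by rows of TD cells), but the document is simplified: it omits the namespace and version attributes a real file would carry, and the column names and values are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up table in the VOTable style.
VOTABLE_TEXT = """<VOTABLE>
  <RESOURCE>
    <TABLE name="demo">
      <FIELD name="RA" datatype="double" unit="deg"/>
      <FIELD name="DEC" datatype="double" unit="deg"/>
      <FIELD name="Vmag" datatype="float" unit="mag"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.5</TD><TD>-30.0</TD><TD>12.3</TD></TR>
          <TR><TD>150.0</TD><TD>2.0</TD><TD>15.1</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

def read_table(text):
    """Return the table rows as dictionaries keyed by column name."""
    root = ET.fromstring(text)
    table = root.find("./RESOURCE/TABLE")
    names = [field.get("name") for field in table.findall("FIELD")]
    rows = []
    for tr in table.findall("./DATA/TABLEDATA/TR"):
        values = [float(td.text) for td in tr.findall("TD")]
        rows.append(dict(zip(names, values)))
    return rows

for row in read_table(VOTABLE_TEXT):
    print(row)
```

Because the column descriptions travel with the data, a generic XML parser is enough to recover both; no astronomy-specific reader is required for this simple case.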
Another problem with FITS is that it allows each group
to make up the keywords that describe what the file
contains; uninitiated astronomers have no way of
deciphering these custom keywords. What's needed is a
precise and universal vocabulary. For example, if I ask
the VO for data about photon frequencies, I don't want
data about stellar pulsation frequencies. The UCD (or
Unified Content Descriptor), invented by the
star-catalog experts at Strasbourg, is a first cut at
defining such an unambiguous vocabulary. Initially, the
VO will use UCDs to augment FITS keywords; eventually,
they could become the sole means of describing a file's contents.
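
A small sketch can make the vocabulary problem concrete. Below, two hypothetical archives use the same custom keyword for different quantities, and a shared descriptor disambiguates them. The archive names, keywords, and descriptor strings are illustrative assumptions modeled loosely on the UCD style; they have not been checked against the official UCD list.

```python
# Illustrative mapping from (archive, custom keyword) to a shared descriptor.
# All names and descriptor strings here are assumptions for the example.
KEYWORD_TO_UCD = {
    ("survey_a", "FREQ"): "em.freq",        # photon frequency
    ("survey_b", "FREQ"): "src.var.pulse",  # stellar pulsation frequency
    ("survey_b", "NU_OBS"): "em.freq",      # photon frequency under another keyword
}

def find_columns(ucd):
    """List every (archive, keyword) pair whose descriptor matches the request."""
    return sorted(key for key, value in KEYWORD_TO_UCD.items() if value == ucd)

print(find_columns("em.freq"))
```

A search on the keyword FREQ alone would return both photon and pulsation frequencies; a search on the descriptor returns only the columns that mean what the user asked for, whatever each group chose to call them.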
Many of the VO astronomers pride themselves on their
computer savvy. Even so, they can be overwhelmed by the
latest software techniques and jargon. VO computer
scientists are equally lost in the zoo of celestial
fauna, which includes such exotica as exoplanets,
magnetars, and superclusters, to say nothing of the
arcana of astronomical instrumentation. An "object" to
astronomers may be an enormous physical thing in outer
space, but to computer scientists familiar with
"object-oriented design" it is an abstraction describing
a concept in software.
A more serious ongoing debate revolves around how much
and what kinds of new computing techniques to
incorporate into the VO; the computer scientists lean
toward the most cutting-edge technologies, whereas the
astronomers worry whether the new ways will be useful
and stable in the long run.
One such argument involves the use of so-called
virtual data. At present, most astronomy archives store
their data as calibrated images; the calibration takes
into account deviations introduced by the
instrumentation—hot pixels on the charge-coupled device
(CCD) camera chip, say. With a virtual data system,
archives would store only raw, uncalibrated data; each
time a user asked for a particular image, the data
would be processed and calibrated, and the image created
on the fly. Virtual data have the advantage of taking up
less storage space and being easier to archive. On the
other hand, such a system is fragile—a hardware change
or failure may render the software unable to process the
data, leaving the user with no means to generate images
at all.
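
The virtual data idea can be sketched as follows. This is a deliberately simplified assumption of what calibration involves: it subtracts a dark frame (which captures hot pixels) and divides by a flat field, two common CCD corrections, whereas a real pipeline has many more steps. The image identifier and pixel values are invented.

```python
from functools import lru_cache

# Only raw frames and calibration frames are stored; all values are made up.
RAW = {"img-001": ((110, 205), (58, 5000))}  # raw counts; 5000 is a hot pixel
DARK = ((10, 5), (8, 4900))                  # dark frame, including the hot pixel
FLAT = ((1.0, 2.0), (0.5, 1.0))              # relative pixel sensitivity

@lru_cache(maxsize=32)
def calibrated_image(image_id):
    """Build the calibrated image on request instead of storing it."""
    raw = RAW[image_id]
    return tuple(
        tuple((r - d) / f for r, d, f in zip(raw_row, dark_row, flat_row))
        for raw_row, dark_row, flat_row in zip(raw, DARK, FLAT)
    )

print(calibrated_image("img-001"))
```

The fragility described above shows up directly: if `calibrated_image` or the frames it depends on stop working, nothing usable remains in the archive, because the calibrated image was never stored. The cache only softens the cost of repeated requests.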