Google News, arguably the leading site when it comes to personalized
news, goes several steps further, automating things as
much as possible. For example, it uses a technique
called implicit personalization to recommend different
content to each reader, based on the reader’s past
behavior. It’s an innovation that suggests a way forward
for the news business. But first, consider how Google
News accomplishes two other seemingly simple automation
chores: ranking and clustering stories.
Google is, of course, famous for its method of ranking
search results. In the case of news, it forms an
understanding of which stories are generally the most
interesting and important and continually updates a
reader’s personal home page with that in mind. Google
News collects millions of articles from thousands of
sources, so it would be out of the question to use a
staff of editors to lay out the front page, as most news
sites do. Krishna Bharat, who led the development of
Google News, says that its algorithm ranks stories
according to the authority of the news source, the
timeliness of the article, whether the article is an
original piece, where the article was originally placed
by the editors on the source Web site, the apparent
scope and impact on readers, and the popularity of the article.
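Bharat lists the signals but not how they are combined, so the following is only a minimal sketch of one plausible approach: a weighted sum. The weights, the `Story` class, the signal names, and the assumption that every signal is already normalized to the range 0 to 1 are all hypothetical and are not taken from Google News.

```python
from dataclasses import dataclass

# Hypothetical weights: the article names these signals but does not say
# how Google News weighs or combines them.
WEIGHTS = {
    "authority": 0.25,    # authority of the news source
    "timeliness": 0.25,   # how recent the article is
    "originality": 0.15,  # original piece vs. republished copy
    "placement": 0.10,    # prominence on the source Web site
    "impact": 0.15,       # apparent scope and impact on readers
    "popularity": 0.10,   # popularity of the article
}

@dataclass
class Story:
    title: str
    signals: dict  # assumed: each signal pre-normalized to [0, 1]

def score(story):
    """Weighted sum of the signals; a missing signal counts as 0."""
    return sum(w * story.signals.get(name, 0.0) for name, w in WEIGHTS.items())

def rank(stories):
    """Return the stories ordered from highest to lowest score."""
    return sorted(stories, key=score, reverse=True)

stories = [
    Story("Wire rewrite", {"authority": 0.9, "timeliness": 0.9, "originality": 0.0,
                           "placement": 0.3, "impact": 0.5, "popularity": 0.4}),
    Story("Original scoop", {"authority": 0.8, "timeliness": 0.9, "originality": 1.0,
                             "placement": 1.0, "impact": 0.7, "popularity": 0.6}),
    Story("Old feature", {"authority": 0.6, "timeliness": 0.1, "originality": 1.0,
                          "placement": 0.5, "impact": 0.3, "popularity": 0.2}),
]
front_page = [s.title for s in rank(stories)]
```

With these made-up numbers, the original, prominently placed story outranks a fresher rewrite from a slightly more authoritative source. A production system would presumably learn or tune such weights rather than fix them by hand.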
Google News also clusters stories on the same news
event. Clustering gives readers the benefit of
diversity, which is particularly useful to readers of
international news. For example, a French paper might
take a pro-farmer stance when covering a trade dispute over
European Union farming subsidies, while a British
newspaper might have a very different view. Another
advantage of clustering is that it can either eliminate
or call explicit attention to duplicate articles, such
as when two newspapers run the same Associated Press
wire story.
But the task of clustering news stories on the same
event encompasses several subchores, some of them fairly
difficult. One of them is simply defining what we mean
by “same event”—an ill-defined, surprisingly hard
problem. For instance, stories about the escape of a
tiger from the San Francisco Zoo last December included
articles on how the animal may have gotten free, how it
killed a visitor, how it mauled two other people, and how it
was itself killed by police officers. Are they all the
same event?
Google News tackles this problem by using a technique
called hierarchical agglomerative clustering. Basically,
it puts news articles with similar phrasing together
into distinct piles. It starts by analyzing the content
of articles to find those that share keywords or key
phrases; articles that have enough language in common
are assumed to be covering similar topics. The articles
in each pile are connected based on the strength of
their similarity. To visualize these connections,
imagine a treelike structure where the articles are the
leaves. If we grab a branch from the tree, the many
leaves on that branch are all similar articles—that is,
articles about the same general event. Thus a group of
leaves near one another on a branch of the tree
constitutes a cluster.
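To make the idea concrete, here is a toy sketch of hierarchical agglomerative clustering. It is not Google's implementation. It assumes that similarity is the Jaccard overlap of keyword sets, that two piles are compared by their average pairwise similarity (average linkage), and that merging stops at a fixed threshold of 0.2; the stopword list and function names are also illustrative choices. Each merge corresponds to joining two branches of the tree, and stopping at the threshold is equivalent to cutting the tree at that height.

```python
import re
from itertools import combinations

# A tiny illustrative stopword list; a real system would use a far larger one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "by", "to", "from", "and"}

def keywords(text):
    """Lowercased words of the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Shared keywords divided by all keywords in either set."""
    return len(a & b) / len(a | b) if a | b else 0.0

def average_similarity(c1, c2, kw):
    """Average-linkage similarity between two clusters of article indices."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(kw[i], kw[j]) for i, j in pairs) / len(pairs)

def cluster(texts, threshold=0.2):
    """Merge the two most similar clusters until none reaches the threshold."""
    kw = [keywords(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # every article starts alone
    while len(clusters) > 1:
        best_sim, a, b = max(
            (average_similarity(clusters[a], clusters[b], kw), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break  # the remaining piles are too dissimilar to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b
    return [sorted(c) for c in clusters]

headlines = [
    "Tiger escapes enclosure at San Francisco Zoo",
    "San Francisco Zoo tiger kills visitor after escape",
    "EU farming subsidies spark trade dispute",
    "France defends farming subsidies in EU trade dispute",
]
groups = cluster(headlines)
```

On these four headlines, the two tiger stories end up in one pile and the two subsidy stories in another. Raising the threshold to 0.5 keeps the subsidy stories together but leaves the two tiger stories separate, which is one way the "same event" question turns into a tuning decision.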
This tree is constantly changing. As more and more
stories accrue on a general event, the threshold for
determining whether any two of those stories are about
the same aspect of that event becomes higher. The
clusters may shift, with articles moving to new groups and
existing groups splitting or combining.
The groupings adapt to the news available, which is
always changing.
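The article does not say how the threshold rises with coverage. As a purely hypothetical illustration, the sketch below raises the required similarity by a fixed step for each tenfold increase in the number of stories on an event. The function names, the logarithmic rule, and every constant are assumptions made for this example.

```python
import math

def similarity_threshold(n_stories, base=0.2, step=0.1, cap=0.8):
    """Hypothetical rule: start at `base`, add `step` for every tenfold
    increase in stories on the event, and never exceed `cap`."""
    return min(cap, base + step * math.log10(max(n_stories, 1)))

def same_aspect(similarity, n_stories):
    """True if two stories are similar enough to stay in the same cluster,
    given how many stories the general event has attracted so far."""
    return similarity >= similarity_threshold(n_stories)

# The same pair of stories, with the same similarity, at two points in time.
early = same_aspect(0.35, n_stories=5)    # few stories: the bar is low
later = same_aspect(0.35, n_stories=500)  # heavy coverage: the bar is higher
```

Under these assumed numbers, two stories with a similarity of 0.35 share a cluster while the event has only five stories, but are separated once coverage reaches 500 stories, which mirrors the splitting behavior described above.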
If the ideal result is a newspaper featuring the news
you want to see, these clustering and ranking strategies
can take you only so far. They can determine whether a
new development in a story you’ve been following is
something that might interest you. But they can’t make a
logical leap—for example, recognizing from your previous
interest in articles on the search for extraterrestrial
intelligence that you would be fascinated by the
discovery of an Earth-like planet in another solar system.