Search engines are our key to access information on
the Web. Without search engines, we would easily become
lost in cyberspace (as in the early days of the Web), so
it is not surprising to see how heavily we rely on
search engines as our information gateways. According to
the Search Engine Round Table blog, Jay McCarthy,
vice president of Web server log analysis company
Websidestory, announced at the 2005 Search Engine
Strategies Conference in Toronto that page referrals
from search engines have surpassed those from
other pages. This means that people navigate the Web by
searching more than by browsing.
The question of search engine bias then becomes a
crucial one. What if search engines showed only certain
types of information, or preferred certain sources?
Imagine, for example, submitting the query 'abortion' and
finding only pro-life (or only pro-choice) sites in the
first screen of hits. There are many kinds of potential
bias—linguistic, political, cultural, commercial, and
so on. The issue of bias resonates in the public debate
on our growing dependence on search engines and on their
social impact as gatekeepers of information. Is an
information monopoly developing the same way as the
software monopoly of the recent past? Is Google the next
Microsoft? If search engines are the lens through which
we see the world, transparency is a major concern, and
any bias gets in the way. Our worries are heightened
because search engines are secretive about their
algorithms and, thus, their biases are difficult to detect.
In the midst of this debate, one kind of bias that
has received much attention among technologists, as well
as social and political scientists, is that in favor of
"popular" sites. This stems from the PageRank algorithm,
introduced by the Google founders in 1998. All major
search engines today use similar techniques to identify
important or prestigious pages and bubble them to the
top of the results. To a first approximation, PageRank
attributes importance in proportion to the number of
links that a page receives from other sites. The
algorithm is a bit more sophisticated than that, but
this approximation turns out to be pretty good on
average (see http://arXiv.org/cs.IR/0511016).
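To make the first approximation concrete, here is a minimal sketch that compares raw incoming-link counts with a textbook PageRank computation on a tiny graph. The four-page graph, the function names, the damping value of 0.85, and the iteration count are all assumptions chosen for illustration; this is not the code of any actual search engine.

```python
# Hypothetical toy Web: each page maps to the list of pages it links to.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def in_degree(links):
    """Count the incoming links of every page (the 'first approximation')."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts

def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by repeated redistribution of rank along links.

    damping and iterations are illustrative assumptions.
    """
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p, targets in links.items() if not targets)
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in links
        }
        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

if __name__ == "__main__":
    counts = in_degree(toy_web)
    ranks = pagerank(toy_web)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, counts[page], round(ranks[page], 4))
```

In this toy graph the page with the most incoming links also receives the highest PageRank, in line with the approximation. Pages "a" and "b" each have a single incoming link, yet "a" ranks higher because its link comes from the top page; that is the sense in which the algorithm is more sophisticated than simple link counting.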
The notion of prestige based on link popularity is a
proxy for other possible importance measures, such as
traffic, expert judgment, and so on. Most people would
agree that the use of prestige measures in ranking
search results is a very good thing—indeed, it's the
main reason why search engines work so well and have
become so popular. Moreover, PageRank is designed to
mimic the browsing behavior of Web users. In the absence
of better assumptions, we imagine that people follow
links at random. PageRank then estimates the traffic
through each site. It seems, therefore, to be just the
right criterion for ranking sites. Why worry, then?
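The random-browsing picture can be sketched as a small simulation: a surfer follows links at random, and we record the fraction of visits each page receives as a stand-in for traffic. The toy graph, the function name, the fixed seed, and the number of steps are hypothetical. The occasional jump to a random page (with probability 1 minus the damping value) is an assumption borrowed from the standard formulation of PageRank, not something described in the text above.

```python
import random

# Hypothetical toy Web: each page maps to the list of pages it links to.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def simulate_surfer(links, steps=100_000, damping=0.85, seed=0):
    """Estimate the traffic through each page as its fraction of all visits."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    pages = list(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = links[current]
        if targets and rng.random() < damping:
            # Follow one of the current page's links, chosen at random.
            current = rng.choice(targets)
        else:
            # Assumed behavior: jump to a page chosen uniformly at random.
            current = rng.choice(pages)
    return {page: count / steps for page, count in visits.items()}

if __name__ == "__main__":
    traffic = simulate_surfer(toy_web)
    for page in sorted(traffic, key=traffic.get, reverse=True):
        print(page, round(traffic[page], 3))
```

Pages that attract many links, directly or through other well-linked pages, collect most of the simulated visits, while a page that nobody links to (here "d") is reached only through random jumps.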
To understand the potential danger of popularity
bias, let us envision a scenario in which people search
for information about the minollo (an imaginary
animal). Imagine that there is an
established site called minollo-recipes.com about the
minollo and its culinary qualities. Further imagine a
newly developed site called save-the-minollo.org that
holds the view that the minollo is an endangered species
and should no longer be hunted. Now, suppose a
student is assigned the homework of creating a Web page
with a report on the minollo. The student will submit
the query 'minollo' to a search engine and, for lack of
time, browse only the top ten hits. Let's say that
minollo-recipes.com is the fifth hit, while
save-the-minollo.org is ranked 15th. The student will
read the established site and write her report on
minollo recipes. She will not read about the possible
endangered status of the minollo. She will also
diligently cite her source by adding a link from her new
page to minollo-recipes.com.