Concepts in Computing
CS4 - Winter 2007
Instructor: Fabio Pellacini

Lecture 17: Web Search

Overview

  • What's out there?
  • Which documents are relevant to this query?
  • How should they be ranked?
  • How should they be organized?

Web crawling

The basic idea is to have a bunch of computers surfing the web on their own, keeping track of the pages they find and the links between the pages. How is that possible? As we've seen, when we GET a web page, the server just sends back the text, which happens to have HTML tags in it. The text can be read into a string (just like getting the value of a textarea). Then the links can be extracted by looking for parts of the string that say <a href="...">, and we can go off and GET those (a rough sketch of this loop follows the list below). Some issues:

  • Relative links must be made fully specified
  • Don't want to keep revisiting the same page, linked to from different places
  • Don't want to sit around idle, waiting for a GET to be handled
  • Want DNS lookup to be fast
  • Have to store the pages (and/or a compact representation of their main information) when they arrive
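
Here is a rough sketch of that crawling loop (in Python, just to make it concrete; the fetching and link-extraction details are simplified, not how any production crawler actually works):

    # A toy crawler: fetch a page, pull out its links, and queue the new ones.
    # Real crawlers need far more care (politeness, robots.txt, parallel fetches,
    # storing compact summaries of pages, etc.).
    import re
    import urllib.request
    from urllib.parse import urljoin
    from collections import deque

    def crawl(start_url, max_pages=10):
        seen = {start_url}              # don't revisit the same page
        queue = deque([start_url])      # pages waiting to be fetched
        pages = {}                      # url -> page text
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                text = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            except Exception:
                continue                # skip pages that fail to load
            pages[url] = text
            # look for the href="..." parts of <a> tags
            for link in re.findall(r'<a\s[^>]*href="([^"]+)"', text, re.IGNORECASE):
                full = urljoin(url, link)   # make relative links fully specified
                if full not in seen:
                    seen.add(full)
                    queue.append(full)
        return pages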

Information retrieval

Information retrieval has been around a long time, with systems for looking up books, journal articles, etc. The web, by its sheer size and diversity, made the problems harder, but some of the basic ideas are the same. Why do we say that one document is "like" another, or that a document is "relevant" to a particular search query?

One approach is to classify each document under predefined categories (this document is about college football; that one is about Beethoven). That's exactly what early web indices (e.g., early Yahoo!) did, with the help of human annotation. While this isn't exactly the way it was done, you could imagine the web being represented with one array containing all the various words (from "aardvark" to "zebra", or whatever), and a separate array for each such word, listing all the web pages with that word.
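
To make that picture concrete, here is a small sketch (Python; the words and page lists are invented). Keeping the lexicon sorted lets us find a word by binary search, which is what the aside below picks up on:

    # A toy index: a sorted lexicon plus, for each word, the list of pages
    # containing it.  The words and page names are made-up examples.
    from bisect import bisect_left

    lexicon = ["aardvark", "beethoven", "football", "zebra"]   # sorted words
    pages_for = [                                              # parallel arrays
        ["zoo.html"],                          # pages containing "aardvark"
        ["composers.html", "symphony.html"],   # pages containing "beethoven"
        ["sports.html"],                       # pages containing "football"
        ["zoo.html", "safari.html"],           # pages containing "zebra"
    ]

    def lookup(word):
        i = bisect_left(lexicon, word)         # binary search in the lexicon
        if i < len(lexicon) and lexicon[i] == word:
            return pages_for[i]
        return []

    print(lookup("zebra"))    # ['zoo.html', 'safari.html']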

As an aside, if you have millions of words in your lexicon, and you want to do lots of searches, even binary search might be too slow a look-up method. A technique called hashing tries to jump exactly to the right place, without even having to keep splitting the array in half. The basic idea is to be able to compute, for each item that you want to be able to look up, a unique number that is its index in the array. Ideally, the number is easy to compute, and no two items give the same number. Two numeric representations of "chris", using "a"=0, ..., "z"=25:

  • Sum up the letters: "c"=2 + "h"=7 + "r"=17 + ....
  • Use base 26: 2*26^4 + 7*26^3 + 17*26^2 + ....

A problem with the first is that different words could easily sum up to the same number. A problem with the second is that the numbers are going to get way too huge. So what is often done is to take an approach kind of like the second, but "rolling over" when the number gets past a certain size. There can still be collisions (two words giving the same number), and one might have to do a little search to resolve them, or one might have to work a little harder on the hash function. Base 26 is actually not the best base to use in such a case (prime numbers lead to fewer collisions; anecdotally, 33, although not prime, works well).
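
Here is a sketch of that kind of hash function (Python; the table size and the choice of 33 as a base are just the details mentioned above, and real hash functions vary):

    # A polynomial ("rolling") hash: like the base-26 idea, but taken modulo the
    # table size so the number never gets too big.  Collisions are still possible.
    TABLE_SIZE = 1000003            # a large prime table size is a common choice

    def simple_hash(word, base=33):
        h = 0
        for ch in word:
            h = (h * base + (ord(ch) - ord("a"))) % TABLE_SIZE
        return h

    print(simple_hash("chris"))     # an index into an array of TABLE_SIZE slots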

Back to web search. While typically we no longer have predefined categories, we can store a document under all the words that it contains (somehow weighting how important each word is relative to that document). Then, given a query, we can find which documents contain all the query words.
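
Concretely, with an index mapping each word to the set of pages containing it, answering a query amounts to intersecting those sets (a sketch with invented index contents; weighting is ignored here):

    # Given an index from word -> set of pages containing it, a query matches
    # the pages that contain all of the query words.
    index = {
        "zoo":     {"zoo.html", "safari.html"},
        "zebra":   {"zoo.html", "safari.html", "stripes.html"},
        "feeding": {"zoo.html", "recipes.html"},
    }

    def search(query_words):
        results = None
        for word in query_words:
            pages = index.get(word, set())
            results = pages if results is None else results & pages
        return results or set()

    print(search(["zoo", "zebra", "feeding"]))   # {'zoo.html'}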

An extension of this idea actually lets us easily compare two documents, to see how similar they are. Convert each document to a big array that says how many times each word in the lexicon appears in that document (e.g., "aardvark":3 ... "zebra":8 -- must be a zoo article). We should probably weight words by how interesting they are ("the" wouldn't be interesting at all); note that we can use our index of the web to automatically determine how interesting a word is. Then just compare the arrays to see how similar they are (e.g., with a dot product). Since the raw counts might make it hard to see the similarity between a big zoo article and a small one (the counts are too different), we might want to convert them to percentages. We also have to recognize synonyms.
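
A sketch of that comparison (Python; weighting is simplified to plain word frequencies, i.e., the percentages just mentioned, and a real system would also down-weight boring words and handle synonyms):

    # Turn each document into word frequencies (counts divided by length), then
    # compare two documents with a dot product over the words they share.
    from collections import Counter

    def word_frequencies(text):
        words = text.lower().split()
        counts = Counter(words)
        return {w: c / len(words) for w, c in counts.items()}

    def similarity(doc_a, doc_b):
        fa, fb = word_frequencies(doc_a), word_frequencies(doc_b)
        return sum(fa[w] * fb[w] for w in fa if w in fb)   # dot product

    zoo_a = "the zebra and the aardvark live at the zoo"
    zoo_b = "a small zoo with one zebra and one aardvark"
    music = "beethoven wrote nine symphonies"
    print(similarity(zoo_a, zoo_b), similarity(zoo_a, music))   # zoo pair scores higher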

Ranking

At the heart of Google is the PageRank algorithm, described in the paper by Brin and Page from when Google was still at Stanford. PageRank takes advantage of the hypertext nature of the web. Pages aren't independent, but cross-reference each other. Intuitively, a page to which lots of pages link is likely to be a more "authoritative" reference than some random page out there that happens to have some of the query words. At the same time, if a page that is considered "authoritative" itself refers to a particular page, then that page is probably relevant. (This is analogous to citations in any discipline: win respect by being cited by a variety of people, or by one big-name person.)

The PageRank paper gives another intuition for ranking. Imagine a random surfer that just clicks on links, going from one page to another, maybe starting over every once in a while. How likely are they to end up on a particular page? It depends on how many links there are to that page, and whether those links are from obscure pages or from pages that are themselves likely to be hit.

The definition is kind of circular: if you are authoritative, you make the pages you link to more authoritative, but how did you get authoritative to begin with? It actually sets up a nice linear algebra problem (the ranks turn out to be an eigenvector of a matrix built from the links), and I encourage the mathematically-inclined to read more about it.
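
One standard way around the circularity is to iterate: start everyone out equal and repeatedly recompute the ranks until they settle down. A sketch (Python, with a tiny invented link graph; this is the textbook iteration, not Google's actual implementation):

    # PageRank by repeated updating: a page's rank is (mostly) the sum of the
    # ranks of the pages linking to it, each divided by that linking page's
    # number of outgoing links.  The damping factor d is the chance the random
    # surfer keeps clicking links rather than starting over somewhere random.
    links = {                        # a tiny made-up web: page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }

    def pagerank(links, d=0.85, iterations=50):
        n = len(links)
        rank = {page: 1.0 / n for page in links}      # start out equal
        for _ in range(iterations):
            new_rank = {page: (1 - d) / n for page in links}
            for page, outgoing in links.items():
                for target in outgoing:
                    new_rank[target] += d * rank[page] / len(outgoing)
            rank = new_rank
        return rank

    print(pagerank(links))    # "c", linked to by three pages, comes out on top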

Unfortunately, PageRank can be abused (i.e., one can artificially inflate one's PageRank). It also suffers from problems analogous to those with citations. There are extensions and variations, but the basic idea does seem to nicely capture our intuition for how to rank the bazillions (or googols) of search hits.

Clustering

While a manual directory like the Yahoo! of old might be out of reach, there is certainly something to be said for a hierarchical directory of documents. Some search engines (e.g., Vivisimo's Clusty) seek to automatically provide such an organization for the hits returned by a search. Google News automatically clusters stories on the same topic.

In general, clustering algorithms seek to automatically group a set of objects into clusters, such that the objects within a cluster are relatively similar to each other and relatively dissimilar to the objects in other clusters. We've already discussed what it means for two documents to be similar (although there are certainly other definitions).

How do we cluster things, once we know how to evaluate their similarity? There are lots of algorithms for that, too. To get an intuition for what they're like, let's consider hierarchical clustering, which generates a nice nested structure. Start with the two most similar objects, and group them together. Then group the next most-similar pair, then the next, etc. One question: when we've grouped objects together, how do we evaluate their joint similarity to other objects? One simple approach is to average: the similarity of the group {a,b} to the object c is the average of the a to c similarity and the b to c similarity.
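
A sketch of that procedure (Python; the similarity numbers are invented, and group-to-group similarity is the average just described):

    # Agglomerative (bottom-up) hierarchical clustering: repeatedly merge the two
    # most-similar clusters, scoring group similarity by averaging the pairwise
    # similarities of the members.
    from itertools import combinations

    def cluster(items, sim):
        """sim(a, b) gives the similarity between two of the original items."""
        clusters = [[item] for item in items]            # start with singletons
        merges = []                                      # record of what got merged
        def group_sim(c1, c2):                           # average over member pairs
            return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
        while len(clusters) > 1:
            c1, c2 = max(combinations(clusters, 2), key=lambda pair: group_sim(*pair))
            clusters.remove(c1)
            clusters.remove(c2)
            clusters.append(c1 + c2)
            merges.append((c1, c2))
        return merges

    # Example with a made-up similarity table.
    table = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.3}
    sim = lambda x, y: table.get((x, y), table.get((y, x), 0.0))
    print(cluster(["a", "b", "c"], sim))
    # [(['a'], ['b']), (['c'], ['a', 'b'])] -- a and b group first, then c joins them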