Open Source Version of LinkSearch
In the open source version, we expect to combine standard information retrieval techniques with
links and link paths. For now, we shall ignore the content of web pages
altogether and just concentrate on the content of the hypertext links.
By standard techniques, I mean tf*idf term weighting with cosine
normalization, as used in so many systems.
(For a reference, see any information retrieval book.)
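To make the "standard techniques" concrete, here is a minimal Python
sketch of tf*idf weighting with cosine normalization. The helper names
(tfidf_vector, cosine_similarity) and the way document frequencies are
passed in are just my own choices for illustration, not anything
specific to LinkSearch:

    import math

    def tfidf_vector(term_counts, doc_freq, num_docs):
        # term_counts: raw term frequencies for one "document" (here, a
        # link profile); doc_freq: term -> number of documents containing
        # that term.  One common variant: weight = tf * log(N / df),
        # then cosine-normalize the resulting vector.
        weights = {}
        for term, tf in term_counts.items():
            df = doc_freq.get(term, 1)
            weights[term] = tf * math.log(num_docs / df)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        return {term: w / norm for term, w in weights.items()}

    def cosine_similarity(vec_a, vec_b):
        # Dot product of two already cosine-normalized sparse vectors.
        return sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())

A link profile (or a path's dynamic profile, described below) would be
run through tfidf_vector before being compared to the similarly
weighted query vector.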
I propose that we apply this standard technique to hypertext links &
paths in the following way:
- for each node in the web graph, you create a static node profile that
contains the text from all of its incoming links --- but not the
content of the web page for now. (Like standard IR
profiles, you would keep track of word frequencies among the links;
see the sketch below.)
- when processing a query, say a set of words Q, you find paths in the web
where the "union" of static profiles of the nodes includes all the words
in Q. Specifically, we want to find all paths P = (x_1, ..., x_m)
so that there exists a partition of Q, say Q_1, ..., Q_m,
such that the link profile of x_i contains Q_i. For example, the
query "cse 142 course web" may yield the path
(URL for CSE, URL for UW CSE Education, URL for CSE Course Web, URL for CSE 142).
The path need not be minimal. For example, in this case, a shorter path
containing only the last two urls may contain all query words. Yet, we also consider
the longer non-minimal path in our computation.
One way to find these paths is to observe that each node must
contribute at least one distinct word of Q. So, as we extend a path, we
simply ensure that each new node contributes a query word not yet
covered. At each step, we also check whether the path built so far
contains all the query words.
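To make this concrete, here is a rough Python sketch of both the static
profiles and this simple forward search. The graph representation
(incoming_links, out_edges) and all the function names are placeholders
of mine, not actual LinkSearch code:

    import re
    from collections import Counter, defaultdict

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_static_profiles(incoming_links):
        # incoming_links: node -> list of (source_node, anchor_text).
        # A node's static profile is a word-frequency count over the
        # anchor text of all its incoming links; page content is ignored.
        profiles = defaultdict(Counter)
        for node, links in incoming_links.items():
            for _source, anchor_text in links:
                profiles[node].update(tokenize(anchor_text))
        return profiles

    def find_covering_paths(out_edges, profiles, query_terms, max_len=4):
        # Find paths (x_1, ..., x_m) in which every node contributes at
        # least one query word not already covered, and the path as a
        # whole covers all the query words.
        query_terms = set(query_terms)
        results = []

        def grow(path, covered):
            if covered == query_terms:      # path covers the whole query
                results.append(list(path))
            if len(path) == max_len:
                return
            for nxt in out_edges.get(path[-1], ()):
                new_words = (query_terms & set(profiles.get(nxt, ()))) - covered
                if new_words:               # each new node must contribute
                    grow(path + [nxt], covered | new_words)

        for start, profile in profiles.items():
            start_words = query_terms & set(profile)
            if start_words:
                grow([start], start_words)
        return results

For the example above, one would call something like
find_covering_paths(out_edges, profiles, ["cse", "142", "course", "web"])
and expect paths such as the four-node one shown.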
The search for paths at query time is not that expensive; for
example, the following precomputed function can make things very fast:
given a query term t and a target node T, return all source nodes S
that have an incoming link containing term t and an edge to node T. You
can then use this precomputed function to search backwards for paths
whose node profiles contain the query terms. I tried this and it is
fast --- at least on the UW CSE web.
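As a rough sketch of that precomputed function and the backward search
(again with made-up names, and reusing the profiles built above):

    from collections import defaultdict

    def build_backward_index(out_edges, profiles):
        # Precompute: (term t, target node T) -> all source nodes S whose
        # static profile contains t and that have an edge to T.
        index = defaultdict(set)
        for source, targets in out_edges.items():
            source_terms = set(profiles.get(source, ()))
            for target in targets:
                for term in source_terms:
                    index[(term, target)].add(source)
        return index

    def backward_paths(index, profiles, query_terms, end_node, max_len=4):
        # Grow paths backwards from end_node; every prepended node must
        # contribute at least one query word that is not yet covered.
        query_terms = set(query_terms)
        results = []

        def grow(path, covered):
            if covered == query_terms:
                results.append(list(path))
            if len(path) == max_len:
                return
            first = path[0]
            candidates = set()
            for term in query_terms - covered:
                candidates |= index.get((term, first), set())
            for src in candidates:
                new_words = (query_terms & set(profiles.get(src, ()))) - covered
                if new_words and src not in path:
                    grow([src] + path, covered | new_words)

        start_words = query_terms & set(profiles.get(end_node, ()))
        if start_words:
            grow([end_node], start_words)
        return results

In practice, you would call backward_paths for each node whose profile
contains at least one query word.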
Now, these paths induce the set of urls returned for Q.
The urls are ranked as follows. Consider a url U induced by Q.
Let P_1, P_2, ..., P_k be the set of paths that induce U given the query Q.
For each path P_i:
- compute the dynamic profile of P_i, which is basically the "union"
of the static profiles (described above) of the nodes on the path
- compute the similarity of P_i and Q by comparing P_i's dynamic
profile with the words in Q using standard IR techniques; call this S_i
- compute the probability that a user would follow path P_i when
browsing the web (instead of performing the query Q); basically, this
is proportional to the indegree of the first node of the path P_i.
This probability is the "importance" of the URL U, via path P_i, with
respect to the query Q; call it "I_i".
Now, to determine the rank of U, we compute:

  sum ( I_i * S_i )   over all paths P_i that induce U

The larger the sum, the higher U should be ranked.
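Here is a rough Python sketch of this ranking step. The indegree table
is assumed to be precomputed, and the simple cosine measure over raw
counts is just a stand-in for the tf*idf weighting described earlier;
none of these names come from LinkSearch itself:

    import math
    from collections import Counter

    def dynamic_profile(path, profiles):
        # "Union" of the static profiles of the nodes on the path.
        merged = Counter()
        for node in path:
            merged.update(profiles[node])
        return merged

    def cosine_sim(profile, query_terms):
        # Cosine between a raw term-frequency profile and a binary query
        # vector; a real version would use tf*idf weights instead.
        dot = sum(profile.get(t, 0) for t in query_terms)
        norm_p = math.sqrt(sum(c * c for c in profile.values())) or 1.0
        norm_q = math.sqrt(len(query_terms)) or 1.0
        return dot / (norm_p * norm_q)

    def rank_url(paths_for_url, profiles, query_terms, indegree):
        # paths_for_url: the paths P_1, ..., P_k that induce url U for Q.
        total_in = float(sum(indegree.values())) or 1.0
        score = 0.0
        for path in paths_for_url:
            s_i = cosine_sim(dynamic_profile(path, profiles), set(query_terms))
            i_i = indegree.get(path[0], 0) / total_in   # "importance" I_i
            score += i_i * s_i                          # sum of I_i * S_i
        return score

Urls would then be sorted by this score, highest first.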
Some intuition:
- the reason we look at links/paths instead of page content is that
the information in them is of much higher quality; you can really
tell what a page is about by looking at the links/paths for that page
- another reason is to handle the synonym problem: some people may
describe a url with one set of words while others use different words
for it (and the same thing happens in search engine queries)
- the reason we do not simply look at the text in a path of links but
rather consider all incoming links on each node in the path is to
increase recall and make the system "more relaxed"; the query words
may not appear in the links along the path itself but only in other
incoming links to nodes on the path
- the reason we sum over all the paths instead of combining the path
dynamic profiles into one big profile is to prevent "bad paths" for
query Q from messing up the "good paths" for query Q.
Now for the real-time features.
It seems to me that local search engines should be designed quite
differently from global ones. In particular, there are different
tradeoffs to be made. In local search engines...
- one is not so worried about people abusing the system to make their
pages rank higher, so sophisticated "importance" algorithms such as those
used in Google are not needed; much simpler algorithms that still provide
high-quality rankings can be used in an incremental, real-time manner
- one has access to the web server logs, which provide a rich source of
information about which urls are important and should be crawled,
when new urls are created, when urls are broken, etc.
- database compression is not that big of an issue so a dynamically
changing database that adapts to what is important at any moment
in time is feasible; at the same time, compression can be done by omitting
URLs that few people look at.
Specifically, I am thinking of a local search engine that works as
follows:
- an initial crawl is not required; rather the crawl is a continuous process
- whenever an entry is made in the web server log (because a local link
is reached), the search engine immediately crawls --- if it has not done
so already --- the page of the referer URL and possibly also the
page for the URL reached (see the sketch after this list)
- the results are immediately put into the index
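To make the log-driven crawl concrete, here is a rough Python sketch. I
am assuming a combined-format access log with a referer field;
crawler.fetch and index.add are placeholders for whatever crawler and
indexer the open source version ends up with:

    import re

    # Combined log format: ... "request" status bytes "referer" "user-agent"
    LOG_LINE = re.compile(r'"(?P<request>[^"]*)" \d+ \S+ "(?P<referer>[^"]*)"')

    def watch_log(log_lines, crawler, index, crawled):
        # For every new log entry, crawl the referer page (and the page
        # that was reached) if it has not been crawled already, and put
        # the results straight into the index.
        for line in log_lines:
            m = LOG_LINE.search(line)
            if not m:
                continue
            referer = m.group("referer")
            parts = m.group("request").split()
            reached = parts[1] if len(parts) >= 2 else None
            for url in (referer, reached):
                # A real system would also restrict this to local urls.
                if url and url != "-" and url not in crawled:
                    page = crawler.fetch(url)   # placeholder crawler call
                    index.add(url, page)        # placeholder indexing call
                    crawled.add(url)

In practice this would tail the live access log so the index adapts as
users browse.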
This approach has several advantages:
- it is real-time and adapts quickly to the browsing requirements
of users; no crawl depth limit is required
- whenever a user adds a new url to a web page in the site, they will
likely check it --- and this immediately causes the search engine
to crawl the relevant pages
- whenever a user deletes a url, the search engine will update its index
as soon as someone runs into it
- whenever a user changes the destination of a url, the change will also
be picked up when someone runs into it
- the search engine database can act as a cache, with less frequently
visited urls dropped dynamically while new ones are added on the fly.
Comments are welcome in the
public forums!
Amir Michail