Open Source Version of LinkSearch
In the open source version, we expect to combine standard information retrieval techniques with
links and link paths. For now, we shall ignore the content of web pages
altogether and just concentrate on the content of the hypertext links.
By standard techniques, I mean tf*idf term weighting with cosine
normalization, as used in so many systems.
(For a reference, see any information retrieval book.)
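To make the "standard techniques" concrete, here is a minimal Python
sketch of tf*idf weighting with cosine normalization. The helper names
(tfidf_vector, cosine_similarity) and the way document frequencies are
passed in are just my own choices for illustration, not anything
specific to LinkSearch:

    import math

    def tfidf_vector(term_counts, doc_freq, num_docs):
        # term_counts: raw term frequencies for one "document" (here, a
        # link profile); doc_freq: term -> number of documents containing
        # that term.  One common variant: weight = tf * log(N / df),
        # then cosine-normalize the resulting vector.
        weights = {}
        for term, tf in term_counts.items():
            df = doc_freq.get(term, 1)
            weights[term] = tf * math.log(num_docs / df)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        return {term: w / norm for term, w in weights.items()}

    def cosine_similarity(vec_a, vec_b):
        # Dot product of two already cosine-normalized sparse vectors.
        return sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())

A link profile (or a path's dynamic profile, described below) would be
run through tfidf_vector before being compared to the similarly
weighted query vector.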
I propose that we apply this standard technique to hypertext links &
paths in the following way:
- for each node in the web graph, you create a static node profile that
contains the text from all of its incoming links --- but not the
content of the web page for now. (Like standard IR
profiles, you would keep track of word frequencies among the links;
see the sketch below.)
- when processing a query, say a set of words Q, you find paths in the web
where the "union" of static profiles of the nodes includes all the words
in Q. Specifically, we want to find all paths P = (x_1, ..., x_m)
so that there exists a partition of Q, say Q_1, ..., Q_m,
such that the link profile of x_i contains Q_i. For example, the
query "cse 142 course web" may yield the path
(URL for CSE, URL for UW CSE Education, URL for CSE Course Web, URL for CSE 142).
The path need not be minimal. For example, in this case, a shorter path
containing only the last two urls may contain all query words. Yet, we also consider
the longer non-minimal path in our computation.
One way to find these paths is to observe that each node must
contribute at least one distinct word of Q. So, as we extend a path, we
simply ensure that each new node contributes a query word not yet
covered. At each step, we also check whether the path built so far
contains all the query words.
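To make this concrete, here is a rough Python sketch of both the static
profiles and this simple forward search. The graph representation
(incoming_links, out_edges) and all the function names are placeholders
of mine, not actual LinkSearch code:

    import re
    from collections import Counter, defaultdict

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_static_profiles(incoming_links):
        # incoming_links: node -> list of (source_node, anchor_text).
        # A node's static profile is a word-frequency count over the
        # anchor text of all its incoming links; page content is ignored.
        profiles = defaultdict(Counter)
        for node, links in incoming_links.items():
            for _source, anchor_text in links:
                profiles[node].update(tokenize(anchor_text))
        return profiles

    def find_covering_paths(out_edges, profiles, query_terms, max_len=4):
        # Find paths (x_1, ..., x_m) in which every node contributes at
        # least one query word not already covered, and the path as a
        # whole covers all the query words.
        query_terms = set(query_terms)
        results = []

        def grow(path, covered):
            if covered == query_terms:      # path covers the whole query
                results.append(list(path))
            if len(path) == max_len:
                return
            for nxt in out_edges.get(path[-1], ()):
                new_words = (query_terms & set(profiles.get(nxt, ()))) - covered
                if new_words:               # each new node must contribute
                    grow(path + [nxt], covered | new_words)

        for start, profile in profiles.items():
            start_words = query_terms & set(profile)
            if start_words:
                grow([start], start_words)
        return results

For the example above, one would call something like
find_covering_paths(out_edges, profiles, ["cse", "142", "course", "web"])
and expect paths such as the four-node one shown.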
The search for paths at query time is not that expensive; for
example, the following precomputed function can make things very fast:
given a query term t and a target node T, return all source nodes S
that have an incoming link containing term t and an edge to node T. You
can then use this precomputed function to search backwards for paths
whose node profiles contain the query terms. I tried this and it is
fast --- at least on the UW CSE web.
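As a rough sketch of that precomputed function and the backward search
(again with made-up names, and reusing the profiles built above):

    from collections import defaultdict

    def build_backward_index(out_edges, profiles):
        # Precompute: (term t, target node T) -> all source nodes S whose
        # static profile contains t and that have an edge to T.
        index = defaultdict(set)
        for source, targets in out_edges.items():
            source_terms = set(profiles.get(source, ()))
            for target in targets:
                for term in source_terms:
                    index[(term, target)].add(source)
        return index

    def backward_paths(index, profiles, query_terms, end_node, max_len=4):
        # Grow paths backwards from end_node; every prepended node must
        # contribute at least one query word that is not yet covered.
        query_terms = set(query_terms)
        results = []

        def grow(path, covered):
            if covered == query_terms:
                results.append(list(path))
            if len(path) == max_len:
                return
            first = path[0]
            candidates = set()
            for term in query_terms - covered:
                candidates |= index.get((term, first), set())
            for src in candidates:
                new_words = (query_terms & set(profiles.get(src, ()))) - covered
                if new_words and src not in path:
                    grow([src] + path, covered | new_words)

        start_words = query_terms & set(profiles.get(end_node, ()))
        if start_words:
            grow([end_node], start_words)
        return results

In practice, you would call backward_paths for each node whose profile
contains at least one query word.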
Now, these paths induce the set of urls returned for Q.
The urls are ranked as follows. Consider a url U induced by Q.
Let P_1, P_2, ..., P_k be the set of paths that induce U given the query Q.
For each path P_i:
- compute the dynamic profile of P_i, which is basically the "union"
of the static profiles (described above) of the nodes on the path
- compute the similarity of P_i and Q by comparing P_i's dynamic
profile with the words in Q using standard IR techniques; call this S_i
- compute the probability that a user would follow path P_i when
browsing the web (instead of performing the query Q); basically, this
is proportional to the indegree of the first node of the path P_i.
This probability is the "importance" of the URL U, via path P_i, with
respect to the query Q; call it "I_i".
Now, to determine the rank of U, we compute:

  sum ( I_i * S_i )   over all paths P_i that induce U

The larger the sum, the higher U should be ranked.
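Here is a rough Python sketch of this ranking step. The indegree table
is assumed to be precomputed, and the simple cosine measure over raw
counts is just a stand-in for the tf*idf weighting described earlier;
none of these names come from LinkSearch itself:

    import math
    from collections import Counter

    def dynamic_profile(path, profiles):
        # "Union" of the static profiles of the nodes on the path.
        merged = Counter()
        for node in path:
            merged.update(profiles[node])
        return merged

    def cosine_sim(profile, query_terms):
        # Cosine between a raw term-frequency profile and a binary query
        # vector; a real version would use tf*idf weights instead.
        dot = sum(profile.get(t, 0) for t in query_terms)
        norm_p = math.sqrt(sum(c * c for c in profile.values())) or 1.0
        norm_q = math.sqrt(len(query_terms)) or 1.0
        return dot / (norm_p * norm_q)

    def rank_url(paths_for_url, profiles, query_terms, indegree):
        # paths_for_url: the paths P_1, ..., P_k that induce url U for Q.
        total_in = float(sum(indegree.values())) or 1.0
        score = 0.0
        for path in paths_for_url:
            s_i = cosine_sim(dynamic_profile(path, profiles), set(query_terms))
            i_i = indegree.get(path[0], 0) / total_in   # "importance" I_i
            score += i_i * s_i                          # sum of I_i * S_i
        return score

Urls would then be sorted by this score, highest first.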
Some intuition:
- the reason we look at links/paths instead of page content is that
the information in them is of much higher quality; you can really
tell what a page is about by looking at the links/paths for that page
- another reason is to handle the synonym problem: some people may
describe a url with one set of words while others use different words
for it (and the same thing happens in search engine queries)
- the reason we do not simply look at the text in a path of links but
rather consider all incoming links on each node in the path is to
increase recall and make the system "more relaxed"; the query words
may not appear in the links along the path itself but only in other
incoming links to nodes on the path
- the reason we sum over all the paths instead of combining the path
dynamic profiles into one big profile is to prevent "bad paths" for
query Q from messing up the "good paths" for query Q.
Now for the real-time features.
It seems to me that local search engines should be designed quite
differently from global ones. In particular, there are different
tradeoffs to be made. In local search engines...
- one is not so worried about people abusing the system to make their
pages rank higher, so sophisticated "importance" algorithms such as those
used in Google are not needed; much simpler algorithms that still provide
high-quality rankings can be used in an incremental, real-time manner
- one has access to the web server logs, which provide a rich source of
information about which urls are important and should be crawled,
when new urls are created, when urls are broken, etc.
- database compression is not that big of an issue so a dynamically
changing database that adapts to what is important at any moment
in time is feasible; at the same time, compression can be done by omitting
URLs that few people look at.
Specifically, I am thinking of a local search engine that works as
follows:
- an initial crawl is not required; rather the crawl is a continuous process
- whenever an entry is made in the web server log (because a local link
is reached), the search engine immediately crawls --- if it has not done
so already --- the page of the referer URL and possibly also the
page for the URL reached (see the sketch after this list)
- the results are immediately put into the index
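To make the log-driven crawl concrete, here is a rough Python sketch. I
am assuming a combined-format access log with a referer field;
crawler.fetch and index.add are placeholders for whatever crawler and
indexer the open source version ends up with:

    import re

    # Combined log format: ... "request" status bytes "referer" "user-agent"
    LOG_LINE = re.compile(r'"(?P<request>[^"]*)" \d+ \S+ "(?P<referer>[^"]*)"')

    def watch_log(log_lines, crawler, index, crawled):
        # For every new log entry, crawl the referer page (and the page
        # that was reached) if it has not been crawled already, and put
        # the results straight into the index.
        for line in log_lines:
            m = LOG_LINE.search(line)
            if not m:
                continue
            referer = m.group("referer")
            parts = m.group("request").split()
            reached = parts[1] if len(parts) >= 2 else None
            for url in (referer, reached):
                # A real system would also restrict this to local urls.
                if url and url != "-" and url not in crawled:
                    page = crawler.fetch(url)   # placeholder crawler call
                    index.add(url, page)        # placeholder indexing call
                    crawled.add(url)

In practice this would tail the live access log so the index adapts as
users browse.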
This approach has several advantages:
- it is real-time and adapts quickly to the browsing requirements
of users; no crawl depth limit is required
- whenever a user adds a new url to a web page in the site, they will
likely check it --- and this immediately causes the search engine
to crawl the relevant pages
- whenever a user deletes a url, the search engine will update its index
as soon as someone runs into it
- whenever a user changes the destination of a url, the change will also
be picked up when someone runs into it
- the search engine database can act as a cache, with less frequently
visited urls dropped dynamically while new ones are added on the fly.
Comments are welcome in the
public forums!
Amir Michail