Web Graph Mining
- Like all hyperlinked environments, the World Wide Web can be seen as a directed graph: Web pages are the vertices of the graph, and hyperlinks between pages are its edges.
- This viewpoint has led to major advances in Web search, notably with the PageRank and HITS algorithms presented in this section.
- Moreover, extraction of knowledge from graphs, or graph mining, has been applied to graph structures other than the Web, for instance the graph of scientific publications, where edges are the citation links between publications. Cocitation analysis relies on the observation that two papers cited by roughly the same set of papers are likely to be similar.
- Other graphs amenable to this kind of analysis include graphs of dictionaries or encyclopedias, and graphs of social networks.
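The cocitation idea above can be sketched as a similarity score between two cited papers: the more the sets of papers citing them overlap, the more similar they are. A minimal illustration, with hypothetical paper names (the Jaccard measure used here is one common choice, not the only one):

```python
# Cocitation sketch: similarity of two papers = Jaccard overlap of the
# sets of papers that cite them. All paper names are illustrative.
cites = {                      # citing paper -> papers it cites
    "p1": ["a", "b"],
    "p2": ["a", "b"],
    "p3": ["a", "c"],
}

# Invert the citation graph: cited paper -> set of papers citing it.
cited_by = {}
for p, targets in cites.items():
    for t in targets:
        cited_by.setdefault(t, set()).add(p)

def cocitation_similarity(x, y):
    cx, cy = cited_by.get(x, set()), cited_by.get(y, set())
    return len(cx & cy) / len(cx | cy) if cx | cy else 0.0
```

Here "a" and "b" are cited together by p1 and p2, so their similarity is high, whereas "b" and "c" are never cocited.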
PageRank: Web Graph Mining
- PageRank, introduced with much success by the founders of the Google search engine, is a formalization of this idea.
- The PageRank of a page i can be defined informally as the probability pr(i) that the random surfer is on page i at some distant given point in the future.
- Consider for instance the basic graph on the left of the figure.
A random surfer will reach node A at step i if it reaches node B, C, or D at step i − 1. Therefore:
pr(A) = pr(B) + pr(C) + pr(D)
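The fixpoint computation behind PageRank can be sketched by power iteration. A minimal illustration on a hypothetical four-page graph (here B, C, and D each link only to A, and A links back to all three), using a standard damping factor rather than the simplified undamped equation above:

```python
# Minimal PageRank sketch by power iteration on a hypothetical graph:
# B, C, and D each link only to A; A links back to all three.
links = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
d = 0.85                            # damping factor
nodes = list(links)
n = len(nodes)
pr = {u: 1.0 / n for u in nodes}    # start from the uniform distribution
for _ in range(50):                 # iterate toward the fixpoint
    new = {u: (1 - d) / n for u in nodes}
    for u, outs in links.items():
        share = pr[u] / len(outs)   # a page splits its rank over its out-links
        for v in outs:
            new[v] += d * share
    pr = new
```

Since A collects the rank of three pages while each of B, C, and D receives only a third of A's, the iteration gives A the largest PageRank, as the equation above suggests.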
Online Computation: Web Graph Mining
- The computation of PageRank by iterative multiplication of a vector by the dampened transition matrix requires the storage of, and efficient access to, the entire Web matrix.
- This can be done on a cluster of PCs using an appropriate distributed storage technique (see the chapters devoted to distributed indexing and distributed computing).
- An alternative is to compute PageRank on the fly, while crawling the Web, by making use of the random-walk interpretation of PageRank.
This is what the following algorithm, known as OPIC for Online Page Importance Computation, does:
- Store a global cashflow G, initially 0.
- Store for each URL ‘u’ its cash C(u) and history H(u).
- Initially, each page has some initial cash c0.
- We crawl the Web with some crawling strategy that accesses each URL repeatedly.
- When accessing URL u:
- we set H(u) := H(u) + C(u);
- Moreover, we set C(u’) := C(u’) + C(u)/nu for each URL u’ pointed to by u, with nu the number of outgoing links of u;
- we set G := G + C(u) and C(u) := 0.
- At any given time, the PageRank of u can be approximated as H(u)/G.
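The steps above can be sketched as a small simulation. The three-page graph is hypothetical, and a uniformly random choice of page stands in for a real crawling strategy that accesses each URL repeatedly:

```python
import random

# OPIC sketch on a hypothetical three-page graph.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]}
c0 = 1.0
cash = {u: c0 for u in links}     # C(u): cash currently held by u
hist = {u: 0.0 for u in links}    # H(u): cash accumulated over history
G = 0.0                           # global cashflow

def visit(u):
    """One access of URL u by the crawler."""
    global G
    hist[u] += cash[u]                  # H(u) := H(u) + C(u)
    share = cash[u] / len(links[u])     # C(u)/n_u for each outgoing link
    for v in links[u]:
        cash[v] += share                # distribute u's cash to its successors
    G += cash[u]                        # G := G + C(u)
    cash[u] = 0.0                       # and C(u) := 0

random.seed(0)
for _ in range(10_000):
    visit(random.choice(list(links)))   # crawl strategy: a random page

estimate = {u: hist[u] / G for u in links}  # PageRank approximated as H(u)/G
```

Because every unit of cash credited to G is also credited to some page's history, the estimates H(u)/G always sum to 1, like a probability distribution.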
HITS: Web Graph Mining
- The HITS algorithm (Hypertext Induced Topic Selection) is another approach, proposed by Kleinberg.
- The main idea is to distinguish two kinds of Web pages: hubs and authorities.
- Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.
- As with PageRank, we use again a mutually recursive definition that will lead to an iterative fixpoint computation.
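This mutually recursive definition can be sketched as follows. The tiny graph (two hub-like pages pointing at two authority-like pages) is hypothetical, and in practice HITS is run on a query-dependent subgraph of the Web:

```python
# HITS sketch: iterate the mutual recursion between hub and authority
# scores to a fixpoint, normalizing at each step. Hypothetical graph:
# h1 points to a1 and a2, h2 points only to a1.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
nodes = list(links)
hub = {u: 1.0 for u in nodes}
auth = {u: 1.0 for u in nodes}
for _ in range(20):
    # authority of v = sum of the hub scores of pages pointing to v
    auth = {v: sum(hub[u] for u in nodes if v in links[u]) for v in nodes}
    # hub of u = sum of the authority scores of the pages u points to
    hub = {u: sum(auth[v] for v in links[u]) for u in nodes}
    # normalize so the scores stay bounded across iterations
    na = sum(auth.values()) or 1.0
    nh = sum(hub.values()) or 1.0
    auth = {v: a / na for v, a in auth.items()}
    hub = {u: h / nh for u, h in hub.items()}
```

As expected, a1 (pointed to by both hubs) ends up with a higher authority score than a2, and h1 (pointing to both authorities) with a higher hub score than h2.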
Spamdexing: Web Graph Mining
- The term spamdexing describes all fraudulent techniques used by unscrupulous Webmasters to artificially raise the visibility of their Web site to users of search engines.
- As with viruses and antiviruses, or spam and spam fighting, spamdexing and the fight against it are an unceasing series of techniques implemented by spamdexers, closely followed by counter-techniques deployed by search engines.
Discovering Communities on the Web
- The graph structure of the Web can be used beyond the computation of PageRank-like importance scores.
- Another important graph mining technique that can be applied to the Web is graph clustering:
- Using the structure of the graph to delimit homogeneous sets of Web pages (e.g., communities of Web sites about the same topic).
- The assumption is that closely connected sets of pages in the Web graph will share some common semantic characteristic.
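One simple way to illustrate this assumption (a sketch, not a specific clustering algorithm from the literature): keep only the links whose endpoints share many neighbors, then take the connected components of what remains as communities. The graph, the Jaccard similarity measure, and the 0.5 threshold are all illustrative choices:

```python
# Graph clustering sketch: two tightly knit triangles joined by a single
# cross link. Edges whose endpoints have dissimilar neighborhoods are
# dropped; the remaining connected components are the communities.
edges = {("a", "b"), ("b", "c"), ("a", "c"),   # one tight cluster
         ("x", "y"), ("y", "z"), ("x", "z"),   # another tight cluster
         ("c", "x")}                           # a single cross link
nbrs = {}
for u, v in edges:                             # treat links as undirected
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)

def jaccard(u, v):
    a, b = nbrs[u] | {u}, nbrs[v] | {v}        # closed neighborhoods
    return len(a & b) / len(a | b)

strong = [(u, v) for u, v in edges if jaccard(u, v) >= 0.5]

# Union-find over the remaining strong edges -> connected components.
parent = {u: u for u in nbrs}
def find(u):
    while parent[u] != u:
        parent[u] = parent[parent[u]]          # path halving
        u = parent[u]
    return u
for u, v in strong:
    parent[find(u)] = find(v)

comms = {}
for u in nbrs:
    comms.setdefault(find(u), set()).add(u)
```

On this toy graph the weak cross link c–x is discarded, and the two triangles come out as two separate communities.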