Crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web.
Basics of crawling
- Start with a given URL or set of URLs
- Retrieve and process the corresponding page
- Discover new URLs (see Sources of new URLs below)
- Repeat on each found URL
No real termination condition (virtually unlimited number of Web pages!). Traversal strategies:
- depth-first: not well adapted, with the risk of getting lost in robot traps
- breadth-first
- in practice, a combination of both: breadth-first crawling with limited-depth depth-first traversal of each discovered Web site (see the sketch below)
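A minimal sketch of this loop in Python, assuming two hypothetical helpers not shown here: fetch_page(url), which returns the page content, and extract_urls(content, base_url), which returns the URLs discovered in it. The traversal is breadth-first with a depth limit, as suggested above.

    from collections import deque

    MAX_DEPTH = 5  # assumed depth limit, guarding against robot traps

    def crawl(seed_urls, fetch_page, extract_urls):
        """Breadth-first crawl from a set of seed URLs.

        Yields (url, content) pairs so the caller can process each page.
        fetch_page and extract_urls are assumed helpers standing in for
        real HTTP and HTML-parsing code.
        """
        frontier = deque((url, 0) for url in seed_urls)  # queue of (URL, depth)
        seen = set(seed_urls)                            # never enqueue a URL twice
        while frontier:  # no natural termination: bounded here by MAX_DEPTH
            url, depth = frontier.popleft()
            content = fetch_page(url)   # retrieve the page
            yield url, content          # hand it over for processing
            if depth < MAX_DEPTH:
                for new_url in extract_urls(content, url):  # discover new URLs
                    if new_url not in seen:
                        seen.add(new_url)
                        frontier.append((new_url, depth + 1))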
Sources of new URLs
From HTML pages:
- hyperlinks <a href="…">…</a>
- other hyperlinked content (e.g., PDF files)
- Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): extract them with regular expressions (see the sketch below)
- Referrer URLs
- Sitemaps
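A sketch of the two HTML-based mechanisms using only the Python standard library; the regular expression is deliberately crude and only meant to illustrate the idea.

    import re
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    # Rough pattern for non-hyperlinked URLs appearing in plain text.
    URL_RE = re.compile(r'https?://[^\s"\'<>]+')

    class LinkExtractor(HTMLParser):
        """Collects the targets of <a href="..."> links, resolved against base_url."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.urls = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.urls.append(urljoin(self.base_url, value))

    def extract_urls(html_text, base_url):
        parser = LinkExtractor(base_url)
        parser.feed(html_text)
        # Also pick up URLs that appear as plain text, as noted above.
        return parser.urls + URL_RE.findall(html_text)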
Scope of a crawler
- The Web is infinite! Avoid robot traps by imposing depth or page-number limits on each Web server
- Focus on important pages
- Web servers under a list of DNS domains: easy filtering of URLs (see the sketch below)
- A given topic: focused crawling techniques based on classifiers of Web page content and predictors of the interest of a link
- The national Web (cf. legal deposit, national libraries): what is this?
- A given Web site: what is a Web site?
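Restricting the crawl to a list of DNS domains is the easy case; a minimal filter, with ALLOWED_DOMAINS an assumed configuration value:

    from urllib.parse import urlsplit

    # Assumed scope: the DNS domains the crawl is restricted to.
    ALLOWED_DOMAINS = ("example.com", "example.org")

    def in_scope(url):
        """Keep a URL only if its host is one of, or under, the allowed domains."""
        host = (urlsplit(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

Discovered URLs that fail this test are simply never added to the frontier.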
Identification of duplicate Web pages
Identifying duplicates or near-duplicates on the Web to prevent multiple indexing:
- trivial duplicates: same resource at the same canonized URL:
  - http://example.com:80/toto
  - http://example.com/titi/../toto
- exact duplicates: identification by hashing
- near-duplicates (timestamps, tip of the day, etc.): more complex!
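The two easy cases can be sketched with the standard library; near-duplicate detection (e.g., by shingling page content) takes considerably more machinery and is not shown. Both example URLs above map to the same canonical form:

    import hashlib
    from posixpath import normpath
    from urllib.parse import urlsplit, urlunsplit

    DEFAULT_PORTS = {"http": 80, "https": 443}

    def canonize(url):
        """Normalize trivial variations: default port, case of scheme and
        host, and dot segments such as /titi/../toto."""
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
            host = "%s:%d" % (host, parts.port)
        path = normpath(parts.path) if parts.path else "/"
        return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

    def content_hash(content):
        """Exact-duplicate detection: a hash of the raw page content."""
        return hashlib.sha1(content).hexdigest()

    assert canonize("http://example.com:80/toto") == canonize("http://example.com/titi/../toto")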
The standard for robot exclusion: a robots.txt file at the root of a Web server, e.g.:

    User-agent: *
    Allow: /searchhistory/
    Disallow: /search
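Python ships a parser for this format; a usage sketch, assuming the rules above are served at http://example.com/robots.txt (note that rp.read() performs an actual HTTP request):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # at the root of the server
    rp.read()  # fetches and parses the file

    # A compliant crawler checks every URL before retrieving it:
    rp.can_fetch("*", "http://example.com/search")          # False: Disallow applies
    rp.can_fetch("*", "http://example.com/searchhistory/")  # True: Allow applies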
- Per-page exclusion (de facto standard):
  <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
- Per-link exclusion (de facto standard):
  <a href="toto.html" rel="nofollow">Toto</a>
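A sketch of how a crawler might honor both conventions, again with the standard-library HTML parser (simplified; attribute handling in real pages is messier):

    from html.parser import HTMLParser

    class ExclusionChecker(HTMLParser):
        """Detects a robots <meta> tag and rel="nofollow" on individual links."""

        def __init__(self):
            super().__init__()
            self.noindex = False       # page-wide: do not index this page
            self.nofollow = False      # page-wide: do not follow its links
            self.nofollow_links = []   # per-link exclusions

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                content = (attrs.get("content") or "").upper()
                self.noindex = "NOINDEX" in content
                self.nofollow = "NOFOLLOW" in content
            elif tag == "a":
                rel = (attrs.get("rel") or "").lower()
                if "nofollow" in rel.split() and attrs.get("href"):
                    self.nofollow_links.append(attrs["href"])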
Avoid denial of service (DoS): wait between 100 ms and 1 s between two successive requests to the same Web server.
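A minimal per-host rate limiter implementing this rule; DELAY is the assumed politeness interval, anywhere in the suggested 100 ms to 1 s range:

    import time

    DELAY = 1.0        # seconds between two requests to the same server (assumed)
    last_access = {}   # host -> time of the most recent request

    def polite_wait(host):
        """Block until at least DELAY seconds have passed since the last
        request to this host, then record the new access time."""
        now = time.monotonic()
        last = last_access.get(host)
        if last is not None and now - last < DELAY:
            time.sleep(DELAY - (now - last))
        last_access[host] = time.monotonic()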