Effective Web-Scale Crawling Through Website Analysis
Web crawlers are usually divided into two broad classes: full web crawlers and focused crawlers, which target specific sites or pages. This paper presents an overview of a self-focusing crawler, together with experimental results. The system begins as a full web crawl and is given a set of features that are of interest to the crawler's client. As the crawl moves through the general web, it systematically samples and analyzes web sites, biasing its effort toward sites that exhibit the specified features. The crawler employs lightweight heuristics and a unique architecture that lets it accurately score unseen pages from a known site without keeping a record for every page on the World Wide Web.
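The site-level scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `SiteScorer`, the `is_relevant` callback, the smoothing prior, and the use of the URL host as the site key are all assumptions made for illustration. The point it shows is that state is kept per site, so an unseen page inherits its site's estimate and no per-page record is needed.

```python
from urllib.parse import urlparse


def site_of(url):
    """Return the host part of a URL, used here as the site key (an assumption)."""
    return urlparse(url).netloc.lower()


class SiteScorer:
    """Hypothetical scorer that keeps two counters per site, none per page."""

    def __init__(self, is_relevant, prior=0.5, prior_weight=2.0):
        self.is_relevant = is_relevant    # hypothetical page-level feature test
        self.prior = prior                # assumed score for a site never sampled
        self.prior_weight = prior_weight  # pseudo-sample count given to the prior
        self.sampled = {}                 # site -> number of pages sampled
        self.hits = {}                    # site -> sampled pages judged relevant

    def observe(self, url, page_text):
        """Fold one sampled page into its site's counters."""
        site = site_of(url)
        self.sampled[site] = self.sampled.get(site, 0) + 1
        if self.is_relevant(page_text):
            self.hits[site] = self.hits.get(site, 0) + 1

    def score(self, url):
        """Score any URL, seen or not, from its site's smoothed hit rate."""
        site = site_of(url)
        n = self.sampled.get(site, 0)
        h = self.hits.get(site, 0)
        return (h + self.prior * self.prior_weight) / (n + self.prior_weight)


def pick_next(frontier, scorer):
    """Bias the crawl: choose the frontier URL whose site scores highest."""
    return max(frontier, key=scorer.score)


# Example: the client is interested in pages that mention "recipe".
scorer = SiteScorer(lambda text: "recipe" in text)
scorer.observe("http://a.example/1", "a recipe for bread")
scorer.observe("http://a.example/2", "another recipe")
scorer.observe("http://a.example/3", "contact us")
scorer.observe("http://b.example/1", "stock prices")
scorer.observe("http://b.example/2", "market news")

frontier = ["http://b.example/new", "http://a.example/new", "http://c.example/x"]
best = pick_next(frontier, scorer)
```

In this sketch memory grows with the number of sites sampled rather than the number of pages crawled. The smoothing prior keeps a site with only a few samples from being scored as certainly relevant or certainly irrelevant.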
González, I., Marcus, A., Meredith, D. N., and Nguyen, L. A. 2006. Effective web-scale crawling through website analysis. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006). WWW '06. ACM Press, New York, NY, 1041-1042.