Effective Web-Scale Crawling Through Website Analysis

  • Ivan Gonzalez, Carnegie Mellon University, USA
  • Adam Marcus, Rensselaer Polytechnic Institute, USA
  • Daniel Meredith, IBM Almaden Research Center, USA
  • Linda Nguyen, IBM Almaden Research Center, USA

Track: Posters

Web crawlers are usually divided into two broad classes: full web crawlers and focused (site- or page-specific) crawlers. This paper gives a general overview of, and experimental results for, a self-focusing crawler. The system starts as a full web crawl that is given a set of features of interest to the crawler's client. As it moves through the general web, the crawl systematically samples and analyzes web sites, biasing its effort toward sites that exhibit the specified attributes. The crawl employs lightweight heuristics and a unique architecture that lets it accurately score unseen pages from a known site without requiring a record for every page on the World Wide Web.
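The idea of scoring unseen pages from per-site statistics can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`SiteStats`, `SelfFocusingFrontier`), the smoothed relevance ratio, the `prior` and `prior_weight` parameters, and the use of the host name as the "site" are all assumptions made for illustration. The sketch keeps one small record per site, updates it from sampled pages, and orders the crawl frontier by the score of each URL's site.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Hypothetical site key: the lower-cased host name of the URL."""
    return urlparse(url).netloc.lower()


class SiteStats:
    """One small record per site (not per page)."""

    def __init__(self):
        self.sampled = 0   # pages sampled from this site so far
        self.relevant = 0  # sampled pages that had the features of interest

    def score(self, prior=0.5, prior_weight=2.0):
        # Assumed heuristic: smoothed fraction of relevant samples, so that
        # unknown sites start at `prior` rather than at 0 or 1.
        return (self.relevant + prior * prior_weight) / (self.sampled + prior_weight)


class SelfFocusingFrontier:
    """Crawl frontier that prefers URLs from sites that sampled well."""

    def __init__(self, is_relevant):
        self.is_relevant = is_relevant       # client-supplied feature test
        self.sites = defaultdict(SiteStats)
        self._heap = []
        self._counter = 0                    # tie-breaker: first in, first out

    def record_sample(self, url, page_text):
        """Update the site's record from one fetched (sampled) page."""
        stats = self.sites[site_of(url)]
        stats.sampled += 1
        if self.is_relevant(page_text):
            stats.relevant += 1

    def score_url(self, url):
        """Score an unseen page using only its site's record."""
        return self.sites[site_of(url)].score()

    def push(self, url):
        # Priority is fixed at push time; a real crawler would likely
        # re-score queued URLs as site statistics change.
        heapq.heappush(self._heap, (-self.score_url(url), self._counter, url))
        self._counter += 1

    def pop(self):
        """Return the queued URL whose site currently looks most relevant."""
        return heapq.heappop(self._heap)[2]
```

In this sketch, a client interested in pages mentioning "crawler" would pass something like `lambda text: "crawler" in text` as `is_relevant`, call `record_sample` for each sampled page, and then `push` newly discovered URLs. Memory grows with the number of sites seen, not the number of pages, which is the property the abstract emphasizes.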


Gonzalez, I., Marcus, A., Meredith, D. N., and Nguyen, L. A. 2006. Effective web-scale crawling through website analysis. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 1041-1042.
DOI= http://doi.acm.org/10.1145/1135777.1136005
