Do not Crawl in the DUST: Different URLs with Similar Text

  • Uri Schonfeld, Technion - Israel Institute of Technology, Israel
  • Ziv Bar-Yossef, Technion - Israel Institute of Technology, Israel
  • Idit Keidar, Technion - Israel Institute of Technology, Israel

Track: Posters

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, because web server software often uses aliases and redirections, translates URLs to some canonical form, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL into others that are likely to have similar content. DustBuster detects DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can use this information to crawl more effectively, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
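To make the idea of a URL transformation rule concrete, here is a minimal sketch in Python. It is not the DustBuster algorithm itself, which the abstract does not spell out. It assumes a rule is a simple substring substitution (replace `alpha` with `beta`), counts how many URL pairs in a log the rule connects, and groups URLs that the rules map to the same form. The function names, the example rules, and the `sample_log` URLs are all hypothetical, and the sampling step that fetches pages to validate a rule is not shown.

```python
from collections import defaultdict


def apply_rule(url, alpha, beta):
    """Return every URL obtained by replacing one occurrence of alpha with beta."""
    if not alpha:
        raise ValueError("alpha must be a non-empty substring")
    results = []
    start = url.find(alpha)
    while start != -1:
        results.append(url[:start] + beta + url[start + len(alpha):])
        start = url.find(alpha, start + 1)
    return results


def rule_support(urls, alpha, beta):
    """Return the set of (u, v) pairs in the log that the rule alpha -> beta connects.

    Only the URL strings are inspected; no page content is fetched.
    """
    seen = set(urls)
    pairs = set()
    for u in seen:
        for v in apply_rule(u, alpha, beta):
            if v != u and v in seen:
                pairs.add((u, v))
    return pairs


def canonical_groups(urls, rules):
    """Group URLs that map to the same string after applying each rule once, in order."""
    groups = defaultdict(list)
    for u in urls:
        canonical = u
        for alpha, beta in rules:
            canonical = canonical.replace(alpha, beta, 1)
        groups[canonical].append(u)
    return dict(groups)


# Hypothetical URL log; a real one would come from crawl or web server logs.
sample_log = [
    "http://www.example.com/a",
    "http://example.com/a",
    "http://www.example.com/b",
    "http://example.com/b",
    "http://example.com/c/index.html",
    "http://example.com/c/",
]

# Hypothetical candidate rules, chosen only for illustration.
candidate_rules = [("http://www.", "http://"), ("index.html", "")]

for alpha, beta in candidate_rules:
    support = rule_support(sample_log, alpha, beta)
    print(f"{alpha!r} -> {beta!r}: {len(support)} supporting pair(s)")
```

A rule with many supporting pairs is only a candidate: two URLs that differ by the substitution may still serve different content, which is why the abstract describes a separate verification step that samples and fetches a few pages.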

Citation

Schonfeld, U., Bar-Yossef, Z., and Keidar, I. 2006. Do not crawl in the DUST: different URLs with similar text. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 1015-1016.
DOI= http://doi.acm.org/10.1145/1135777.1135992

