Robust Web Content Extraction

  • Marek Kowalkiewicz, The Poznan University of Economics, Poland
  • Maria Orlowska, The University of Queensland, Australia
  • Tomasz Kaczmarek, The Poznan University of Economics, Poland
  • Witold Abramowicz, The Poznan University of Economics, Poland

Track: Posters

We present an empirical evaluation and comparison of two methods for extracting content from HTML documents: absolute XPath expressions and relative XPath expressions. We argue that relative XPath expressions, although not widely used, should be preferred over absolute ones when extracting content from human-created Web documents. The robustness evaluation covers four thousand queries executed on several hundred web pages. We show that, when referencing parts of real-world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute ones.
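
The distinction can be illustrated with a minimal sketch. It is not the authors' evaluation setup: the two page snapshots, the element ids, and the expressions `ABSOLUTE` and `RELATIVE` are hypothetical, and "relative" is read here as "anchored on a stable landmark element rather than on position from the document root". The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, so the pages are written as well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical snapshot of a page at the time the expression was authored.
PAGE_V1 = """<html><body>
<div id="nav">menu</div>
<div id="main"><h2>Price</h2><p>42 EUR</p></div>
</body></html>"""

# Hypothetical later snapshot: a banner was inserted before the other blocks.
PAGE_V2 = """<html><body>
<div id="ad">banner</div>
<div id="nav">menu</div>
<div id="main"><h2>Price</h2><p>42 EUR</p></div>
</body></html>"""

# Absolute style: every step is a fixed position counted from the root element.
ABSOLUTE = "./body/div[2]/p[1]"

# Relative style (as assumed here): locate a landmark, then step from it.
RELATIVE = ".//div[@id='main']/p"


def extract(page: str, expression: str):
    """Return the text of the first matching element, or None if nothing matches."""
    root = ET.fromstring(page)
    node = root.find(expression)
    return None if node is None else node.text


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "absolute:", extract(page, ABSOLUTE))
        print(name, "relative:", extract(page, RELATIVE))
```

In this toy case the absolute expression stops matching once the banner shifts the second `div`, while the landmark-based expression still finds the price. Real pages change in many other ways, so this shows only the kind of breakage the robustness comparison is concerned with.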

Citation

Kowalkiewicz, M., Orlowska, M. E., Kaczmarek, T., and Abramowicz, W. 2006. Robust web content extraction. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 887-888.
DOI= http://doi.acm.org/10.1145/1135777.1135928
