Towards Practical Genre Classification of Web Documents

  • George Ferizis, CSIRO ICT Centre, Canberra, Australia
  • Peter Bailey, CSIRO ICT Centre, Canberra, Australia

Track: Posters

Classification of documents by genre is typically done using either linguistic analysis or term frequency based techniques. The former provides better classification accuracy than the latter, but at the cost of two orders of magnitude more computation time. While term frequency analysis requires far fewer computational resources than linguistic analysis, it yields poor classification accuracy when the genres are not sufficiently distinct. A method that removes or approximates the expensive portions of linguistic analysis is presented. The accuracy and computation time of this method are then compared with those of both linguistic analysis and term frequency analysis. The results in this paper show that this method can significantly reduce the computation time of both linguistic analysis and term frequency analysis, while retaining an accuracy that is higher than that of term frequency analysis.

Citation

Ferizis, G. and Bailey, P. 2006. Towards practical genre classification of web documents. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 1013-1014.
DOI= http://doi.acm.org/10.1145/1135777.1135991
