WebKhoj: Indian language IR from Multiple Character Encodings

  • Prasad Pingali, International Institute of Information Technology, Hyderabad, India
  • Jagadeesh Jagarlamudi, Language Technologies Research Centre, IIIT, India
  • Vasudeva Varma, International Institute of Information Technology, Hyderabad, India

Track: Technology for Developing Regions

Today, web search engines provide the easiest way to reach information on the web. Yet more than 95% of Indian language content on the web is not searchable because web pages use multiple character encodings. Most of these encodings are proprietary and hence need some form of standardization to make the content accessible via a search engine. In this paper we present a search engine called WebKhoj, which is capable of searching multi-script and multi-encoded Indian language content on the web. We describe a language-focused crawler and the transcoding processes involved in making Indian language content accessible. Finally, we report some of the experiments that were conducted, along with results on Indian language web content.

Citation

Pingali, P., Jagarlamudi, J., and Varma, V. 2006. WebKhoj: Indian language IR from multiple character encodings. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 801-809.
DOI= http://doi.acm.org/10.1145/1135777.1135898
