Classification of documents by genre is typically done using either linguistic analysis or term frequency based techniques. The former provides better classification accuracy than the latter, but at the cost of two orders of magnitude more computation time. While term frequency analysis requires far fewer computational resources than linguistic analysis, it returns poor classification accuracy when the genres are not sufficiently distinct. A method that removes or approximates the expensive portions of linguistic analysis is presented. The accuracy and computation time of this method are then compared with those of both linguistic analysis and term frequency analysis. The results in this paper show that this method can significantly reduce the computation time of both linguistic analysis and term frequency analysis, while retaining an accuracy that is higher than that of term frequency analysis.
I.2.7 Artificial Intelligence: Natural Language Processing [Text Analysis]
I.5.2 Pattern Recognition: Design Methodology [Classifier design and evaluation]
H.3.1 Information Storage and Retrieval: Content Analysis and Indexing
Design, Experimentation, Performance
Genre classification, term frequency, linguistic
Queries submitted to search engines rarely contain information about the desired document genre. For example, a query for the term Robert Ludlum returns documents that range from biographies and interviews to online shops. Guided navigation through genre classification may increase the relevance of search results, as it lets users associate a genre with the search terms they provide.
Genre classification groups a set of documents into smaller sets according to predefined genre classes. It differs from text classification in that it discriminates between documents by style, whereas text classification discriminates by topic. Text classification techniques typically use the frequency of terms in the documents to discriminate between documents of different topics. Intuitively, documents on the same or similar topics will contain certain terms in common more frequently than documents on other topics. Similar techniques have been tried for classifying genre, with varying degrees of success.
Previous classification techniques have used either term frequency analysis [3,4] or a more robust linguistic approach that involves part-of-speech (POS) tagging. While it has been observed in previous literature [4,5] that these linguistic approaches do provide better accuracy than term frequency approaches, Kessler et al. have also observed that POS tagging requires significant computational time. Table 1 reports, from our experiments, the contribution of POS tagging with the Brill POS tagger to the total time spent classifying a document.
|analysis of variables||13||2.8|
In this paper a technique is presented that is based on the Karlgren and Cutting algorithm. It does away with POS tagging by approximating some POS features that are critical to the accuracy of the classifier. It is shown that this method returns more accurate results than term frequency based techniques and, for a small sacrifice in accuracy, runs two orders of magnitude faster than POS based linguistic analysis. No claim is made in this paper about the novelty of this approach. It was selected to demonstrate the computational overhead that POS tagging introduces to linguistic classification techniques for little accuracy benefit, and to show that even approximations to POS features give results superior to those of term frequency classification techniques.
The present participle frequency was approximated by selecting all words with a length greater than 5 characters and ending with the suffix -ing. Adverb frequency was approximated by selecting all words of length greater than 4 and ending with the string -ly, as well as using a list of the 50 most common adverbs to appear in a training corpus as determined by the Brill POS tagger.
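The two approximations described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the tokenizer, the function names, the normalization by total word count, and the short `COMMON_ADVERBS` set (a stand-in for the 50-adverb list, which is not reproduced here) are all assumptions.

```python
import re

# Hypothetical stand-in for the list of the 50 most common adverbs found by
# the Brill POS tagger in the training corpus; the real list is not given.
COMMON_ADVERBS = frozenset({"very", "also", "often", "now", "never"})

# Assumed tokenizer: runs of ASCII letters only.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def is_participle_like(word):
    """Approximate a present participle: longer than 5 characters, ends in -ing."""
    return len(word) > 5 and word.endswith("ing")


def is_adverb_like(word, adverbs=COMMON_ADVERBS):
    """Approximate an adverb: longer than 4 characters and ends in -ly,
    or appears in the common-adverb list."""
    return (len(word) > 4 and word.endswith("ly")) or word in adverbs


def approximate_pos_features(text):
    """Return approximate participle and adverb frequencies for one document.

    Frequencies are counts divided by the total number of words, which is an
    assumed normalization.
    """
    words = [w.lower() for w in TOKEN_RE.findall(text)]
    total = len(words) or 1  # avoid division by zero on empty documents
    participles = sum(is_participle_like(w) for w in words)
    adverbs = sum(is_adverb_like(w) for w in words)
    return {
        "participle_freq": participles / total,
        "adverb_freq": adverbs / total,
    }
```

Because the rules look only at length and suffix, they admit false positives (for example, the noun "string" passes the participle test) in exchange for avoiding a tagger entirely.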
Genre classification experiments were run over a random sample of documents from four genres: editorial, reportage, scientific and speeches. This test was used to determine whether the reduced linguistic approach has the same difficulty as the term frequency based approach when classifying documents from genres that are not distinct. Both the reportage and the editorial documents were from the same source, to ensure that stylistic differences between sources could not be used to discriminate between the genres. The algorithm in this paper was compared to Karlgren and Cutting's algorithm and to a simple term frequency based algorithm that used the most common words in the training corpus as features.
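The term frequency baseline could be built along the following lines. This is a sketch under stated assumptions: the number of words kept (`k`), the tokenizer, and the relative-frequency weighting are hypothetical choices, since the text only says that the most common words in the training corpus were used as features.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of ASCII letters.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def most_common_terms(training_docs, k):
    """Return the k most frequent words across the training corpus.

    k is a hypothetical parameter; the source does not state its value.
    """
    counts = Counter()
    for doc in training_docs:
        counts.update(tokenize(doc))
    return [term for term, _ in counts.most_common(k)]


def term_frequency_vector(doc, vocabulary):
    """Map a document to the relative frequency of each vocabulary word."""
    tokens = tokenize(doc)
    total = len(tokens) or 1  # avoid division by zero on empty documents
    counts = Counter(tokens)
    return [counts[term] / total for term in vocabulary]
```

The resulting vectors would then be passed to the same classifier as the linguistic features, so that only the feature set differs between the methods being compared.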
All experiments were run on a dual CPU 2 GHz AMD Opteron 64 system running Linux with 8 GB of memory. In all experiments only one processor was used, so the results obtained should be comparable to results that would be obtained on a single CPU system.
Table 2 shows the number of documents that each method could classify per second. Removing the POS variables from the linguistic technique significantly increases the number of documents that can be classified each second. The reduced linguistic approach also has a better runtime than the term frequency approach; however, because the two runtimes are fairly similar, this difference may be an artifact of the implementation.
|Method|Documents classified per second|
|POS based linguistic|1.6|
|Reduced linguistic with adverbs|238|
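As a quick check of the "two orders of magnitude" claim, the throughput figures for the POS based and reduced linguistic methods in Table 2 can be compared directly. The variable names are illustrative only.

```python
import math

# Throughput figures taken from Table 2 (documents classified per second).
pos_docs_per_sec = 1.6      # POS based linguistic
reduced_docs_per_sec = 238  # Reduced linguistic with adverbs

# Ratio of the two throughputs and its base-10 logarithm.
speedup = reduced_docs_per_sec / pos_docs_per_sec
orders_of_magnitude = math.log10(speedup)

# Equivalent time per document, in seconds, for each method.
pos_seconds_per_doc = 1 / pos_docs_per_sec
reduced_seconds_per_doc = 1 / reduced_docs_per_sec

print(f"speedup: {speedup:.2f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

The ratio is 238 / 1.6 = 148.75, a little over two orders of magnitude, which is consistent with the speed-up stated earlier in the paper.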
A comparison of the C4.5 decision trees produced by the reduced linguistic approach and the POS approach shows that they are fairly similar. Any differences in accuracy can be attributed to the inaccuracy of the adverb approximation, or to the omission of the noun linguistic feature.
The experiments also show that it may be possible to produce a genre classification algorithm that is accurate and efficient enough to be applied to large collections of documents. This has many potential applications, both on the Web at large and within enterprises.
Future work may look at increasing the accuracy of classification by using techniques to rapidly detect nouns in a document and techniques to improve the detection of adverbs, at the cost of some computation time. Further work is also planned to measure how hypertext information can improve the classification accuracy for web collections.