On Document Representation and Term Weights in Text Classification

Author(s): Ying Liu (The Hong Kong Polytechnic University Hong Kong SAR, China)
Copyright: 2009
Pages: 22
Source title: Handbook of Research on Text and Web Mining Technologies
Source Author(s)/Editor(s): Min Song (New Jersey Institute of Technology, USA) and Yi-Fang Brook Wu (New Jersey Institute of Technology, USA)
DOI: 10.4018/978-1-59904-990-8.ch001

Keywords: Data Mining / Data Mining and Databases / Information Science Reference / Library & Information Science

Abstract

In automated text classification, a bag-of-words representation followed by tfidf weighting is the most popular approach for converting textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the sentence level. The salient semantic information is identified using a frequent word sequence method. In contrast to the classic tfidf weighting scheme, we propose a probability-based term weighting scheme that directly reflects a term’s strength in representing a specific category. An experimental study based on the semantically enriched document representation and the newly proposed probability-based term weighting scheme shows a significant improvement in F-score over the classic approach, i.e., bag-of-words plus tfidf. This result encourages us to further investigate applying the semantically enriched document representation to a wide range of text-based mining tasks.
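For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a bag-of-words representation with tfidf weighting. It assumes the common textbook variant (raw term frequency multiplied by log(N/df)); the abstract does not state which tfidf variant the chapter uses, and the function name `tfidf_vectors` and the toy documents are hypothetical. The chapter's probability-based scheme is not shown, since its formula is not given here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of tokenized documents.

    Assumed variant: weight = tf * log(N / df), with raw term counts.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # bag-of-words counts; word order is discarded
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "the engine failed during the test".split(),
    "the test passed".split(),
    "engine design review".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one numeric vector over the shared vocabulary. A term that occurs in every document would receive weight zero under this variant, and the weight does not depend on category labels, which is the limitation the abstract's category-aware weighting is meant to address.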
