Semi-Structured Document Classification
Author(s): Ludovic Denoyer (University of Paris VI, France) and Patrick Gallinari (University of Paris VI, France)
Copyright: 2005
Pages: 7
Source title: Encyclopedia of Data Warehousing and Mining
Source Author(s)/Editor(s): John Wang (Montclair State University, USA)
DOI: 10.4018/978-1-59140-557-3.ch191
Abstract
Document classification has developed over the last 10 years, using techniques that originated in the pattern recognition and machine-learning communities. All of these methods operate on flat text representations, in which word occurrences are considered independent. The recent paper by Sebastiani (2002) gives a very good survey of textual document classification. With the development of structured textual and multimedia documents, and with the increasing importance of structured document formats such as XML, the nature of documents is changing. Structured documents usually have a much richer representation than flat ones: they have a logical structure and are often composed of heterogeneous information sources (e.g., text, image, video, metadata). Another major change with structured documents is the possibility of accessing individual document elements or fragments. Developing classifiers for structured content is a new challenge for the machine-learning and information retrieval (IR) communities. A classifier for structured documents should be able to use the different content information sources present in an XML document and to classify both full documents and document parts. It should adapt easily to a variety of sources (e.g., different document type definitions), and it should scale to large document collections.
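The contrast between a flat representation and a structured one can be illustrated with a minimal Python sketch. It is not the method described in the chapter. The sample document, its tag names, and the helper functions `flat_bag_of_words` and `structured_bags` are all hypothetical, chosen only to show what information a flat model discards and what element-level access preserves.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; the tag names are illustrative only.
SAMPLE_XML = """<article>
  <title>Structured document classification</title>
  <abstract>Classification of XML documents</abstract>
  <body>Flat text models ignore document structure</body>
</article>"""


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def flat_bag_of_words(root):
    """Flat view: pool every word in the document, discarding structure."""
    return Counter(tokenize(" ".join(root.itertext())))


def structured_bags(root):
    """Structured view: one word-count bag per element, keyed by tag path.

    Only each element's leading text is counted; tail text is ignored,
    which is sufficient for this sample.
    """
    bags = {}

    def visit(node, path):
        here = f"{path}/{node.tag}"
        words = tokenize(node.text or "")
        if words:
            bags.setdefault(here, Counter()).update(words)
        for child in node:
            visit(child, here)

    visit(root, "")
    return bags


root = ET.fromstring(SAMPLE_XML)
flat = flat_bag_of_words(root)
parts = structured_bags(root)

print(flat.most_common(3))
for path, bag in parts.items():
    print(path, dict(bag))
```

In the flat view, a word from the title and the same word from the body are indistinguishable. The per-element bags keep them apart, so a classifier could in principle weight each element differently or label a single fragment instead of the whole document.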