IRMA-International.org: Creator of Knowledge
Information Resources Management Association
Advancing the Concepts & Practices of Information Resources Management in Modern Organizations

A Data Mining Driven Approach for Web Classification and Filtering Based on Multimodal Content Analysis

Author(s): Mohamed Hammami (Faculté des Sciences de Sfax, Tunisia), Youssef Chahir (Université de Caen, France), and Liming Chen (Ecole Centrale de Lyon, France)
Copyright: 2007
Pages: 35
Source title: Business Data Communications and Networking: A Research Perspective
Source Author(s)/Editor(s): Jairo Gutierrez (University of Auckland, NZ)
DOI: 10.4018/978-1-59904-274-9.ch002

Abstract

The ever-growing Web has brought with it a proliferation of objectionable content, such as sex, violence, and racism, and efficient tools are needed for classifying and filtering it. In this chapter, we investigate this problem through WebGuard, our automatic machine-learning-based system for classifying and filtering pornographic Web sites. Because the Internet is increasingly visual and multimedia, as pornographic Web sites exemplify, we focus on combining skin color-related visual content analysis with textual and structural content analysis to improve pornographic Web site filtering. Most commercial filtering products rely mainly on textual content analysis, such as detecting indicative keywords or checking manually collected black lists; the originality of our work lies in adding structural and visual content analysis to the classical textual analysis, together with several major data mining techniques for learning and classifying. On a test bed of 400 Web sites, comprising 200 adult sites and 200 nonpornographic ones, WebGuard achieved a classification accuracy of 96.1% using only textual and structural content analysis, and 97.4% when skin color-related visual content analysis was added. Further experiments on a black list of 12,311 adult Web sites, manually collected and classified by the French Ministry of Education, showed that WebGuard achieved a classification accuracy of 87.82% using only textual and structural content analysis, and 95.62% when visual content analysis was added. The basic framework of WebGuard can be applied to other Web site categorization problems that combine, as most do today, textual and visual content.
