Web Mining in Thematic Search Engines

Author(s): Massimiliano Caramia (Istituto per le Applicazioni del Calcolo IAC-CNR, Italy) and Giovanni Felici (Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy)
Copyright: 2005
Pages: 5
Source title: Encyclopedia of Data Warehousing and Mining
Source Author(s)/Editor(s): John Wang (Montclair State University, USA)
DOI: 10.4018/978-1-59140-557-3.ch226

Keywords: Data Mining and Databases / Data Warehousing / Information Science Reference / Library & Information Science

Abstract

The recent improvements in search engine technologies have made available to Internet users an enormous amount of knowledge that can be accessed in many different ways. The most popular search engines now provide search facilities for databases containing billions of Web pages, where queries are executed instantly. The focus is switching from quantity (maintaining and indexing large databases of Web pages and quickly selecting pages matching some criterion) to quality (identifying pages of high quality for the user). This trend is motivated by the natural evolution of Internet users, who are now more selective in their choice of search tool and may be willing to pay the price of providing extra feedback to the system and of waiting longer for their queries to be better matched. In this framework, several authors have considered the use of data-mining and optimization techniques, which are often referred to as Web mining (for a recent bibliography on this topic, see, e.g., Getoor, Senator, Domingos, & Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, & Masand, 2002). Here, we describe a method for improving standard search results in a thematic search engine, where the documents and the pages made available are restricted to a finite number of topics, and the users are considered to belong to a finite number of user profiles. The method uses clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then we construct a number of small and potentially good subsets of pages, extracting from each cluster the pages with the highest scores. Operating on these subsets with a genetic algorithm, we identify a subset with a good overall score and a high internal dissimilarity. This provides the user with a few nonduplicated pages that better represent the structure of the initial set of pages.
Because pages are seen by the algorithms as vectors of fixed dimension, the role of the context- or profile-based vectorization is central and specific to the thematic approach of this method.
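
The pipeline summarized in the abstract (cluster the query results, take the best-scoring pages of each cluster, then search for a small subset that balances score and dissimilarity) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the chapter: the use of k-means with Euclidean distance, the additive fitness (sum of scores plus a weighted mean pairwise distance), the crossover and mutation operators, and all names and parameters (select_pages, per_cluster, weight, and so on).

```python
import math
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns one cluster label per vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every page vector to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # Move each center to the mean of its members (if it has any).
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def fitness(subset, scores, vectors, weight):
    """Assumed objective: total score plus weighted mean pairwise distance."""
    total_score = sum(scores[i] for i in subset)
    pairs = [(i, j) for a, i in enumerate(subset) for j in subset[a + 1:]]
    diversity = sum(math.dist(vectors[i], vectors[j]) for i, j in pairs)
    return total_score + weight * diversity / max(len(pairs), 1)


def select_pages(vectors, scores, k=3, per_cluster=2, m=3, weight=1.0,
                 generations=50, pop_size=20, seed=0):
    """Return a tuple of page indices chosen by a small genetic search."""
    rng = random.Random(seed)
    labels = kmeans(vectors, k, seed=seed)

    # Candidate pool: the top-scoring pages of each cluster.
    pool = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: scores[i], reverse=True)
        pool.extend(members[:per_cluster])
    m = min(m, len(pool))

    def score_of(subset):
        return fitness(subset, scores, vectors, weight)

    # Each individual is a sorted tuple of m distinct page indices.
    population = [tuple(sorted(rng.sample(pool, m))) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score_of, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw m pages from the union of both parents.
            genes = sorted(set(a) | set(b))
            child = set(rng.sample(genes, m))
            # Mutation: occasionally swap one page for one outside the child.
            outside = [i for i in pool if i not in child]
            if outside and rng.random() < 0.3:
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=score_of)
```

Calling select_pages(vectors, scores) with one fixed-dimension vector and one relevance score per page returns the indices of the chosen pages. The weight parameter is a hypothetical knob: larger values favor mutually dissimilar pages over individually high-scoring ones, which is one simple way to discourage near-duplicates in the final list.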
