Next Generation Search Engines: Advanced Models for Information Retrieval

Author(s)/Editor(s): Christophe Jouis (Université Paris III, France and LIP6-Université Pierre et Marie Curie, France), Ismaïl Biskri (Université du Québec à Trois-Rivières, Canada), Jean-Gabriel Ganascia (LIP6 and CNRS-Université Pierre et Marie Curie, France), and Magali Roux (LIP6 and CNRS-Université Pierre et Marie Curie, France)
Copyright: ©2012
DOI: 10.4018/978-1-4666-0330-1
ISBN13: 9781466603301
ISBN10: 1466603305
EISBN13: 9781466603318


Description

Recent technological progress in computer science, Web technologies, and the constantly evolving information available on the Internet have drastically changed the landscape of search and access to information. Current search engines employ advanced techniques involving machine learning, social networks, and semantic analysis.

Next Generation Search Engines: Advanced Models for Information Retrieval is intended for scientists and decision-makers who wish to gain a working knowledge of search in order to evaluate available solutions and to dialogue with software and data providers. The book aims to give readers a better idea of the new trends in applied research.



Preface

NEEDS AND REQUIREMENTS FOR INFORMATION RETRIEVAL

Scientific and economic organizations are confronted with handling an abundance of strategic information in their domains of activity. One main challenge is to be able to find the right information quickly and accurately. In order to do so, organizations must master information access: getting relevant query results that are organized, sorted, and actionable.

As noted by Mukhopadhyay and Mukhopadhyay (2004), almost everyone agrees that, in the current state of the art of Internet search engine technology, extracting information from the Web is an art in itself. Almost all commercial search engines use classical keyword-based methods for information retrieval (IR): they try to match user-specified patterns (i.e., queries) against the texts of all documents in their database and then return the documents that contain terms matching the query. Such methods are quite effective for well-controlled collections, such as bibliographic CD-ROMs or hand-crafted scientific information repositories. Unfortunately, the organization of the Internet has not been rationally supervised; it has evolved spontaneously and therefore cannot be treated as a well-controlled collection. It contains a great deal of garbage and redundant information and, perhaps more importantly, it does not rest on any underlying semantic structure intended to facilitate navigation.
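
To make the keyword-based approach concrete, here is a minimal Python sketch (not taken from the book; the collection and query are invented for illustration). It builds an inverted index and returns the documents containing every query term:

    from collections import defaultdict

    # Toy collection standing in for a search engine's database (invented data).
    docs = {
        1: "water sources in arid regions",
        2: "capitals of european countries",
        3: "groundwater and surface water sources",
    }

    # Inverted index: term -> set of ids of documents containing that term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def keyword_search(query):
        """Return ids of documents containing every query term (boolean AND)."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(index.get(terms[0], set()))
        for term in terms[1:]:
            result &= index.get(term, set())
        return result

    print(keyword_search("water sources"))  # -> {1, 3}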

In addition, some current issues result from poorly constructed queries. The queries usually submitted to search engines are often too general (such as “water sources” or “capitals”), which produces millions of returned documents. The results that interest the user are probably among them, but they cannot be distinguished from the mass or brought to the user's attention. One hundred documents is generally regarded as the maximum amount of information a user can usefully examine in such situations.

On the other hand, some documents cannot be retrieved because the specified pattern does not match exactly. This can be caused by inflection in some languages, or by the confusion introduced by synonyms and complex idiomatic structures (e.g., the English word “mike”, often cited as an example, can be a male name or a shortened form of the noun “microphone”). Most search engines also have very poor user interfaces. Computer-aided query construction is very rare, and the presentation of search results concentrates mostly on individual documents without providing any general overview of the retrieved data, which is crucial when the number of returned documents is huge. A last group of problems stems from the nature of the information stored on the Internet. Search tools must deal not only with hypertext documents (in the form of WWW pages) but also with text repositories (message archives, e-books, etc.), FTP and Usenet servers, and many sources of non-textual information such as audio, video, and interactive content.
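
The inflection problem can be illustrated in a few lines of Python. The normalization below is a deliberately crude, hypothetical suffix-stripper, not a real stemmer; it only shows why engines normalize terms before matching:

    # Exact matching misses inflected forms: the query term "source" does not
    # match a document that only contains "sources".
    doc_terms = {"water", "sources"}
    print("source" in doc_terms)  # -> False

    # A toy normalization step (crude suffix stripping, not a real stemmer)
    # maps variant forms to a shared root before indexing and querying.
    def normalize(term):
        for suffix in ("ing", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    normalized_terms = {normalize(t) for t in doc_terms}
    print(normalize("source") in normalized_terms)  # -> True: variants match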

Recent technological progress in computer science, Web technologies, and the constantly evolving information available on the Internet have drastically changed the landscape of search and access to information. Web search has evolved significantly in recent years. In the beginning, web search engines such as Google and Yahoo! provided search only over text documents. Aggregated search was one of the first steps beyond text search, and it marked the beginning of a new era for information seeking and retrieval. These days, new web search engines support aggregated search over a number of verticals, and blend different types of documents (e.g., images, videos) into their search results. New search engines employ advanced techniques involving machine learning, computational linguistics and psychology, user interaction and modeling, information visualization, Web engineering, artificial intelligence, distributed systems, social networks, statistical analysis, semantic analysis, and the analysis of query sessions.
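
As a rough illustration of aggregated search (a sketch over invented data, not a description of any particular engine), the following Python merges ranked results from several verticals into one blended list:

    # Toy sketch of aggregated ("blended") search: merge ranked results from
    # several verticals (web, images, video) into a single result list.
    # Vertical names, items, and scores are all invented for illustration.
    verticals = {
        "web":    [("page1", 0.95), ("page2", 0.80)],
        "images": [("img1", 0.90)],
        "video":  [("vid1", 0.85), ("vid2", 0.60)],
    }

    blended = []
    for vertical, results in verticals.items():
        for item, score in results:
            blended.append((score, vertical, item))

    # Real systems must first calibrate scores across verticals; here we
    # assume they are directly comparable and simply sort, best first.
    for score, vertical, item in sorted(blended, reverse=True):
        print(f"{score:.2f} [{vertical}] {item}")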

Documents no longer exist on their own; they are connected to other documents, they are associated with users and their position in a social network, and they can be mapped onto a variety of ontologies. Similarly, retrieval tasks have become more interactive and are solidly embedded in a user's geospatial, social, and historical context. It is conjectured that new breakthroughs in information retrieval will not come from smarter algorithms that better exploit existing information sources, but from new retrieval algorithms that can intelligently use and combine new sources of contextual metadata.

With the rapid growth of web-based applications such as search engines, Facebook, and Twitter, the development of effective, personalized information retrieval techniques and user interfaces is essential. The amount of shared information and the number of social networks have also grown considerably, calling for metadata from new sources of information such as Wikipedia and the Open Directory Project (ODP). These metadata must provide classification information for a wide range of topics, as well as for social networking sites like Twitter and Facebook, each of which contributes additional preferences, tagging information, and social context. Given the explosion of social networks and other metadata sources, it is an opportune time to identify ways to exploit such metadata in IR tasks such as user modeling, query understanding, and personalization, to name a few. Although the use of traditional metadata such as HTML text, web page titles, and anchor text is fairly well understood, the use of category information, user behavior data, and geographical information is only beginning to be studied.

OBJECTIVES OF THE BOOK

The main goal of this book is to transfer new research results from the fields of advanced computer science and information science to the design of new search engines. Readers will gain a better idea of the new trends in applied research. Obtaining relevant, organized, sorted, and actionable answers from a search is becoming a daily need for enterprises and organizations and, increasingly, for everyone. This does not mean accessing structured information as in standard databases, nor searching strictly by a combination of keywords; it goes far beyond that. Whatever its modality, the information sought should be identified by the topics it contains, that is to say, by its textual, audio, video, or graphical content. This is not a new issue. However, recent technological advances have completely changed the techniques in use. New Web technologies, the emergence of intranet systems, and the abundance of information on the Internet have created the need for efficient search and information access tools.

TARGET AUDIENCE

This book is intended for scientists and decision-makers who wish to gain a working knowledge of search in order to evaluate available solutions and to dialogue with software and data providers. It also targets intranet and Web server designers, developers, and administrators who wish to understand how to integrate search technology into their applications according to their needs. The book is further designed for designers, developers, and administrators of databases, groupware applications, and electronic document management (EDM) systems, as well as directors of libraries or documentation centers who seek a deeper understanding of the tools they use and of how to set up new information systems. Lastly, it is aimed at all professionals in technology or competitive intelligence and, more generally, at specialists of the information market.

A BRIEF OVERVIEW OF THE ORGANIZATION OF THE BOOK

The book is divided into four sections.

Section 1 is “Indexation”. The goal of automatic indexing is to establish, for a set of documents, an index that facilitates future access to the documents and to their content. Usually, an index is composed of a list of descriptors, each of which is associated with a list of documents, and/or parts of documents, to which it refers. In addition, these references may be weighted. When answering users' queries, the system looks for the documents whose index entries are as close as possible to the demand. As a consequence, indexation can be seen as a required preliminary to intelligent information retrieval, since it pre-structures textual data according to topic, domain, keyword, or center of interest.
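
The structure just described can be sketched in a few lines of Python: a mapping from each descriptor to weighted document references. The weight used here (raw term frequency) is one simple choice assumed for illustration; tf-idf and other schemes are common alternatives.

    from collections import Counter, defaultdict

    # Hypothetical mini-collection; each document is plain text.
    documents = {
        "doc1": "search engines index documents for fast retrieval",
        "doc2": "an index maps descriptors to documents and the index weights them",
    }

    # index: descriptor -> {document id: weight}; the weight here is a raw
    # term frequency, computed per document.
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for descriptor, freq in Counter(text.lower().split()).items():
            index[descriptor][doc_id] = freq

    def lookup(descriptor):
        """Return the weighted references for a descriptor, heaviest first."""
        refs = index.get(descriptor.lower(), {})
        return sorted(refs.items(), key=lambda ref: ref[1], reverse=True)

    print(lookup("index"))  # -> [('doc2', 2), ('doc1', 1)]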

Section 2 is “Data Mining for Information Retrieval”. Data mining (i.e., knowledge discovery in databases) is the process of automatically extracting meaningful, useful, previously unknown, and ultimately comprehensible patterns from large data sets. It is a relatively young, interdisciplinary field that combines methods from statistics and artificial intelligence with database management. With the considerable increase in processing power, storage capacity, and inter-connectivity of computer technology, in particular with grid computing, data mining is now seen by modern business as an increasingly important means of transforming unprecedented quantities of digital data into new knowledge that provides a significant competitive advantage; it is now a large part of what is referred to as business intelligence strategy. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The growing consensus that data mining can bring real added value has led to an explosion in demand for novel data mining technologies.

Section 3 is “Interface”. The term “interface” refers to the part of the search engine in which the user (1) formulates a request and (2) reads the results. The interface is examined from four perspectives: Human-Centered Web Search, Personalization, Question/Answering, and Mobile Search Engines. “Human-centered Web search” is understood as the study of how Web search engines help people find the information they are seeking. “Personalization” takes keywords from users as an expression of their information need, but also uses additional information about them (such as their preferences, community, location, or history) to help determine the relevance of pages. “Question/answering” addresses the problem of finding answers to questions posed in natural language: given such a question, the task is to find one or more concise answers in the form of sentences or phrases. “Mobile search engines” may be defined as the combination, within a single framework, of search technologies and knowledge about the user's mobile context, in order to provide the most appropriate answer to the user's information needs.
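
To illustrate the personalization idea, here is a hypothetical Python sketch; the scoring formula, profile, and candidate data are all invented. It re-ranks candidate results by mixing the engine's base relevance score with the overlap between each page's topics and the user's profile:

    # Hypothetical personalized re-ranking: combine a base relevance score
    # with a personal signal derived from the user's preferences/history.
    user_profile = {"hiking", "geology", "travel"}

    candidates = [
        {"url": "a.example", "score": 0.9, "topics": {"finance"}},
        {"url": "b.example", "score": 0.7, "topics": {"geology", "travel"}},
        {"url": "c.example", "score": 0.6, "topics": {"hiking"}},
    ]

    ALPHA = 0.6  # weight of the base score vs. the personal signal (a choice)

    def personalized_score(page):
        """Blend base relevance with topic overlap against the profile."""
        overlap = len(page["topics"] & user_profile) / max(len(page["topics"]), 1)
        return ALPHA * page["score"] + (1 - ALPHA) * overlap

    # The highest base-scored page (a.example) drops below the two pages
    # that match the user's interests.
    for page in sorted(candidates, key=personalized_score, reverse=True):
        print(page["url"], round(personalized_score(page), 2))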

Finally, Section 4 is “Evaluation”. Evaluation means two things: (1) tracing users' behaviors, with special attention to the concept of “information practice” and to related concepts such as “use”, “activity”, and “behavior”, which are widely used in the literature but not always strictly defined, the aim being to place users and their needs at the center of the design process; and (2) evaluating next generation search engines against four main criteria for improving the quality of search results: index quality, quality of the results, quality of search features, and search engine usability.

Christophe Jouis, University Paris Sorbonne Nouvelle and LIP6 (UPMC & CNRS), France
Ismaïl Biskri, University of Quebec at Trois-Rivières, Canada
Jean-Gabriel Ganascia, LIP6, (UPMC & CNRS), France
Magali Roux, INIST and LIP6, (UPMC & CNRS), France

REFERENCE

Mukhopadhyay, B., & Mukhopadhyay, S. (2004, February 11-13). Data mining techniques for information retrieval. In Proceedings of the 2nd International Conference of the Convention on Automation of Libraries in Education and Research Institution, New Delhi, India (p. 506).

Reviews and Testimonials

"This book is intended for scientists and decision-makers who wish to gain working knowledge of searches in order to evaluate available solutions and to dialogue with software and data providers. It also targets intranet or Web server designers, developers and administrators who wish to understand how to integrate search technology into their applications according to their needs. This book is further designed for designers, developers and administrators of databases, groupware applications and document management systems (EDM), as well as directors of libraries or documentation centers who seek a deeper understanding of the tools they use, and how to set up new information systems. Lastly, this book is aimed at all professionals in technology or competitive intelligence and, more generally, the specialists of the information market."

– Christophe Jouis, University Paris Sorbonne Nouvelle and LIP6 (UPMC & CNRS), France; Ismaïl Biskri, University of Quebec at Trois Rivieres, Canada; Jean-Gabriel Ganascia, LIP6, (UPMC & CNRS), France; and Magali Roux, INIST and LIP6, (UPMC &

This collection of 20 essays explores next-generation search engines, covering topics such as indexing, metadata, semantic models and search-engine interfaces. The essays offer a truly international perspective on developments in next-generation search engines. This book is comprehensively researched and detailed and is recommended for those involved in the development and implementation of information-retrieval models. It is also likely to be of interest to those wishing to learn more about next-generation search engines and the current trends in information-retrieval models.

– The Australian Library Journal, Vol. 62, No. 2 - Anne Sara, Sydney

This work is abundant in innovative ideas, new concepts, and real-world practices. The realization or implementation of the perspectives and models included in the book may result in real advances and substantive changes in information markets. Advanced students, academics and researchers, as well as knowledge workers, information and computer scientists and web designers, can benefit from reading this work.

– Alireza Isfandyari-Moghaddam, Islamic Azad University, Hamedan Branch, Online Information Review, Vol. 37, No. 3

Author's/Editor's Biography

Christophe Jouis (Ed.)
Christophe Jouis is assistant professor at the University Paris Sorbonne Nouvelle, France. He received a Ph.D. in Applied Mathematics (option: Science, Logic, Linguistics) from the École des Hautes Études en Sciences Sociales (EHESS) and the Centre d'Analyse et de Mathématiques Sociales (CAMS). From 2000 to 2004 he was associate professor in the Department of Computer Science at the University of Quebec at Trois-Rivières (Canada), under the direction of Professor Ismaïl Biskri. In 2005, he joined the LIP6 (Laboratoire d'Informatique de Paris 6), affiliated with the University Pierre et Marie Curie (UPMC) and the CNRS (France). Within the LIP6, he is currently a member of the ACASA (“Cognitive Agents and Automated Symbolic Learning”) research team, led by Professor Jean-Gabriel Ganascia. His research interests are in natural language processing (NLP), cognitive sciences, ontology, typicality, data mining, and information retrieval.

Ismail Biskri (Ed.)
Ismaïl Biskri is full professor in computational linguistics and artificial intelligence in the computer science department of the University of Quebec at Trois-Rivières. He is also associate professor in the Computer Science Department of the University of Quebec at Montreal, and a researcher at the LAMIA Laboratory. His research concerns fundamental aspects of the syntactic and functional-semantic analysis of natural languages, using models from Categorial Grammars and combinatory logic. He also works on specific issues in text mining, information retrieval, and terminology. His research is funded by the Canadian granting agencies FQRSC, SSHRC, and NSERC.

Jean-Gabriel Ganascia (Ed.)
Jean-Gabriel Ganascia is presently Professor of Computer Science at the University Pierre et Marie Curie (Paris VI) and a researcher at its computer science laboratory (LIP6), where he leads the ACASA (“Cognitive Agents and Automated Symbolic Learning”) team. He originally worked on symbolic machine learning and knowledge engineering. His “thèse d'état”, defended in 1987, was pioneering work on the algebraic framework on which association rule extraction techniques are based. Today, his main scientific interests cover different areas of artificial intelligence: scientific discovery, cognitive modeling, data mining, and digital humanities. He has published more than 350 scientific papers in conference proceedings, journals, and books. He was also a program leader in the CNRS executive from 1988 to 1992, before moving on to direct the Cognitive Science Coordinated Research Program and to head the Cognition Sciences Scientific Interest Group from 1993 until 2000.

Magali Roux (Ed.)
Magali Roux is a CNRS Research Director involved in the development and administration of programs and courses in e-Biology. Her research interests span a wide range of domains centered on knowledge organization and data management in medical biology, molecular biology and, recently, systems biology in the context of e-Sciences. After obtaining her Ph.D. in Biochemistry from the University of the Mediterranean in 1979, she started as an assistant professor at the Marseille University Hospital before being offered a post-doctoral position at Harvard University in Prof. J. Strominger's laboratory, where she carried out one of the first bioinformatics analyses performed on DNA data. Since then, she has produced leading contributions in the fields of immunology and cancer research. In the early 2000s, she moved from experimental to digital biology to promote interoperability, data sharing, and re-use. Dr. Roux serves on numerous study panels and is currently active in a number of scientific societies.

