IRMA-International.org: Creator of Knowledge
Information Resources Management Association
Advancing the Concepts & Practices of Information Resources Management in Modern Organizations

Successes and New Directions in Data Mining

Author(s)/Editor(s): Pascal Poncelet (Ecole des Mines d'Ales, France), Florent Masseglia (Project AxIS-INRIA, France), and Maguelonne Teisseire (Universite Montpellier, France)
Copyright: ©2008
DOI: 10.4018/978-1-59904-645-7
ISBN13: 9781599046457
ISBN10: 1599046458
EISBN13: 9781599046471



Description

Pattern mining has become a very active research area, and efficient techniques have been widely applied to problems in industry, government, and science. Motivated by real applications, the problem has grown from its initial definition, finding itemsets, to address increasingly complex patterns.

Successes and New Directions in Data Mining addresses existing solutions for data mining, with particular emphasis on potential real-world applications. Capturing defining research on topics such as fuzzy set theory, clustering algorithms, semi-supervised clustering, modeling and managing data mining patterns, and sequence motif mining, this book is an indispensable resource for library collections.



Preface

Since its definition a decade ago, the problem of mining patterns has become a very active research area, and efficient techniques have been widely applied to problems in industry, government, and science. Motivated by real applications, the problem has grown from its initial definition, finding itemsets, to address increasingly complex patterns. For instance, new approaches need to be defined for mining graphs or trees in applications dealing with complex data such as XML documents, correlated alarms, or biological networks. As the volume of digital data keeps growing, the problem of mining such patterns efficiently attracts more and more attention.

One of the first areas to deal with large collections of digital data is probably text mining. It aims at analyzing large collections of unstructured documents with the purpose of extracting interesting, relevant, and nontrivial knowledge. However, patterns have become more and more complex, leading to open problems. For instance, in the context of biological networks, we have to deal with common patterns of cellular interactions, the organization of functional modules, relationships and interactions between sequences, and patterns of gene regulation. In the same way, multi-dimensional pattern mining has been defined, and many open questions remain regarding the size of the search space and effectiveness. If we consider social networks on the Internet, we would like to better understand and measure relationships and flows between people, groups, and organizations. Data from many real-world applications are no longer appropriately handled by traditional static databases, since the data arrive sequentially in the form of continuous rapid streams. Since data streams are continuous, high-speed, and unbounded, it is impossible to mine patterns with traditional algorithms that require multiple scans, and new approaches have to be proposed.

To aid decision making efficiently and effectively, constraints are becoming more and more essential in many applications. Indeed, unconstrained mining can produce such a large number of patterns that it may be intractable in some domains. Furthermore, the growing consensus that the end user is no longer interested in the set of all patterns satisfying the selection criteria has led to demand for novel strategies for extracting useful, even approximate, knowledge.

The goal of this book is to provide theoretical frameworks and present challenges and their possible solutions concerning knowledge extraction. It aims at providing an overall view of the recent existing solutions for data mining with a particular emphasis on the potential real-world applications. It is composed of thirteen chapters.

The first chapter, by Eyke Hüllermeier, explains “Why Fuzzy Set Theory is Useful in Data Mining”. It is important to see how much fuzzy set theory can help solve problems related to data mining when dealing with real applications, real data, and real needs to understand the extracted knowledge. Indeed, data mining applications have well-known drawbacks, such as the high number of results, the “similar but hidden” knowledge, or a certain amount of variability or noise in the data (a point of critical importance in many practical application fields). In this chapter, the author gives an overview of fuzzy sets and then demonstrates the advantages and robustness of fuzzy data mining. The chapter highlights these advantages in the context of exemplary data mining methods, but also points out some additional complications that can be caused by fuzzy extensions.

Web and XML data are two major fields of application for data mining algorithms today. Web mining is usually a first step towards Web personalization, and XML mining will become a standard since XML data is gaining more and more interest. Both domains share the huge amount of data to analyze and the lack of structure of their sources. The following three chapters provide interesting solutions and cutting-edge algorithms in that context.

In “SeqPAM: A Sequence Clustering Algorithm for Web Personalization”, Pradeep Kumar, Raju S. Bapi and P. Radha Krishna propose SeqPAM, an efficient clustering algorithm for sequential data, and its application to Web personalization. Their proposal is based on pilot experiments comparing the results of PAM, a standard clustering algorithm, with two similarity measures: Cosine and S3M. The goodness of the clusters resulting from both measures was computed using a cluster validation technique based on average Levenshtein distance.
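
As a rough illustration of the validation idea mentioned above, the following minimal Python sketch computes the mean pairwise Levenshtein (edit) distance inside each cluster, where a lower value suggests a tighter cluster. The function names, the choice of pairwise averaging, and the sample page-visit sessions are assumptions made for illustration; the exact procedure used in the chapter is not described here.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # substitute or match
        prev = curr
    return prev[-1]


def average_levenshtein(cluster):
    """Mean pairwise edit distance in one cluster (0.0 if fewer than 2 members)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


# Hypothetical page-visit sessions, grouped into two clusters.
clusters = [
    [["home", "news", "sport"], ["home", "news", "weather"]],
    [["login", "cart", "pay"], ["login", "cart", "pay", "logout"], ["login", "pay"]],
]
scores = [average_levenshtein(c) for c in clusters]
```

Comparing such scores across the clusterings produced with different similarity measures is one plausible way to judge which measure groups similar sequences more tightly.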

XML is a rather verbose representation of semistructured data, which may require huge amounts of storage space. Several summarized representations of XML data have been proposed, which can both provide succinct information and be directly queried. In “Using Mined Patterns for XML Query Answering”, Elena Baralis, Paolo Garza, Elisa Quintarelli and Letizia Tanca focus on compact representations based on the extraction of association rules from XML datasets. In particular, they show how patterns can be exploited to (possibly partially) answer queries, either when fast (and approximate) answers are required, or when the actual dataset is not available (e.g., it is currently unreachable).

The problem of semi-supervised clustering (SSC) has been attracting a lot of attention in the research community. “On the Usage of Structural Information in Constrained Semi-Supervised Clustering of XML Documents” by Eduardo Bezerra, Geraldo Xexéo and Marta Mattoso is a chapter considering the problem of constrained clustering of documents. The authors consider a particular form of information to be clustered: textual documents that present a logical structure represented in XML format. Based on this consideration, they present algorithms that take advantage of XML metadata (structural information), thus improving the quality of the generated clustering models. The authors take as a starting point existing algorithms for semi-supervised clustering of documents and then present a constrained semi-supervised clustering approach for XML documents. They deal with the following main concern: how can a user take advantage of structural information related to a collection of XML documents in order to define constraints to be used in the clustering of these documents?

The next chapter deals with pattern management problems related to data mining. Clusters, frequent itemsets, and association rules are some examples of common data mining patterns. The trajectory of a moving object in a localizer control system or the keyword frequency in a text document represent other examples of patterns. The structure of patterns can be highly heterogeneous. Patterns can be extracted from raw data, but they can also be known by the users and used, for example, to check how well some data source is represented by them. It is also important to determine whether existing patterns, after a certain time, still represent the data source they are associated with. Finally, independently of their type, all patterns should be manipulated and queried through ad hoc languages. In “Modeling and Managing Heterogeneous Patterns: the PSYCHO Experience”, Anna Maddalena and Barbara Catania present a system prototype providing an integrated environment for generating, representing, and manipulating heterogeneous patterns, possibly user-defined. After presenting the logical model and architecture, the authors focus on several examples of its usage concerning common market basket analysis patterns, i.e., association rules and clusters.

Biology is one of the most promising domains. In fact, it has been widely addressed by data mining researchers in the past few years and still has many open problems to offer (and to be defined). The next two chapters deal with sequence motif mining over protein databases such as Swiss-Prot and with the biochemical information resulting from metabolite analysis.

Proteins are biological macromolecules involved in all biochemical functions in the life of the cell, and they are composed of basic units called amino acids. Twenty different types of amino acids exist, all with well-differentiated structural and chemical properties. Protein sequence motifs describe regions of amino acids that have been conserved across several functionally related proteins. These regions may have implications at the structural and functional level of the proteins. Sequence motif mining can bring significant improvements towards a better understanding of the protein sequence-structure-function relation. In “Deterministic Motif Mining in Protein Databases”, P. G. Ferreira and P. J. Azevedo go deeper into the problem by first characterizing two types of extracted patterns and then focusing on deterministic patterns. They show that three measures of interest are suitable for such patterns, and they illustrate through real applications that a better understanding of the sequences under analysis has a wide range of applications. Finally, they describe the well-known existing motif databases around the world.

C. Baumgartner and A. Graber in “Data Mining and Knowledge Discovery in Metabolomics” address chemical fingerprints reflecting metabolic changes related to disease onset and progression (i.e., metabolomic mining or profiling). The biochemical information resulting from metabolite analysis reveals functional endpoints associated with physiological and pathophysiological processes, influenced by both genetic predisposition and environmental factors such as nutrition, exercise, or medication. In recent years, advanced data mining and bioinformatics techniques have been applied to increasingly comprehensive and complex metabolic data sets, with the objective of identifying and verifying robust and generalizable markers that are biochemically interpretable and biologically relevant in the context of the disease. In this chapter, the authors provide information essential to understanding the complexity of data generation, as well as information on data mining principles, specific methods and processes, and biomedical applications.

The exponential growth of multimedia data in consumer as well as scientific applications poses many interesting and task-critical challenges. There are several interrelated issues in the management of such data, including feature extraction, multimedia data relationships or other patterns not explicitly stored in multimedia databases, similarity-based search, scalability to large data sets, and personalized search and retrieval. The two following chapters address multimedia data.

In “Handling Local Patterns in Collaborative Structuring”, I. Mierswa, K. Morik and M. Wurst address the problem of structuring personal media collections by using collaborative and data mining (machine learning) approaches. Personal media collections are usually structured locally in very different ways by different users. The main question in this case is whether data mining techniques can be useful for automatically structuring personal collections by considering local structures. The authors propose a uniform description of learning tasks, which starts with a most general, generic learning task and is then specialized to the known learning tasks, and then address how to solve the new learning task. The use of the proposed approach in a distributed setting is exemplified by its application to collaborative media organization in a peer-to-peer network.

M. Bouet, P. Gançarski, M.-A. Aufaure and O. Boussaïd in “Pattern Mining and Clustering on Image Databases” focus on image data. In an image context, databases are very large since they contain strongly heterogeneous data, often not structured and possibly coming from different sources within different theoretical or applicative domains (pixel values, image descriptors, annotations, trainings, expert or interpreted knowledge, etc.). Besides, when objects are described by a large set of features, many of them are correlated, while others are noisy or irrelevant. Furthermore, analysing and mining these multimedia data to derive potentially useful information is not easy. The authors survey the relevant research related to image data processing and present data warehouse advances that organize large volumes of data linked with images. The rest of the chapter deals with two techniques widely used in data mining: clustering and pattern mining. The authors show how clustering approaches can be applied to image analysis, and they highlight that there is little research dealing with frequent pattern mining on images. They thus introduce a new research direction concerning pattern mining from large collections of images.

In the previous chapter we have seen that, in an image context, we have to deal with very large databases since they contain strongly heterogeneous data. “Semantic Integration and Knowledge Discovery for Environmental Research”, by Z. Chen, A. Gangopadhyay, G. Karabatis, M. McGuire and C. Welty, also addresses very large databases, but in a different context. The urban environment is formed by complex interactions between natural and human systems. Studying the urban environment requires the collection and analysis of very large datasets that have semantic (including spatial and temporal) differences and interdependencies, are collected and managed by multiple organizations, and are stored in varying formats. In this chapter, the authors introduce a new approach to integrate urban environmental data and provide scientists with semantic techniques to navigate and discover patterns in very large environmental datasets.

In the chapter “Visualizing Multi Dimensional Data”, C. García-Osorio and C. Fyfe focus on the visualisation of multi-dimensional data. This chapter is based on the following assertion: finding information within the data is often an extremely complex task, and even if the computer is very good at handling large volumes of data and manipulating such data in an automatic manner, humans are very good at pattern identification, much better indeed than computers. The authors thus focus on visualization techniques for cases where the number of attributes to represent is higher than three. They start with a short description of some taxonomies of visualization methods, and then present their vision of the field. They then explain in detail each class in their classification, emphasizing some of the more significant visualization methods belonging to that class. Finally, they give a list of some of the software tools for data visualization freely available on the Internet.

Intense work in the area of data mining technology and in its applications to several domains has resulted in the development of a large variety of techniques and tools able to automatically and intelligently transform large amounts of data into knowledge relevant to users. However, as with other kinds of useful technologies, the knowledge discovery process can be misused. In “Privacy Preserving Data Mining, Concepts, Techniques and Evaluation Methodologies”, I. Nai Fovino addresses a new challenging problem: how to preserve privacy when applying data mining methods. He proposes to study the privacy preserving problem from the data mining perspective, together with taxonomy criteria that allow a constructive, high-level presentation of the main privacy preserving data mining approaches. He also focuses on a unified evaluation framework.

Many recent real-world applications, such as network traffic monitoring, intrusion detection systems, sensor network data analysis, click stream mining, and dynamic tracing of financial transactions, call for studying a new kind of data. Called stream data, this model is, in fact, a continuous, potentially infinite flow of information, as opposed to the finite, statically stored data sets extensively studied by researchers of the data mining community. H. Abdulsalam, D.B. Skillicorn and P. Martin in the chapter “Mining Data-Streams” focus on three kinds of online data stream mining techniques, namely summarization techniques, prediction techniques, and clustering techniques, and survey the research work in the area. They conclude each section with a comparative analysis of the major work in the area.
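
To make the single-scan constraint concrete, the following minimal Python sketch maintains a constant-memory summary (count, mean, and sample variance) of a numeric stream, reading each item exactly once. It uses Welford's online algorithm, a standard textbook technique chosen here for illustration and not taken from the chapter; the class name, function name, and sample readings are hypothetical.

```python
class RunningSummary:
    """Constant-memory summary of a numeric stream (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new item into the summary without storing it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two items have been seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream):
    """Consume an iterable once, item by item, and return its summary."""
    summary = RunningSummary()
    for x in stream:  # each item is read once and then discarded
        summary.update(x)
    return summary


# Hypothetical sensor readings; a generator stands in for an unbounded source.
readings = (x for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
summary = summarize(readings)
```

Because the summary never stores past items, its memory use stays fixed however long the stream runs, which is the property that multi-scan algorithms lack.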


Reviews and Testimonials

The goal of this book is to provide theoretical frameworks and present challenges and their possible solutions concerning knowledge extraction. It aims at providing an overall view of the recent existing solutions for data mining with a particular emphasis on the potential real-world applications.

– Florent Masseglia, Projet AxIS-INRIA, France

The present coverage documents the successful research endeavors in data mining today. This work is an excellent addition to research collections.

– CHOICE, Vol. 45, No. 09 (May 2008)

The book will be useful as a reference for researchers, practitioners, and students in fields related to data mining and data warehousing.

– Book News Inc. (2008)

The book presents chapters that are not only relevant to the data mining research community but also, in some cases, introductory to new and necessary fields of research, pointing whenever possible to future trends.

– Renato Cordeiro de Amorim, University of London, UK

Author's/Editor's Biography

Pascal Poncelet (Ed.)
Pascal Poncelet is a professor and the head of the data mining research group in the computer science department at the Ecole des Mines d’Alès in France. He is also co-head of the department. Professor Poncelet previously worked as a lecturer (1993-1994) and as an associate professor at the Méditerranée University (1994-1999) and Montpellier University (1999-2001). His research interests can be summarized as advanced data analysis techniques for emerging applications. He is currently interested in various data mining techniques with applications in Web mining and text mining. He has published a large number of research papers in refereed journals, conferences, and workshops, and has been a reviewer for some leading academic journals. He is also co-head of the French CNRS Group “I3” on data mining.

Florent Masseglia (Ed.)
Florent Masseglia is currently a researcher for INRIA (Sophia Antipolis, France). He did research work in the Data Mining Group at the LIRMM (Montpellier, France) (1998-2002) and received a PhD in computer science from Versailles University, France (2002). His research interests include data mining (particularly sequential patterns and applications such as Web usage mining) and databases. He is a member of the steering committees of the French working group on mining complex data and the International Workshop on Multimedia Data. He has co-edited several special issues about mining complex or multimedia data. He also has co-chaired workshops on mining complex data and co-chaired the 6th and 7th editions of the International Workshop on Multimedia Data Mining in conjunction with the KDD conference. He is the author of numerous publications about data mining in journals and conferences and he is a reviewer for international journals.

Maguelonne Teisseire (Ed.)
Maguelonne Teisseire received a PhD in computing science from the Méditerranée University, France (1994). Her research interests focused on behavioral modeling and design. She is currently an assistant professor of computer science and engineering at Montpellier II University and Polytech’Montpellier, France. She is head of the Data Mining Group at the LIRMM Laboratory, Montpellier. Her interests focus on advanced data mining approaches for time-ordered data. In particular, she is interested in text mining and sequential patterns. Her research is part of various projects supported either by the national government (RNTL) or at the regional level. She has published numerous papers in refereed journals and conferences on both behavioral modeling and data mining.

