Using the Text Categorization Framework for Protein Classification

Author(s): Ricco Rakotomalala (University of Lyon, France) and Faouzi Mhamdi (University of Jandouba, Tunisia)
Copyright: 2009
Pages: 13
Source title: Handbook of Research on Text and Web Mining Technologies
Source Author(s)/Editor(s): Min Song (New Jersey Institute of Technology, USA) and Yi-Fang Brook Wu (New Jersey Institute of Technology, USA)
DOI: 10.4018/978-1-59904-990-8.ch008

Keywords: Data Mining / Data Mining and Databases / Information Science Reference / Library & Information Science

Abstract

In this chapter, we address protein classification from primary structure: the goal is to automatically assign protein sequences to their families. The main originality of the approach is that we apply the text categorization framework directly to protein classification, with only very minor modifications. The main steps of the task are clearly identified: we extract features from the unstructured dataset, using fixed-length n-gram descriptors; we select and combine the most relevant ones for the learning phase; and we then select the most promising learning algorithm in order to produce an accurate predictive model. We obtain two main results. First, the approach is credible, giving accurate results with descriptors of length 2 only (2-grams). Second, in our context, where many irrelevant descriptors are automatically generated, we must combine aggressive feature selection algorithms with low-variance classifiers such as SVM (Support Vector Machine).
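
To make the first two steps of the abstract concrete, the following is a minimal Python sketch of 2-gram extraction followed by a filter-style feature ranking. It is an illustration under stated assumptions, not the chapter's implementation: the chi-square score on n-gram presence is a common text-categorization filter chosen here for illustration (the abstract does not name the selection method), and the function names, the toy sequences, and the family labels are all hypothetical. The classification step (for example, a linear SVM trained on the selected columns) is left out.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_counts(sequence, n=2):
    """Count the overlapping n-grams of one protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def build_matrix(sequences, n=2):
    """Return (vocab, rows): every possible n-gram, and one count vector per sequence."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    rows = []
    for seq in sequences:
        counts = ngram_counts(seq, n)
        rows.append([counts.get(gram, 0) for gram in vocab])
    return vocab, rows


def chi2_scores(rows, labels):
    """Chi-square statistic between n-gram presence and family label.

    Illustrative filter only (an assumption, not the chapter's method).
    """
    n = len(rows)
    class_size = Counter(labels)
    scores = []
    for j in range(len(rows[0])):
        # Number of sequences of each family that contain descriptor j.
        present = Counter(lab for row, lab in zip(rows, labels) if row[j] > 0)
        n_present = sum(present.values())
        score = 0.0
        for c in class_size:
            cells = (
                (present[c], n_present),                      # descriptor present
                (class_size[c] - present[c], n - n_present),  # descriptor absent
            )
            for observed, total in cells:
                expected = total * class_size[c] / n
                if expected > 0:
                    score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(vocab, scores, k):
    """Keep the k descriptors with the highest scores (aggressive selection)."""
    ranked = sorted(zip(scores, vocab), key=lambda pair: pair[0], reverse=True)
    return [gram for _, gram in ranked[:k]]


# Hypothetical toy data: two made-up families with two sequences each.
sequences = ["ACACAC", "ACACGG", "GGTTGG", "TTGGTT"]
labels = ["family_a", "family_a", "family_b", "family_b"]

vocab, rows = build_matrix(sequences, n=2)
scores = chi2_scores(rows, labels)
selected = select_top_k(vocab, scores, k=5)
```

With 2-grams over the 20 standard amino acids there are at most 400 descriptors, and most of them score zero on a small sample, which is the situation in which an aggressive filter is useful before training a low-variance classifier.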
