
Prediction for Compound Activity in Large Drug Datasets Using Efficient Machine Learning Approaches

Author(s): Larry J. Layne (University of New Mexico, USA) and Shibin Qiu (University of New Mexico, USA)
Copyright: 2005
Pages: 5
Source title: Managing Modern Organizations Through Information Technology
Source Editor(s): Mehdi Khosrow-Pour (Information Resources Management Association, USA)
DOI: 10.4018/978-1-59140-822-2.ch014

Abstract

Modern drug design requires predicting the activity of a large number of chemical compounds from their descriptors, which are often noisy and high-dimensional. Applying machine learning algorithms successfully under these conditions poses serious challenges for both computational performance and classification quality. For computational efficiency, we implement the proximal support vector machine (PSVM), which depends only on linear operations and can be trained faster than a support vector machine (SVM) that relies on quadratic optimization. For even larger datasets, we use parallel computing to keep training and classification times acceptable. To improve classification quality, we implement and compare the SVM, k-nearest neighbor, decision tree, and naive Bayes classifiers. We measure classification quality using cross-validation accuracy, generalization accuracy, and the false positive and false negative ratios in ROC (receiver operating characteristic) curves. We also conduct feature selection to find the most important features and to gain insight into the nature of the compound descriptors. Features are easy to select using linear SVMs, but the selection may be biased, so we use a nonlinear kernel SVM in the feature selection process to achieve a higher ranking quality. To fully understand the properties of the noisy features in the dataset, we experiment with different numbers of features using the SVM classifier to determine an optimal number of features.
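
The claim that the PSVM depends only on linear operations can be illustrated with a minimal sketch. It assumes the standard linear PSVM formulation of Fung and Mangasarian, in which training reduces to solving one small linear system, and it is not necessarily the implementation used in this work. The function names, the parameter `nu`, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def train_linear_psvm(A, d, nu=1.0):
    """Train a linear proximal SVM (hypothetical helper, standard formulation).

    A  : (m, n) array of descriptors, one compound per row
    d  : (m,) array of labels in {+1, -1}
    nu : trade-off parameter weighting the error term
    Returns the weight vector w and the offset gamma.
    """
    A = np.asarray(A, dtype=float)
    d = np.asarray(d, dtype=float)
    m, n = A.shape
    # E = [A, -e], where e is a column of ones
    E = np.hstack([A, -np.ones((m, 1))])
    # Solve (I / nu + E^T E) z = E^T D e, where D = diag(d), so D e = d.
    # The system is (n + 1) x (n + 1), independent of the number of compounds.
    lhs = np.eye(n + 1) / nu + E.T @ E
    rhs = E.T @ d
    z = np.linalg.solve(lhs, rhs)
    return z[:n], z[n]

def predict_linear_psvm(A, w, gamma):
    """Classify each row x of A by the sign of x . w - gamma."""
    scores = np.asarray(A, dtype=float) @ w - gamma
    return np.where(scores >= 0, 1, -1)
```

No quadratic program is solved here: the only expensive steps are forming the product E^T E and one linear solve, which is why this kind of classifier can be trained quickly on large datasets.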
