Support vector machines (SVMs) have been promising methods for classification
and regression analysis due to their solid mathematical foundations, which
provide two desirable properties: margin maximization and nonlinear
classification using kernels. Despite these properties, SVMs are usually not
chosen for large-scale data mining problems because their training complexity
depends heavily on the size of the data set. Unlike traditional pattern
recognition and machine learning, real-world data mining applications often
involve a huge number of data records that do not fit in main memory, and
multiple scans of the data set are often too expensive.
Our Clustering-Based SVM (CB-SVM) maximizes SVM performance for very large data sets given a limited amount of memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning.
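The idea of summarizing data in a single scan can be illustrated with a small sketch. The Python code below is not the CB-SVM implementation: it is a simplified, flat (non-hierarchical) micro-clustering pass, assuming each micro-cluster keeps only a count, a linear sum, and a sum of squares. The names `MicroCluster`, `summarize_one_pass`, and `training_samples`, the distance `threshold`, and the choice to cluster each class separately are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Statistical summary of a group of points (hypothetical structure)."""
    n: int              # number of points absorbed
    linear_sum: list    # per-dimension sum of the points
    square_sum: float   # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, list(point), sum(x * x for x in point))

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root-mean-square distance of members to the centroid,
        # computed from the summary alone (no stored points needed).
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
        self.square_sum += sum(x * x for x in point)


def summarize_one_pass(records, threshold):
    """Scan (point, label) records once; return micro-clusters per label."""
    clusters = {}
    for point, label in records:
        group = clusters.setdefault(label, [])
        nearest = min(
            group,
            key=lambda mc: math.dist(mc.centroid(), point),
            default=None,
        )
        if nearest is not None and math.dist(nearest.centroid(), point) <= threshold:
            nearest.absorb(point)   # close enough: update the summary
        else:
            group.append(MicroCluster.from_point(point))  # start a new cluster
    return clusters


def training_samples(clusters):
    """Turn each micro-cluster into one (centroid, label, weight) sample."""
    return [
        (mc.centroid(), label, mc.n)
        for label, group in clusters.items()
        for mc in group
    ]
```

In this sketch the centroids, rather than the raw records, would be handed to an SVM trainer, so memory use grows with the number of micro-clusters instead of the number of records. A hierarchical version would additionally organize these summaries in a tree so that coarser or finer samples can be selected.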
H. Yu, J. Yang, J. Han & X. Li, "Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing", Data Mining and Knowledge Discovery, Springer, 11(3): 295-321, 2005. (DAMI'05)
H. Yu, J. Yang & J. Han, "Classifying Large Data Sets Using SVM with Hierarchical Clusters", Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003. (KDD'03 full paper, 13% accepted, received student scholarship award)
DM - Data Mining Lab, Department of Computer Science and Engineering, Pohang University of Science and Technology