EVALUATION OF SIMILARITY MEASURES FOR CATEGORICAL DATA
Abstract
Similarity among objects is a fundamental concept in almost all technical fields, such as information retrieval, data mining, mathematics, and bioinformatics. A similarity measure is a function that computes the degree of similarity between a pair of objects, which can be documents, queries, or features of any database. It helps to rank objects according to their importance in a specific data mining application, and similarity-based applications are numerous. Data mining is used to build a knowledge base from large data repositories for human inference and analysis, and its techniques are common in every technical field where similarity is required. The proper selection of a similarity or distance measure is key to many data mining techniques, such as clustering, classification, and outlier detection. For categorical data, computing similarity is a complex problem. Measures used for continuous data, such as Euclidean measures, are generalized to some extent and can be applied in any continuous data domain. Euclidean measures are also widely applied to categorical data without considering domain knowledge or the nature of categorical data. Because of the complex nature of categorical data, no standard measure comparable to the Euclidean measure is available in the literature. In this paper, we evaluate different categorical measures according to their usage in different data mining applications and techniques. We also propose the chi-fuzzy measure to address the categorical data issue.
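The concern about applying Euclidean measures to categorical data can be illustrated with a minimal sketch. It is not the chi-fuzzy measure proposed in the paper. It contrasts the simple overlap (matching) measure with a Euclidean distance computed on integer codes. The function names, the colour attribute, and the integer coding below are assumptions made only for this illustration.

```python
import math


def overlap_similarity(x, y):
    """Fraction of attributes on which two categorical records agree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    if not x:
        raise ValueError("records must have at least one attribute")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def euclidean_on_codes(x, y, codes):
    """Euclidean distance after mapping each category to an integer code.

    The codes are arbitrary labels, so the resulting distance depends on
    the coding rather than on the data.
    """
    return math.sqrt(sum((codes[a] - codes[b]) ** 2 for a, b in zip(x, y)))


# Hypothetical single-attribute records and an arbitrary integer coding.
codes = {"red": 0, "green": 1, "blue": 2}
red, green, blue = ("red",), ("green",), ("blue",)

# Euclidean distance suggests that red is closer to green than to blue,
# although the categories have no inherent order.
d_red_green = euclidean_on_codes(red, green, codes)
d_red_blue = euclidean_on_codes(red, blue, codes)

# The overlap measure treats every mismatch identically.
s_red_green = overlap_similarity(red, green)
s_red_blue = overlap_similarity(red, blue)
```

Under this coding the Euclidean distance ranks green as nearer to red than blue is, purely as a side effect of the chosen integers, while the overlap measure assigns both pairs the same similarity.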