CHALLENGES IN DETERMINING TERM RELEVANCE FOR TEXT DATA
Abstract
Although text data differs from ordinary non-text datasets in a number of ways, existing algorithms from the machine learning domain have been borrowed for the classification of text data. These algorithms cannot be applied directly to raw text; the text must first be transformed into a suitable form. This transformation introduces further problems for feature selection and classification algorithms. In this paper we highlight the problems introduced by the transformation of text data. We also show how different feature selection algorithms, including bi-normal separation, information gain and ROC, are affected by text data.
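As a rough illustration of the pipeline the abstract describes, the following Python sketch transforms raw text into a bag-of-words representation and then scores terms with information gain and bi-normal separation. It is a minimal sketch, not the method used in this paper: the function names, the whitespace tokenizer, the toy documents and the clipping constant `eps = 0.0005` are all assumptions made for the example, and the ROC-based measure is not covered.

```python
import math
from collections import Counter
from statistics import NormalDist


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def doc_term_counts(docs):
    # Bag-of-words transformation: one term-frequency Counter per document.
    return [Counter(tokenize(d)) for d in docs]


def contingency(term, bags, labels):
    # Count documents by class (1 = positive, 0 = negative) and term presence.
    tp = sum(1 for b, y in zip(bags, labels) if y == 1 and term in b)
    fp = sum(1 for b, y in zip(bags, labels) if y == 0 and term in b)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp, fp, pos - tp, neg - fp  # tp, fp, fn, tn


def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(tp, fp, fn, tn):
    # Class entropy minus class entropy conditioned on term presence.
    n = tp + fp + fn + tn
    prior = entropy([tp + fn, fp + tn])
    present, absent = tp + fp, fn + tn
    cond = 0.0
    if present:
        cond += present / n * entropy([tp, fp])
    if absent:
        cond += absent / n * entropy([fn, tn])
    return prior - cond


def bns(tp, fp, fn, tn, eps=0.0005):
    # Bi-normal separation: |F^-1(tpr) - F^-1(fpr)|, with F^-1 the inverse
    # standard normal CDF. Rates are clipped away from 0 and 1 (eps is an
    # assumed constant) so that the inverse CDF stays finite.
    # Assumes both classes contain at least one document.
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / (tp + fn), eps), 1 - eps)
    fpr = min(max(fp / (fp + tn), eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))


# Toy corpus, invented for illustration only.
docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda for monday",
    "project meeting notes now",
]
labels = [1, 1, 0, 0]
bags = doc_term_counts(docs)

for term in ("cheap", "now"):
    cells = contingency(term, bags, labels)
    print(term, information_gain(*cells), bns(*cells))
```

In this toy corpus the term "cheap" occurs only in positive documents and receives a high score under both measures, while "now" occurs equally in both classes and scores zero under both.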