CHALLENGES IN DETERMINING TERM RELEVANCE FOR TEXT DATA

Authors

  • A. Rehman, Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
  • K. Javed, Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan
  • H. A. Babri, Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan

Abstract

Although text data differs from ordinary non-text datasets in several ways, existing algorithms from the machine learning domain have been borrowed for text classification. These algorithms cannot be applied directly to raw text: the text must first be transformed into a suitable representation. This transformation creates further problems for feature selection and classification algorithms. In this paper we highlight the problems introduced by the transformation of text data. We also show how different feature selection algorithms, including bi-normal separation, information gain, and ROC, are affected by text data.
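
As a rough illustration of the pipeline the abstract describes, the sketch below turns a few raw documents into per-class term-presence counts and then scores terms with two of the named metrics, information gain and bi-normal separation (BNS). It is not taken from the paper: the toy documents, the whitespace tokenizer, the function names, and the 0.0005 clipping constant for BNS are all assumptions made for this example, and the ROC-based metric is not shown.

```python
import math
from statistics import NormalDist


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def term_counts(docs, labels, term):
    """Count documents containing `term` in each class.

    Returns (tp, fp, pos, neg): positive and negative documents that
    contain the term, followed by the two class sizes.
    """
    tp = fp = pos = neg = 0
    for text, label in zip(docs, labels):
        present = term in set(tokenize(text))
        if label == 1:
            pos += 1
            tp += present
        else:
            neg += 1
            fp += present
    return tp, fp, pos, neg


def entropy(a, b):
    """Binary entropy (in bits) of a two-way split with counts a and b."""
    total = a + b
    h = 0.0
    for count in (a, b):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(tp, fp, pos, neg):
    """Class entropy minus the entropy left after observing term presence."""
    n = pos + neg
    with_term = tp + fp
    without_term = n - with_term
    return (entropy(pos, neg)
            - (with_term / n) * entropy(tp, fp)
            - (without_term / n) * entropy(pos - tp, neg - fp))


def bns(tp, fp, pos, neg, eps=0.0005):
    """|F^-1(tpr) - F^-1(fpr)| with F the standard normal CDF.

    Rates are clipped to [eps, 1 - eps] (an assumed constant) so the
    inverse CDF stays finite.
    """
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))


# Hypothetical toy corpus: label 1 is the positive class.
docs = ["cheap pills offer", "cheap offer today",
        "meeting agenda today", "project meeting notes"]
labels = [1, 1, 0, 0]

for term in ("cheap", "today"):
    counts = term_counts(docs, labels, term)
    print(term, information_gain(*counts), bns(*counts))
```

In this toy corpus "cheap" appears only in positive documents and scores highly on both metrics, while "today" appears equally in both classes and scores zero on both.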

Published

26-05-2014

How to Cite

[1]
A. Rehman, K. Javed, and H. A. Babri, “CHALLENGES IN DETERMINING TERM RELEVANCE FOR TEXT DATA”, The Nucleus, vol. 51, no. 2, pp. 177–185, May 2014.

Section

Articles