OPTIMAL SAMPLING STRATEGY FOR DATA MINING
Abstract
Modern technologies such as the Internet, corporate intranets, data warehouses, ERP systems, satellites, digital sensors, embedded systems, and mobile networks generate such massive amounts of data that analyzing and understanding them has become very difficult, even with data mining tools. Huge datasets pose a serious challenge for classification algorithms: as the volume of data grows, data mining algorithms run more slowly and analysis becomes less interactive. Sampling can be a solution, since a sample can often provide the same level of accuracy while using only a fraction of the computing resources. Sampling must be carried out with care, however, because many factors are involved in determining the correct sample size. The approach proposed in this paper addresses this problem: based on a statistical formula and a small set of user-defined parameters, it returns a "sufficient sample size", which is then drawn through probability sampling. Results indicate the usefulness of this technique in coping with the problem of huge datasets.
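As a rough illustration of the workflow the abstract describes, the sketch below first computes a sample size from a standard statistical formula and then draws that many records through simple random sampling. A Cochran-style formula with a finite population correction is assumed here, along with illustrative parameter defaults and function names; the paper's actual formula and parameter settings are defined in the body of the paper.

import math
import random

def sufficient_sample_size(population_size, z=1.96, margin_of_error=0.05, p=0.5):
    # Assumed Cochran-style formula: n0 = z^2 * p * (1 - p) / e^2,
    # adjusted with a finite population correction for population_size records.
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

def simple_random_sample(records, sample_size, seed=None):
    # Probability sampling: every record has an equal chance of selection.
    rng = random.Random(seed)
    return rng.sample(records, sample_size)

# Example: a stand-in "huge" dataset of 1,000,000 records.
dataset = list(range(1_000_000))
n = sufficient_sample_size(len(dataset))   # 385 for the defaults above
sample = simple_random_sample(dataset, n, seed=42)
print(n, len(sample))

The sample returned this way can then be handed to any classification algorithm in place of the full dataset, which is the trade-off the abstract argues for.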