OPTIMAL SAMPLING STRATEGY FOR DATA MINING

Authors

  • A. Ghaffar, Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
  • M. Shahbaz, Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
  • W. Mahmood, Al-Khwarzmi Institute of Computer Sciences, University of Engineering and Technology, Lahore, Pakistan

Abstract

Modern technologies such as the Internet, corporate intranets, data warehouses, ERP systems, satellites, digital sensors, embedded systems, and mobile networks generate such massive amounts of data that analyzing and understanding them has become very difficult, even with data mining tools. Huge datasets pose a particular challenge for classification algorithms: as data volumes grow, data mining algorithms become slower and analysis becomes less interactive. Sampling can be a solution. Using a fraction of the computing resources, sampling can often provide the same level of accuracy. Sampling requires care, however, because many factors are involved in determining the correct sample size. The approach proposed in this paper addresses this problem. Based on a statistical formula, and after some parameters are set, it returns a sample size, called the “sufficient sample size”, which is then selected through probability sampling. Results indicate the usefulness of this technique in coping with the problem of huge datasets.
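
The abstract does not say which statistical formula or which parameters the paper uses. As a rough illustration only, the sketch below assumes Cochran's sample size formula for a proportion with a finite population correction, and simple random sampling as the probability sampling step. The function names, the default values (z = 1.96, margin of error 0.05, p = 0.5), and the toy dataset are all hypothetical and are not taken from the paper.

```python
import math
import random


def cochran_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size from Cochran's formula with finite population correction.

    ASSUMPTION: the abstract does not name its formula; Cochran's formula is
    used here purely as an example of a statistical sample size calculation.
    population: number of records in the full dataset
    z:          z-score for the desired confidence level (1.96 ~ 95%)
    margin:     tolerated margin of error
    p:          assumed proportion (0.5 is the most conservative choice)
    """
    if population <= 0:
        raise ValueError("population must be positive")
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction shrinks n0 for smaller datasets.
    n = n0 / (1 + (n0 - 1) / population)
    # Round up, but never ask for more records than exist.
    return min(population, math.ceil(n))


def simple_random_sample(records, n, seed=None):
    """Draw n records without replacement, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(records, n)


if __name__ == "__main__":
    data = list(range(20000))  # stand-in for a large dataset
    n = cochran_sample_size(len(data))
    sample = simple_random_sample(data, n, seed=0)
    print(n, len(sample))
```

With these assumed defaults the required sample grows very slowly with the dataset size, which is the general reason a well-chosen sample can stand in for a much larger dataset.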


Published

29-08-2013

How to Cite

[1]
A. Ghaffar, M. Shahbaz, and W. Mahmood, “OPTIMAL SAMPLING STRATEGY FOR DATA MINING”, The Nucleus, vol. 50, no. 3, pp. 219–228, Aug. 2013.
