SMGCD: METRICS FOR BIOLOGICAL SEQUENCE DATA
Abstract
In bioinformatics, the key challenges are to manage, store and retrieve biological data efficiently. Such data can be classified into structured, unstructured and semi-structured content. Semi-structured biological data typically consists of biological sequences. Complex biological sequences produce huge volumes of data, which in turn compound the problems of management, storage and retrieval. This paper proposes seven metrics to address these problems: the symmetry measure, molecular weight measure, similarity or diversity measure, size base measure, size gap measure, complexity measure, and size complexity diversity measure. Through mathematical formulations, these metrics measure a sequence's complexity, molecular weight, length with and without gaps, symmetry and similarity. The metrics are demonstrated and validated using a proposed hybrid technique that combines empirical evidence with theoretical formulation. This research opens new horizons for efficiently managing biological sequence data and for measuring the functionality and quality of metadata for single and multiple biological sequences.
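The abstract names the quantities the metrics capture but does not state their formulas. As a rough illustration only, the following Python sketch shows how comparable quantities could be computed for gapped DNA sequences. Every definition here is an assumption made for illustration and is not the paper's formulation: the gap alphabet `-` and `.`, position-wise identity as the similarity measure, reverse-complement palindromes as the notion of symmetry, and Shannon entropy as a stand-in for complexity. All function names are hypothetical.

```python
import math
from collections import Counter

GAP_CHARS = "-."  # assumed gap symbols; not specified in the abstract
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def size_with_gaps(seq: str) -> int:
    """Length of the sequence counting gap symbols."""
    return len(seq)


def size_without_gaps(seq: str) -> int:
    """Length of the sequence ignoring gap symbols."""
    return sum(1 for ch in seq if ch not in GAP_CHARS)


def identity(a: str, b: str) -> float:
    """Fraction of aligned positions holding the same non-gap symbol.

    Assumes the two sequences are already aligned (equal length).
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x not in GAP_CHARS)
    return matches / len(a)


def is_reverse_palindrome(seq: str) -> bool:
    """True if the ungapped DNA sequence equals its reverse complement."""
    bases = "".join(ch for ch in seq.upper() if ch not in GAP_CHARS)
    return bases == bases[::-1].translate(COMPLEMENT)


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol of the ungapped sequence."""
    bases = [ch for ch in seq if ch not in GAP_CHARS]
    if not bases:
        return 0.0
    total = len(bases)
    counts = Counter(bases)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    s1, s2 = "ACG-T", "ACGAT"
    print(size_with_gaps(s1), size_without_gaps(s1))
    print(identity(s1, s2))
    print(is_reverse_palindrome("GAATTC"))
    print(shannon_entropy("ACGT"))
```

A diversity score under the same assumptions could be taken as one minus the identity. A molecular weight measure is omitted because it would depend on a table of residue masses that the abstract does not give.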