Online Data Stream Learning and Classification with Limited Labels

Loo Hui Ru, Trias Andromeda, M. N. Marsono


Mining data streams in domains such as Internet traffic and network security is complex. Because storing the full stream is impractical, data stream analytics must be done in a single scan. This limits the time available to observe stream features and hence further complicates the data mining process. Traditional supervised data mining, with its batch-training nature, is not suitable for mining data streams. This paper proposes an algorithm for online data stream classification and learning with limited labels using selective self-training semi-supervised classification. The experimental results show that it achieves up to 99.6% average accuracy with 10% labeled data and 98.6% average accuracy with 1% labeled data. It can classify up to 34K instances per second.
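
The abstract names the technique but does not give its details, so the following is only a rough illustration of what selective self-training on a stream can look like in general. The base learner (an incremental nearest-centroid model), the confidence measure, the threshold of 0.9, and the synthetic two-class stream are all assumptions made for this sketch and are not taken from the paper.

```python
import math
import random


class CentroidModel:
    """Incremental nearest-centroid classifier (an assumed stand-in base learner)."""

    def __init__(self):
        self.count = {}  # label -> number of instances absorbed
        self.mean = {}   # label -> running mean vector

    def update(self, x, label):
        n = self.count.get(label, 0) + 1
        mu = self.mean.get(label, [0.0] * len(x))
        # Running-mean update: single pass, no instance is stored.
        self.mean[label] = [m + (xi - m) / n for m, xi in zip(mu, x)]
        self.count[label] = n

    def predict(self, x):
        """Return (label, confidence). Confidence is an assumed heuristic:
        a softmax over negative distances to the class centroids."""
        if not self.mean:
            return None, 0.0
        dist = {c: math.dist(x, mu) for c, mu in self.mean.items()}
        best = min(dist, key=dist.get)
        if len(dist) < 2:
            return best, 0.0  # one known class: never confident enough to self-train
        weights = {c: math.exp(-(d - dist[best])) for c, d in dist.items()}
        return best, weights[best] / sum(weights.values())


def run_stream(stream, threshold=0.9):
    """Consume (x, true_label, is_labeled) tuples in one scan.

    Each instance is classified first and then possibly used for training,
    so the returned accuracy is a test-then-train estimate.
    """
    model = CentroidModel()
    correct = total = 0
    for x, y, is_labeled in stream:
        guess, conf = model.predict(x)
        if guess is not None:
            total += 1
            correct += guess == y
        if is_labeled:
            model.update(x, y)          # supervised update with the true label
        elif guess is not None and conf >= threshold:
            model.update(x, guess)      # selective self-training on a confident guess
    return model, (correct / total if total else 0.0)


def synthetic_stream(n, labeled_fraction, seed=0):
    """Hypothetical two-class Gaussian stream; only a fraction carries labels."""
    rng = random.Random(seed)
    centers = {"a": (0.0, 0.0), "b": (6.0, 6.0)}
    for _ in range(n):
        y = rng.choice(sorted(centers))
        x = [rng.gauss(c, 1.0) for c in centers[y]]
        yield x, y, rng.random() < labeled_fraction


if __name__ == "__main__":
    _, accuracy = run_stream(synthetic_stream(2000, labeled_fraction=0.1))
    print(f"test-then-train accuracy: {accuracy:.3f}")
```

The "selective" part is the threshold check: an unlabeled instance updates the model only when the prediction is confident, which limits how far a wrong guess can propagate. Accuracy on this toy stream says nothing about the figures reported in the abstract.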


Online classification; semi-supervised; data stream mining; incremental learning


