PhD Research Proposal in Ensemble Machine Learning for Big Data Stream Processing

In recent years, advances in sensor and communication technology have enabled the rapid generation of continuous, potentially unbounded data, referred to as data streams [1]. Stream data poses a severe challenge for data mining and machine learning methods because its size and speed evolve over time. The machine learning and computational intelligence community has therefore shifted its attention towards ensemble methods, in which a set of learning algorithms, also referred to as an ensemble system, is applied to data stream analysis [2]. Ensemble methods aim to improve the accuracy of a prediction model by training multiple learning algorithms, or weak learners, and combining them to mitigate the variance among the individual learners and improve the performance of automated decision-making systems.
Ensemble methods have been widely used for feature selection, incremental learning, confidence estimation, error correction, handling missing features and imbalanced data, and learning under concept drift. An ensemble reaches its decision by weighing the diverse opinions of its members and merging them to predict previously unseen data records accurately. Ensemble learning currently has several real-time applications, including image recognition, data mining, scene segmentation and analysis, object identification and tracking, information retrieval, characterization of computer security issues, bankruptcy prediction, credit card fraud detection, and species distribution prediction [3].
Despite several developments in data stream mining, many research issues and challenges remain unresolved. Notably, in a dynamically evolving data stream, the relation between the attributes and the target values is most likely only locally valid, which complicates the mining of the stream. Another open problem is the tuning of streaming ensemble parameters, which requires additional attention. Most streaming ensembles handle only a single stream. However, several applications involve many parallel streams, for instance studies on censored data and internet messages, in which similar data events occur at different moments in time and may have different descriptions, which poses further challenges.
In the stream data context, processing asynchronously arriving data and delayed information with ensembles remains complex. Work on stream data mining often suffers from concept drift, class imbalance, missing values, limited labeled instances, temporal dependencies, overfitting, novel classes, and insufficient resources [5]. Ensemble learning methods are nevertheless capable of handling large-scale stream data under concept drift. Even so, the detailed characteristics of drifts have not yet been studied systematically. Developing ensembles that handle the various categories of drift is therefore a non-trivial task. In addition, unexpected changes in stream data complicate the multi-label classification task.
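The idea of weighing the members' outputs and merging them into one decision can be illustrated with a weighted majority vote. The following is a minimal sketch only, not a method taken from the cited surveys; the function name `weighted_vote`, the labels, and the weights are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Merge class labels from several learners by weighted majority vote.

    predictions: one predicted label per ensemble member (hypothetical input).
    weights: one non-negative weight per member, e.g. its recent accuracy.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight per prediction is required")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # The label with the largest accumulated weight wins.
    return max(totals, key=totals.get)


# Three hypothetical members disagree; the single confident member outweighs
# the two weaker ones (0.9 versus 0.3 + 0.4).
votes = ["fraud", "normal", "normal"]
member_weights = [0.9, 0.3, 0.4]
decision = weighted_vote(votes, member_weights)
```

With equal weights the same function reduces to a plain majority vote, which is one way to see how the weighting lets stronger members dominate the merged decision.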
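To make the concept drift scenario concrete, the sketch below shows one simple way an online ensemble can shift trust between its members after the target concept changes. It follows the spirit of the classic Weighted Majority scheme (each member's weight is multiplied by a penalty factor whenever it errs) and is an illustrative assumption, not the approach proposed here or in the cited works. The class `WeightedMajorityEnsemble`, the two rule-based members, and the synthetic stream are all hypothetical.

```python
class WeightedMajorityEnsemble:
    """Online ensemble that down-weights members after each mistake."""

    def __init__(self, experts, beta=0.5):
        self.experts = list(experts)  # callables mapping x -> label
        self.beta = beta              # penalty factor in (0, 1)
        self.weights = [1.0] * len(self.experts)

    def predict(self, x):
        scores = {}
        for expert, weight in zip(self.experts, self.weights):
            label = expert(x)
            scores[label] = scores.get(label, 0.0) + weight
        return max(scores, key=scores.get)

    def update(self, x, y):
        # Penalize every member that mislabeled the revealed example.
        for i, expert in enumerate(self.experts):
            if expert(x) != y:
                self.weights[i] *= self.beta
        # Renormalize so the weights stay in a numerically safe range.
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]


def rule_positive(x):
    """Hypothetical member: predicts class 1 for positive inputs."""
    return int(x > 0)


def rule_negative(x):
    """Hypothetical member: predicts class 1 for non-positive inputs."""
    return int(x <= 0)


def synthetic_stream():
    """Yield 50 (x, y) pairs; the labeling rule flips after item 20."""
    values = [1.0, -1.0] * 25
    for t, x in enumerate(values):
        y = int(x > 0) if t < 20 else int(x <= 0)
        yield x, y


ensemble = WeightedMajorityEnsemble([rule_positive, rule_negative])
mistakes = 0
for x, y in synthetic_stream():
    # Test-then-train: predict first, then learn from the true label.
    if ensemble.predict(x) != y:
        mistakes += 1
    ensemble.update(x, y)
```

After the drift, the member that matched the old concept keeps losing weight until the other member takes over. In this plain scheme the recovery takes about as many errors as the advantage the old member had built up, which hints at why streaming ensembles often add forgetting or explicit drift detection.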

References:

  • [1] Namiot, Dmitry, “On big data stream processing”, International Journal of Open Information Technologies, Vol.3, No.8, 2015.

  • [2] Gomes, Heitor Murilo, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet, “A survey on ensemble learning for data stream classification”, ACM Computing Surveys (CSUR), Vol.50, No.2, Article 23, 2017.

  • [3] Rahman, Akhlaqur, and Sumaira Tasnim, “Ensemble classifiers and their applications: a review”, arXiv preprint arXiv:1404.4088, 2014.

  • [4] Krawczyk, Bartosz, Leandro L. Minku, João Gama, Jerzy Stefanowski, and Michał Woźniak, “Ensemble learning for data stream analysis: A survey”, Information Fusion, Vol.37, pp.132-156, 2017.

  • [5] Krempl, Georg, Indre Žliobaite, Dariusz Brzeziński, Eyke Hüllermeier, Mark Last, Vincent Lemaire, Tino Noack et al., “Open challenges for data stream mining research”, ACM SIGKDD Explorations Newsletter, Vol.16, No.1, pp.1-10, 2014.