Predicting Alert Source Device using Machine Learning Algorithms
Bharath M. B.1, D. V. Ashoka2

1Bharath M. B.*, Dell EMC Software & Services India Pvt. Ltd, Bangalore, India.
2Dr. D. V. Ashoka, JSS Academy of Technical Education, Bangalore, India.
Manuscript received on June 12, 2020. | Revised Manuscript received on June 25, 2020. | Manuscript published on July 10, 2020. | PP: 1-10 | Volume-9 Issue-9, July 2020. | Retrieval Number: 100.1/ijitee.D1526029420| DOI: 10.35940/ijitee.D1526.079920
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: In a large distributed virtualized environment, predicting the alerting source from the alert text is a daunting task. This paper explores the use of machine learning algorithms to solve this problem. Our training dataset is highly imbalanced: 96% of the alert data is reported by 24% of the alerting sources. Such a distribution is expected in any live distributed virtualized environment, where newer versions of a device generate relatively fewer alerts than older devices. Any classification effort with such an imbalanced dataset presents a different set of challenges than binary classification. This skewed data distribution makes conventional machine learning less effective, especially when predicting alerts from minority device types. Our challenge is to build a robust model that can cope with this imbalanced dataset and achieve a relatively high level of prediction accuracy. This research work started with traditional regression and classification algorithms using a bag-of-words model. Then word2vec and doc2vec models were used to represent the words as vectors that preserve the semantic meaning of the sentence, so that alert texts with similar messages have similar vector representations. This vectorized alert text was used with Logistic Regression for model building. This approach yields better accuracy, but the model is relatively complex and demands more computational resources. Finally, a simple neural network, built with the Keras and TensorFlow libraries, was applied to this multi-class text classification problem. A simple two-layer neural network yielded 99% accuracy, even though our training dataset was not balanced. This paper presents a qualitative evaluation of the different machine learning algorithms and their respective results. The two-layer deep learning model is selected as the final solution, since it requires relatively less time and fewer resources while achieving better accuracy.
Keywords: Fault management, Unstructured data, Machine learning, Event classification.
Scope of the Article: Machine Learning
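
The abstract's point about skewed data can be illustrated with a minimal sketch. It is not the authors' code or dataset: the device-type labels ("switch", "array", "host", "fabric"), the alert counts, and the helper names `head_share`, `accuracy`, and `per_class_recall` are all hypothetical. The sketch only shows why overall accuracy can look high on an imbalanced dataset while minority alerting sources are never predicted correctly.

```python
from collections import Counter


def head_share(labels, source_fraction):
    """Share of alerts produced by the most active `source_fraction` of sources."""
    counts = sorted(Counter(labels).values(), reverse=True)
    k = max(1, round(source_fraction * len(counts)))
    return sum(counts[:k]) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of alerts whose predicted source matches the true source."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each source: correctly predicted alerts / alerts truly from that source."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in totals.items()}


# Hypothetical toy labels (assumption): four device types, heavily skewed.
y_true = ["switch"] * 96 + ["array"] * 2 + ["host"] * 1 + ["fabric"] * 1

# Naive baseline (assumption): always predict the most frequent source.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

print("share of alerts from top 25% of sources:", head_share(y_true, 0.25))
print("overall accuracy of the baseline:", accuracy(y_true, y_pred))
print("per-source recall of the baseline:", per_class_recall(y_true, y_pred))
```

In this toy setup the baseline scores high on accuracy yet has zero recall for every minority source, which is why a per-source metric is a useful companion to accuracy when comparing models on data of this shape.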