Improved K-Means Map Reduce Algorithm for Big Data Cluster Analysis
Agnivesh1, Rajiv Pandey2, Amarjeet Singh3

1Agnivesh, AIIT, Amity University, Lucknow, India.
2Dr. Rajiv Pandey, AIIT, Amity University, Lucknow, India.
3Dr. Amarjeet Singh, Department of Computer Science, Sriram Institute of Technology and Management, Kashipur, India.
Manuscript received on 28 May 2019 | Revised Manuscript received on 05 June 2019 | Manuscript published on 30 June 2019 | PP: 1796-1802 | Volume-8 Issue-8, June 2019 | Retrieval Number: H6840058719/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license.

Abstract: In the present era of big data, large volumes of widely varied data are generated at high velocity every day. These data contain valuable information that is not yet known. Mining and extracting knowledge from them requires fast and scalable big data analytics. Clustering is an important data mining technique, and K-means clustering is of particular interest because of its simplicity. However, K-means has certain limitations when analyzing big data, which leaves scope for improvement. Distributed processing frameworks and algorithms help meet the performance and scalability needs of analyzing big datasets. This research work designs a parallel K-means clustering algorithm by improving standard K-means in the MapReduce paradigm. The proposed work presents a method for finding the initial cluster seeds instead of selecting them randomly, which is a major drawback of standard K-means when clustering big data. The research also reduces the dependence on repeated MapReduce iterations. Moreover, the presented algorithm considers both between-cluster separation and within-cluster compactness to achieve accurate clustering. The implementation runs in the cloud on Amazon Elastic MapReduce 5.x, which distributes the clustering job across multiple nodes in parallel using low-cost machines. The proposed algorithm is evaluated on real datasets from the UC Irvine Machine Learning Repository. The results indicate that the proposed algorithm achieves higher performance than classical K-means when clustering large datasets.
Keywords: Artificial Intelligence, Big Data, Cloud Computing, K-Means, MapReduce

Scope of the Article: Big Data
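
As a rough illustration of the MapReduce formulation of K-means that the abstract builds on, the sketch below simulates one standard K-means iteration in plain Python: the map step assigns each point to its nearest centroid, and the reduce step averages the points in each cluster. It also computes one simple measure each of within-cluster compactness and between-cluster separation. The abstract does not describe the paper's seed-selection method or how it reduces iteration dependence, so neither is shown here. All function names (`map_point`, `reduce_cluster`, `kmeans_iteration`, `within_cluster_sse`, `min_centroid_separation`) and the choice of the two measures are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points emitted for one centroid index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """Simulate one map -> shuffle -> reduce pass of standard K-means."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid that receives no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def within_cluster_sse(points, centroids):
    """Compactness (illustrative): sum of squared distances to the nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


def min_centroid_separation(centroids):
    """Separation (illustrative): smallest distance between any two centroids."""
    return min(
        dist(a, b)
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )
```

In a real Hadoop or Amazon EMR job the grouping done here by `groups` would be handled by the framework's shuffle phase, and each iteration would normally be a separate job, which is the per-iteration cost the abstract says the proposed algorithm aims to reduce.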