Enhancing K-means for Multidimensional Big Data Clustering using R on Cloud
Agnivesh1, Rajiv Pandey2, Amarjeet Singh3

1Agnivesh, AIIT, Amity University, Lucknow, India
2Dr. Rajiv Pandey, AIIT, Amity University, Lucknow, India
3Dr. Amarjeet Singh, Department of Computer Science, Sriram Institute of Technology and Management, Kashipur, India
Manuscript received on 05 May 2019 | Revised Manuscript received on 12 May 2019 | Manuscript published on 30 May 2019 | PP: 697-703 | Volume-8 Issue-7, May 2019 | Retrieval Number: G5574058719/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: A critical problem with K-means clustering is that it converges only to a local optimum, which is easier to find than the global optimum but can yield a poorer clustering. This is particularly true for big data, because the initial centers strongly influence the performance of the algorithm. This paper proposes a novel K-means algorithm that includes a method for finding optimized locations of the initial centers and the initial number of clusters. As a result, the final set of clusters converges globally, facilitating fast and accurate clustering of large datasets. Cloud computing supports massive-scale, complex computation, and parallelism allows large amounts of data to be analyzed inexpensively and efficiently. To achieve parallelism and scalable computing, RStudio Server is deployed on an Amazon Web Services Elastic Compute Cloud (EC2) instance, which divides the job among several nodes. The proposed methodology delivers very competitive performance: it takes considerably less computation time and is cost-effective. It can be compared with the more complex Hadoop Distributed File System and MapReduce. A major drawback of Apache Hadoop is its MapReduce paradigm, which performs poorly when a process iterates many times. R executes in memory, which is faster and less complex than the repeated reads and writes to disk in MapReduce. The research work is simulated on several popular real datasets from the UCI Machine Learning Repository. The results confirm that the proposed work provides a robust and scalable technique for clustering big datasets.
Keywords: Artificial Intelligence, Big Data, Cloud Computing, K-means, MapReduce, R
Scope of the Article: Artificial Intelligence
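
The abstract's remark that K-means converges only to a local optimum, and that the initial centers matter, can be illustrated with a small sketch. This is not the authors' proposed algorithm or their R implementation; it is standard Lloyd-style K-means written in Python purely for illustration, and the function names, the four-point dataset, and both sets of starting centers are assumptions chosen to make the effect visible.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def assign(points: List[Point], centers: List[Point]) -> List[int]:
    """Assignment step: index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def lloyd_kmeans(
    points: List[Point], init_centers: List[Point], max_iter: int = 100
) -> Tuple[List[Point], float]:
    """Standard Lloyd iteration from the given initial centers.

    Returns the final centers and the sum of squared errors (SSE).
    """
    centers = list(init_centers)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Update step: move the center to the mean of its members.
                new_centers.append(
                    (
                        sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members),
                    )
                )
            else:
                # Keep an empty cluster's center where it is.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, sse


# Hypothetical toy data: corners of a wide rectangle (two natural clusters).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Two illustrative initializations with k = 2.
centers_good, sse_good = lloyd_kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
centers_poor, sse_poor = lloyd_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

print("left/right start:", centers_good, "SSE =", sse_good)
print("top/bottom start:", centers_poor, "SSE =", sse_poor)
```

Both runs converge immediately, but to different fixed points: the left/right start reaches an SSE of 1.0, while the top/bottom start stays at an SSE of 16.0. The second result is a stable local optimum that further iterations cannot escape, which is the behavior that motivates choosing the initial centers carefully.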