A Benchmark for Suitability of Alluxio over Spark
Kanchana Rajaram1, Kavietha Haridass2

1Kanchana Rajaram, Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India.
2Kavietha Haridass*, Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India. 

Manuscript received on September 25, 2020. | Revised Manuscript received on November 06, 2020. | Manuscript published on November 10, 2021. | PP: 245-250 | Volume-10 Issue-1, November 2020 | Retrieval Number: 100.1/ijitee.A81901110120 | DOI: 10.35940/ijitee.A8190.1110120
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Big data applications play an important role in real-time data processing. Apache Spark is a data processing framework with an in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. However, Spark’s in-memory processing cannot share data between applications, and RAM alone is insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data held in different storage systems. Alluxio helps to speed up data-intensive Spark applications across various storage systems. In this work, the performance of applications on Spark alone and on Spark running over Alluxio is studied with respect to storage formats such as Parquet, ORC, CSV, and JSON, and four types of queries from the Star Schema Benchmark (SSB). A benchmark is developed to assess the suitability of the Spark-Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases larger than 2.6 GB storing data in JSON and CSV formats. Spark is found suitable for applications that use storage formats such as Parquet and ORC with database sizes below 2.6 GB.
Keywords: Alluxio, Spark, File formats, Benchmark, Performance, VDFS.
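
The suitability findings stated in the abstract can be read as a simple lookup rule. The following is a minimal sketch of that rule only, not the authors' benchmark code: the function name `suggest_engine` and the constants are hypothetical, and the sketch returns `None` for any combination the abstract does not cover (for example, Parquet or ORC above 2.6 GB, or a database of exactly 2.6 GB).

```python
from typing import Optional

# Threshold reported in the abstract (database size in GB).
SIZE_THRESHOLD_GB = 2.6

# Format groups named in the abstract's findings.
TEXT_FORMATS = {"json", "csv"}
COLUMNAR_FORMATS = {"parquet", "orc"}


def suggest_engine(storage_format: str, size_gb: float) -> Optional[str]:
    """Return the setup the abstract reports as suitable, or None when the
    abstract makes no claim for this combination (hypothetical helper)."""
    fmt = storage_format.strip().lower()
    if fmt in TEXT_FORMATS and size_gb > SIZE_THRESHOLD_GB:
        return "Spark over Alluxio"
    if fmt in COLUMNAR_FORMATS and size_gb < SIZE_THRESHOLD_GB:
        return "Spark"
    return None  # combination not covered by the abstract's stated findings


if __name__ == "__main__":
    for fmt, size in [("CSV", 5.0), ("Parquet", 1.0), ("ORC", 5.0)]:
        print(fmt, size, "->", suggest_engine(fmt, size))
```

Returning `None` rather than a default keeps the sketch from implying a recommendation for cases the abstract leaves open.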