Effective Utilization of Storage Space by Applying File Level and Block-Level Deduplication over HDFS
Sachin Arun Thanekar1, Kodukula Subrahmanyam2, Ali Akbar Bagwan3

1Sachin Arun Thanekar, Ph.D. Scholar, Department of Computer Science & Engineering, KLEF, Vaddeswaram, Guntur, Andhra Pradesh, India.

2Kodukula Subrahmanyam, Professor, Department of Computer Science & Engineering, KLEF, Vaddeswaram, Guntur, Andhra Pradesh, India.

3Ali Akbar Bagwan, Professor, Department of Computer Engineering, Rajarshi Shahu College of Engineering, Tathwade, Pune (Maharashtra), India.

Manuscript received on 08 April 2019 | Revised Manuscript received on 15 April 2019 | Manuscript Published on 26 April 2019 | PP: 725-730 | Volume-8 Issue-6S April 2019 | Retrieval Number: F61600486S19/19©BEIESP

© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open-access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: The Hadoop framework handles the storage and processing of huge data volumes efficiently, using large clusters of commodity hardware to store and process big data in a distributed fashion. Its open-source nature, massive data-handling capability, and fast processing have made it very popular. However, the existing Hadoop framework discards the metadata of previous jobs: it allocates DataNodes without regard to what it has processed earlier, so every new job reads data from all DataNodes again. No provision is made for checking relationships between similar data blocks, which weakens overall Hadoop performance. Uploaded big-data files are partitioned into a number of blocks that are distributed over the node cluster. To avoid random block distribution and data duplication, a deduplication system is used. Such systems, however, focus only on space management and merely keep track of data files on the Hadoop Distributed File System (HDFS); they do not contribute to efficient job execution in the MapReduce environment. For efficient job execution, data-locality information and job metadata are stored, so that the time required to execute the same job again can be reduced by preserving its metadata. The combined environment produces efficient job-execution results together with efficient space management.
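The block-level deduplication idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes fixed-size blocks (tiny here; HDFS defaults to 128 MB) and uses a SHA-256 digest as the block fingerprint, storing each unique block only once while keeping a per-file "recipe" of digests for reconstruction.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; illustrative only (HDFS blocks are 128 MB by default)

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, store each unique block once
    (keyed by its SHA-256 digest), and return the digest list ("recipe")
    needed to reconstruct the original data."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of a block
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original byte stream from the digest recipe."""
    return b"".join(store[d] for d in recipe)

# Three identical "abcd" blocks collapse to one stored block plus the "xyz" tail.
store = {}
recipe = deduplicate(b"abcdabcdabcdxyz", store)
assert len(store) == 2
assert restore(recipe, store) == b"abcdabcdabcdxyz"
```

Preserving the digest-to-block map across jobs mirrors, in miniature, the paper's point that keeping metadata between runs avoids re-reading and re-storing data that has already been processed.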

Keywords: HDFS, Hadoop, MapReduce, Big Data, H2Hadoop.
Scope of the Article: Computer Science and Its Applications