Semantic Deduplication in Databases
Anju K S1, Sadhik M S2, Surekha Mariam Varghese3

1Anju K S, Department of Computer Science and Engineering, Mar Athanasius College of Engineering, Kothamangalam, India.

2Sadhik M S, Department of Computer Science and Engineering, Mar Athanasius College of Engineering, Kothamangalam, India.

3Surekha Mariam Varghese, Department of Computer Science and Engineering, Mar Athanasius College of Engineering, Kothamangalam, India.

Manuscript received on 08 April 2019 | Revised Manuscript received on 15 April 2019 | Manuscript Published on 26 April 2019 | PP: 581-585 | Volume-8 Issue-6S April 2019 | Retrieval Number: F61180486S19/19©BEIESP

© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open-access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: The presence of semantic duplicates poses a challenge to the quality management of large datasets such as medical datasets and recommendation systems. The huge number of duplicates in large databases necessitates deduplication. Deduplication is a capacity-optimization technique used to dramatically improve storage efficiency. It requires identifying copies with an approach robust enough to find as many duplicates as possible, yet efficient enough to run in a reasonable time. A similarity-based data deduplication is proposed by combining Content Defined Chunking (CDC) with a Bloom filter. These methods look inside the files to determine which portions of the data are duplicates, yielding better storage space savings. A Bloom filter is a probabilistic data structure, used here mainly to reduce the search time. To enhance the performance of the system, Locality Sensitive Hashing (LSH) and Word2Vec are also used; these two techniques identify the semantic similarity between chunks. Within LSH, the Levenshtein distance algorithm measures the similarity between the chunks in the repository. Deduplication based on semantic similarity checking improves storage utilization and effectively reduces computation overhead.
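The pipeline described in the abstract can be illustrated with a minimal sketch: content-defined chunking via a simple rolling hash, a Bloom filter to skip probable exact-duplicate chunks quickly, and Levenshtein distance to flag near-duplicate chunks. All names, parameters, and thresholds below are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a CDC + Bloom-filter + Levenshtein deduplication
# pipeline. Parameters (mask, window, threshold) are illustrative only.
import hashlib


class BloomFilter:
    """Probabilistic set membership: no false negatives, rare false positives."""

    def __init__(self, size=8192, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def may_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def _rolling(window_text):
    # Deterministic polynomial hash over the sliding window.
    h = 0
    for ch in window_text:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


def cdc_chunks(data, mask=0x3F, window=4, min_len=8):
    """Content-defined chunking: cut where the rolling hash of the last
    `window` characters matches a boundary pattern (hash & mask == 0)."""
    chunks, start = [], 0
    for i in range(min_len, len(data)):
        if _rolling(data[max(0, i - window):i]) & mask == 0 and i - start >= min_len:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dedupe(data, threshold=0.8):
    """Store only chunks that are neither exact nor near duplicates."""
    bf, stored = BloomFilter(), []
    for chunk in cdc_chunks(data):
        if bf.may_contain(chunk):  # probable exact duplicate, skip
            continue
        similar = any(
            1 - levenshtein(chunk, s) / max(len(chunk), len(s)) >= threshold
            for s in stored
        )
        if similar:  # near duplicate of a stored chunk, skip
            continue
        bf.add(chunk)
        stored.append(chunk)
    return stored
```

On highly repetitive input, `dedupe` stores only a small fraction of the chunks produced by `cdc_chunks`; the Bloom filter keeps the exact-match check cheap, while the pairwise Levenshtein comparison is the part LSH would accelerate in a full system by only comparing chunks that hash to the same bucket.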

Keywords: Deduplication, Locality Sensitive Hashing, Bloom Filter.
Scope of the Article: Computer Science and Its Applications