A Semi-Automated Record De-Duplication Technique for a Data Warehouse Environment
Vaishali C. Wangikar1, Sachin N. Deshmukh2, Sunil G. Bhirud3

1Vaishali Wangikar*, Research Scholar, Department of Computer Engineering and Technology, MIT Academy of Engineering, Alandi, Pune, India.
2Sachin Deshmukh, Professor, Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India.
3Sunil Bhirud, Professor, Department of Computer Engineering and Information Technology, Veermata Jeejabai Technological University, Matunga, Mumbai, India.
Manuscript received on December 15, 2019. | Revised Manuscript received on December 20, 2019. | Manuscript published on January 10, 2020. | PP: 2914-2920 | Volume-9 Issue-3, January 2020. | Retrieval Number: B6265129219/2020©BEIESP | DOI: 10.35940/ijitee.B6265.019320
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: The quality of record de-duplication is a key factor in the decision-making process. Correct identification of duplicates in a dataset provides a strong foundation for inference. Blocking is a popular technique in de-duplication. In the traditional de-duplication process, the blocking key is chosen by a domain expert. In real-time systems, automating blocking key generation is a primary requirement. The objective of this paper is blocking key generation without any human intervention. The proposed Automated Token Formation (ATF) algorithm generates blocking keys in a fully automated way. For all datasets tested (Cora, Restaurant, and FEBRL), the attributes shortlisted by ATF are nearly the same as those selected manually. The tokens produced by ATF give results 20% poorer than manual tokens on the Cora dataset, while on the other two datasets the results match those of manual tokens. To improve result quality, ATF is modified into the Semi-Automated Token Formation (SATF) algorithm, a semi-automated approach that requires training data. SATF performs better than both the manual tokens and the tokens produced by ATF.
Keywords: Automated Blocking Key Formation, Record De-duplication, Record Linkage, Semi-Automated Blocking Key Generation, Automated Record Linkage, Unsupervised Record Linkage Token Formation.
Scope of the Article:  Environmental Engineering
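
The abstract refers to blocking with a blocking key but does not spell out the mechanics. As a rough illustration only, the Python sketch below shows generic key-based blocking: records that share a key land in the same block, and only pairs within a block are compared. It is not the ATF or SATF algorithm. The key scheme (lower-cased attribute prefixes), the function names, the `prefix_len` parameter, and the toy records are all assumptions made for this example and are not taken from the paper or its datasets.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record, attributes, prefix_len=3):
    """Concatenate lower-cased prefixes of the chosen attributes.

    Hypothetical key scheme for illustration; the paper's tokens may differ.
    """
    return "".join(
        str(record.get(attr, "")).strip().lower()[:prefix_len]
        for attr in attributes
    )


def candidate_pairs(records, attributes, prefix_len=3):
    """Group record indices by blocking key and pair up indices within each block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec, attributes, prefix_len)].append(idx)

    pairs = set()
    for members in blocks.values():
        # Only records sharing a key are compared, which avoids the
        # all-against-all comparison over the whole dataset.
        pairs.update(combinations(members, 2))
    return pairs


# Toy records invented for this sketch (not from Cora, Restaurant, or FEBRL).
records = [
    {"name": "Golden Spoon Diner", "city": "Pune"},
    {"name": "Golden Spoon", "city": "Pune"},
    {"name": "Harbour Cafe", "city": "Pune"},
]

pairs = candidate_pairs(records, ["name", "city"])
print(pairs)
```

In this sketch the attribute list passed to `candidate_pairs` plays the role of the manually chosen blocking key; an automated approach would have to select those attributes without an expert.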