Document Clustering based on Phrase and Single Term Similarity using Neo4j
Preeti Kathiria1, Harshal Arolkar2

1Preeti Kathiria*, Student, Computer Science, Gujarat Technological University, Ahmedabad, India; Assistant Professor, Computer Science and Engineering, Nirma University, Ahmedabad, India.
2Harshal Arolkar, Faculty of Computer Technology, GLS University, Ahmedabad, India.
Manuscript received on December 18, 2019. | Revised Manuscript received on December 20, 2019. | Manuscript published on January 10, 2020. | PP: 3188-3192 | Volume-9 Issue-3, January 2020. | Retrieval Number: C9050019320/2020©BEIESP | DOI: 10.35940/ijitee.C9050.019320
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Document similarity generally relies on single-term similarity measures such as cosine similarity. To achieve better document similarity, phrases, which are more informative features, can be used along with single terms. The Document Index Graph (DIG) representation model is used to find the phrases shared across the corpus. The DIG model constructs the graph incrementally and, as each document is inserted, finds the phrases it shares with the previously inserted documents. The similarity between documents depends mainly on the number of shared phrases and on single-term similarity; this combination is known as hybrid similarity. The hybrid similarities are used with the well-known density-based clustering technique DBSCAN to assess their effect on cluster quality. Experimental results show that hybrid similarity gives a more accurate degree of document similarity and produces more cohesive clusters.
Keywords: DBSCAN Clustering, Document Index model, Neo4j Graph Database, Phrase Based Similarity
Scope of the Article: Clustering
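
The hybrid similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that shared phrases are approximated by shared word bigrams (a simplified stand-in for the DIG model), that phrase similarity is the Jaccard overlap of those bigrams, and that the two scores are blended with a hypothetical weight `alpha`. All function names are illustrative.

```python
import math
from collections import Counter


def tokens(text):
    # Lowercase whitespace tokenization; a real pipeline would also
    # strip punctuation and stop words.
    return text.lower().split()


def cosine_similarity(a, b):
    # Single-term similarity over raw term-frequency vectors.
    ta, tb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def phrases(text, n=2):
    # Assumption: a "phrase" is any run of n consecutive words.
    toks = tokens(text)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def phrase_similarity(a, b, n=2):
    # Assumption: Jaccard overlap of the two phrase sets.
    pa, pb = phrases(a, n), phrases(b, n)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def hybrid_similarity(a, b, alpha=0.5):
    # alpha is a hypothetical blending weight: 1.0 uses phrases only,
    # 0.0 uses single terms only.
    return alpha * phrase_similarity(a, b) + (1 - alpha) * cosine_similarity(a, b)
```

Under these assumptions, two documents that contain the same words in a different order receive a full cosine score but a phrase score of zero, so the hybrid score separates them from true near-duplicates. A matrix of `1 - hybrid_similarity(...)` values could then serve as the pairwise distance input to a density-based clusterer such as DBSCAN.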