Identification of Duplication in Questions Posed on Knowledge Sharing Platform Quora using Machine Learning Techniques
R. Rishickesh1, R.P. Ram Kumar2, A. Shahina3, A. Nayeemulla Khan4
1R. Rishickesh, Department of Information Technology, SSN College of Engineering, Kalavakkam, India.
2R.P. Ram Kumar, Department of Information Technology, SSN College of Engineering, Kalavakkam, India.
3A. Shahina, Department of Information Technology, SSN College of Engineering, Kalavakkam, India.
4A. Nayeemulla Khan, School of Computing Science and Engineering, VIT University, Chennai, India.
Manuscript received on September 17, 2019. | Revised Manuscript received on September 25, 2019. | Manuscript published on October 10, 2019. | PP: 2444-2451 | Volume-8 Issue-12, October 2019. | Retrieval Number: L30171081219/2019©BEIESP | DOI: 10.35940/ijitee.L3017.1081219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Quora, an online question-answering platform, has many duplicate questions, i.e. questions that convey the same meaning. Since the platform is open to all users, anyone can pose a question any number of times, which increases the count of duplicate questions. This paper uses a dataset of question pairs (taken from the Quora website), with the two questions in separate columns and a label indicating whether the pair is a duplicate. Traditional comparison methods such as Sequence matcher perform a letter-by-letter comparison without using contextual information, and hence give lower accuracy. Machine learning methods predict similarity using features extracted from the context. This study compares the traditional methods with the machine learning methods. The features for the machine learning methods are extracted using two Bag of Words models, Count-Vectorizer and TFIDF-Vectorizer. Among the traditional comparison methods, Sequence matcher gave the highest accuracy, 65.29%. Among the machine learning methods, XGBoost gave the highest accuracy: 80.89% with Count-Vectorizer and 80.12% with TFIDF-Vectorizer.
Keywords: Quora Question Pairs, Count-Vectorizer, TFIDF-Vectorizer, Machine Learning Methods
Scope of the Article: Machine Learning
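The contrast the abstract draws between letter-by-letter comparison and Bag of Words features can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses Python's standard `difflib.SequenceMatcher` for the character-level score and a hand-rolled word-count vector as a stand-in for Count-Vectorizer. The function names `char_similarity`, `bag_of_words`, and `cosine_similarity` and the example questions are assumptions made for illustration, and no classifier such as XGBoost is trained here.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def char_similarity(q1: str, q2: str) -> float:
    """Character-level similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, q1.lower(), q2.lower()).ratio()


def bag_of_words(text: str) -> Counter:
    """Word-count vector for one question (illustrative stand-in for Count-Vectorizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical question pair: same words, different order.
    q1 = "learn python how"
    q2 = "how learn python"
    print("character-level:", char_similarity(q1, q2))
    print("bag of words   :", cosine_similarity(bag_of_words(q1), bag_of_words(q2)))
```

For a reordered pair like the one above, the character-level score drops even though the word counts are identical. In a pipeline of the kind the abstract describes, count vectors like these would serve as input features to a classifier rather than being compared directly.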