Text Mining For Multiclass Research Paper Categorization
Amber Saxena1, Anamika2, Bhaskar Pant3, Vikas Tripathi4
1Amber Saxena*, Computer Science and Engineering, Graphic Era deemed to be University, Dehradun, India.
2Anamika, Computer Science and Engineering, Graphic Era deemed to be University, Dehradun, India.
3Bhaskar Pant, Computer Science and Engineering, Graphic Era deemed to be University, Dehradun, India.
4Vikas Tripathi, Computer Science and Engineering, Graphic Era deemed to be University, Dehradun, India.
Manuscript received on November 16, 2019. | Revised Manuscript received on November 25, 2019. | Manuscript published on December 10, 2019. | PP: 2612-2615 | Volume-9 Issue-2, December 2019. | Retrieval Number: B7240129219/2019©BEIESP | DOI: 10.35940/ijitee.B7240.129219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Research papers are a rich source of academic and innovative writing on a particular topic, and they are unstructured in nature. Document categorization is the classification of documents into predefined classes. It is arduous for a user to categorize research papers into different domains because extracting meaningful and relevant words from a research paper is a challenging task. To extract the important information, we use the following methods and classifiers. Bag of words and tf-idf are used to process the data. Preprocessing includes string tokenization and stop-word removal. The processed data is then classified using an SVM classifier. Because there are four predefined classes, the task is a multiclass one, so a one-versus-rest (1-v-r) classifier is used. The system achieves a performance of 88% with 800 training and 200 testing documents. The analysis shows that the model performs better when more training data is available. The aim of this work is to categorize documents and assign each of them a tag from a predefined set. It also evaluates the performance of the model with different percentages of the documents used for training and testing.
Keywords: Categorization, tf-idf, Bag of Words, SVM, One-Versus-Rest.
Scope of the Article: Computer Graphics, Simulation, and Modelling
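
The pipeline named in the abstract (tokenization, stop-word removal, bag of words with tf-idf weighting, and a one-versus-rest SVM over four classes) can be illustrated with a minimal sketch. This is not the authors' implementation. The stop-word list, the eight-document toy corpus, the class names, the smoothed IDF formula, the L2 normalisation, and the training hyperparameters are all assumptions chosen for illustration, and the linear SVM here is trained by plain subgradient descent on the hinge loss, without a bias term, to keep the example short.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not say which list was used.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "is", "are", "to", "with"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def fit_tfidf(docs):
    """Build a vocabulary index and smoothed IDF weights from training documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = [math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab]  # assumed smoothing
    return index, idf

def transform(doc, index, idf):
    """Bag-of-words counts scaled by IDF, then L2-normalised."""
    counts = Counter(t for t in tokenize(doc) if t in index)
    vec = [0.0] * len(idf)
    for term, count in counts.items():
        vec[index[term]] = count * idf[index[term]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_binary_svm(X, y, epochs=50, eta=0.1, lam=0.01):
    """Linear SVM without bias: hinge loss plus L2 penalty, subgradient descent.
    Labels in y must be +1 or -1. Hyperparameters are illustrative guesses."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * dot(w, xi)
            w = [wj * (1.0 - eta * lam) for wj in w]  # shrink (L2 penalty)
            if margin < 1.0:  # example violates the margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def train_one_vs_rest(X, labels):
    """One binary SVM per class: that class is +1, every other class is -1."""
    models = {}
    for c in sorted(set(labels)):
        y = [1 if lab == c else -1 for lab in labels]
        models[c] = train_binary_svm(X, y)
    return models

def predict(doc, index, idf, models):
    """Pick the class whose binary SVM gives the highest score."""
    x = transform(doc, index, idf)
    scores = {c: dot(w, x) for c, w in models.items()}
    return max(scores, key=scores.get)

# Toy corpus with four hypothetical classes (not the paper's data).
TRAIN = [
    ("neural network training gradient descent", "ml"),
    ("deep neural network image classification", "ml"),
    ("database query index transaction", "db"),
    ("relational database schema query optimization", "db"),
    ("router packet protocol latency", "net"),
    ("wireless protocol packet routing bandwidth", "net"),
    ("encryption cipher key attack", "sec"),
    ("malware attack detection encryption", "sec"),
]
docs = [d for d, _ in TRAIN]
labels = [c for _, c in TRAIN]
index, idf = fit_tfidf(docs)
X = [transform(d, index, idf) for d in docs]
models = train_one_vs_rest(X, labels)

print(predict("query optimization for a relational schema", index, idf, models))
print(predict("packet latency in the wireless protocol", index, idf, models))
```

With four classes, one-versus-rest trains four binary classifiers and assigns each document to the class with the highest score. A real experiment along the lines of the abstract would replace the toy corpus with the labelled papers and would most likely use an established SVM library rather than this hand-written trainer.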