The Unprejudiced Stemmer to Prevent Etymological Behavior of Stemmed Morphemes Of Social Media Corpora
Akula V.S. Siva Rama Rao1, Ranjana P.2
1Akula V.S. Siva Rama Rao*, Research Scholar, Department of Computer Science and Engineering, Hindustan Institute of Technology and Science, Chennai, India; Associate Professor, Dept. of CSE, SITE, Tadepalligudem, AP, India.
2Ranjana P., Professor, Department of Computer Science and Engineering, Hindustan Institute of Technology and Science, Chennai, India.
Manuscript received on November 17, 2019. | Revised Manuscript received on 28 November, 2019. | Manuscript published on December 10, 2019. | PP: 3718-3724 | Volume-9 Issue-2, December 2019. | Retrieval Number: B6665129219/2019©BEIESP | DOI: 10.35940/ijitee.B6665.129219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Sentiment analysis is an application of Natural Language Processing that analyzes social media corpora to extract insights from them. Sentiment analysis results reflect real customer feedback, which enables organizations and companies to make appropriate decisions about their products and business policies. Stemming, one of the phases of preprocessing social media corpora, plays an inevitable and vital role in sentiment analysis. Today most researchers use strong stemmers to identify the stem words of social media corpora. The most popular stemming algorithms, such as the Lancaster and Porter algorithms, distort the meaning of words. The resulting over-stemmed words mislead the sentiment classification process. To prevent over-stemming, a lighter stemming algorithm, the Unprejudiced stemmer, is proposed to preserve the meaning of the stemmed words. The proposed Unprejudiced algorithm uses the lexical database and part-of-speech tagging of the Python Natural Language Toolkit (NLTK). Of the few methods available for evaluating stemming accuracy, this paper focuses on Paice's error rate relative to truncation (ERRT) measure to evaluate the accuracy of the Lancaster, Porter and Unprejudiced stemming algorithms. The experiments were conducted on 25,758 source words, and the results were evaluated using the Paice stem evaluation method and the Sirsat method. The Paice ERRT values of 0.47209, 0.28703 and 0.15502 for Lancaster, Porter and Unprejudiced respectively (lower is better) show that the Unprejudiced stemmer is more accurate than Lancaster and Porter. The Average Words Conflation Factor (AWCF) values from Sirsat's stem evaluation method, 10310.31, 14031.17 and 23349.87 for Lancaster, Porter and Unprejudiced respectively, likewise show that the Unprejudiced stemming algorithm is more accurate than the Lancaster and Porter stemming algorithms.
Keywords: Sentiment Analysis, Social Media Corpora, Pre-processing, Etymology, Natural Language Processing, Stem Weight, Error-rate relative to truncation.
Scope of the Article: Natural Language Processing
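The Paice evaluation named in the abstract is built on two quantities: an under-stemming index (UI) and an over-stemming index (OI), both computed over hand-built groups of words that ought to share a stem. ERRT is then derived from a stemmer's (UI, OI) point relative to a truncation baseline. The following is a minimal sketch of the UI/OI step only, not the authors' implementation: the function name `paice_indices`, the toy word groups, and the fixed-length truncation stemmers are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def paice_indices(groups, stem):
    """Return (UI, OI) for `stem` over concept groups (illustrative sketch).

    groups: list of word lists; words in one list should conflate together.
    stem:   any callable mapping a word to its stem.
    """
    total_words = sum(len(g) for g in groups)
    gdmt = gdnt = gumt = 0.0                 # global desired/unachieved totals
    stem_to_groups = defaultdict(Counter)    # stem -> {group index: word count}

    for gi, group in enumerate(groups):
        n = len(group)
        gdmt += 0.5 * n * (n - 1)            # pairs that should merge
        gdnt += 0.5 * n * (total_words - n)  # pairs that should stay apart
        stem_counts = Counter(stem(w) for w in group)
        # Pairs inside the group that ended up with different stems.
        gumt += 0.5 * sum(u * (n - u) for u in stem_counts.values())
        for s, u in stem_counts.items():
            stem_to_groups[s][gi] += u

    gwmt = 0.0                               # global wrongly-merged total
    for counts in stem_to_groups.values():
        ns = sum(counts.values())
        # Pairs sharing this stem that come from different concept groups.
        gwmt += 0.5 * sum(v * (ns - v) for v in counts.values())

    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    return ui, oi

# Hypothetical toy data: three concept groups.
groups = [
    ["connect", "connected", "connecting", "connection"],
    ["universe", "universal"],
    ["university", "universities"],
]

heavy = paice_indices(groups, lambda w: w[:5])  # aggressive truncation
light = paice_indices(groups, lambda w: w[:9])  # mild truncation
print("5-char truncation (UI, OI):", heavy)
print("9-char truncation (UI, OI):", light)
```

On this toy data the 5-character truncation merges "universe" with "university", so it has a non-zero OI and a zero UI, while the 9-character truncation shows the opposite pattern. This is the over-stemming versus under-stemming trade-off that ERRT summarizes in a single number.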