Recognition of Hindi and Bengali Handwritten and Typed Text from Images using Tesseract on Android Platform
Shubhendu Banerjee1, Sumit Kumar Singh2, Atanu Das3, Rajib Bag4
1Shubhendu Banerjee*, Department of CSE, Narula Institute of Technology, India.
2Sumit Kumar Singh, Department of CSE, Narula Institute of Technology, India.
3Atanu Das, Department of CSE, Netaji Subhash Engineering College, India.
4Rajib Bag, Department of CSE, Supreme Knowledge Foundation Group of Institutions, India.
Manuscript received on October 15, 2019. | Revised manuscript received on October 24, 2019. | Manuscript published on November 10, 2019. | pp. 3507-3516 | Volume-9 Issue-1, November 2019. | Retrieval Number: A5252119119/2019©BEIESP | DOI: 10.35940/ijitee.A5252.119119
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Digitization has transformed data conversion, storage, and sharing by turning non-editable typographic and handwritten text into editable electronic text. Although Optical Character Recognition (OCR) has been applied to many languages worldwide, satisfactory output has been observed for only a few of them. This paper takes a step toward digitizing two of the most widely spoken languages of the Indian subcontinent, Hindi and Bengali, using Google's open-source OCR engine, Tesseract. The scripts of these two languages, both of Brahmi origin, pose their own challenges owing to their varied traits of character segmentation and word formation. Here, Tesseract is trained on data sets of Hindi and Bengali typographic and handwritten characters, and the training is combined with a distinctive pre-processing stage comprising input image customization and image augmentation. This stage significantly enhances image quality, allowing Tesseract to produce more accurate results, especially for handwritten text and obscure images. The application also provides English translation and text-to-speech conversion, which make it useful to non-native readers and visually impaired users. The central aim of this work is to reach a wider audience by enabling digitization on the Android platform. A comparative analysis on three categories of input (images with typographic text, images with handwritten text, and inferior-quality images) shows that the proposed application, to a certain extent, produces superior output in at least two of these cases compared with the most consistent Android application currently available.
Keywords: Android, Handwriting Recognition, Optical Character Recognition, Pattern Recognition.
Scope of the Article: Pattern Recognition