GMM-UBM Based Modeling for Language Identification using New Feature Vectors
A. Nagesh1, M. Sadanandam2

1Dr. A. Nagesh*, Professor, Department of CSE, MGIT, Hyderabad, India.
2Dr. M. Sadanandam, Assistant Professor & BOS, Department of Computer Science & Engineering at Kakatiya University, Warangal, India.
Manuscript received on January 16, 2020. | Revised Manuscript received on January 22, 2020. | Manuscript published on February 10, 2020. | PP: 3034-3039 | Volume-9 Issue-4, February 2020. | Retrieval Number: D1919029420/2020©BEIESP | DOI: 10.35940/ijitee.D1919.029420
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Most existing language identification (LID) systems are based on the Gaussian Mixture Model (GMM). The main drawback of a GMM-based LID system is that it requires a large amount of speech data to train the model. Most Indian languages are similar to one another because they are derived from Devanagari. Even though common phonemes exist in the phoneme sets across Indian languages, each language imposes its own unique phonotactic constraints. The ability of a modeling technique to capture these slight language-specific variations is an important cue for language identification. A GMM-based LID system that captures these variations requires a large number of mixture components, and training that many components requires a large amount of training data for each language class, which is very difficult to obtain for Indian languages. The main advantage of a GMM-UBM based LID system is that it requires less training data to model each language. In this paper, the importance of GMM-UBM modeling for the LID task in Indian languages is explored using a new set of feature vectors. In the GMM-UBM LID system based on the new feature vectors, the phonotactic variations imparted by different Indian languages are modeled using the Gaussian Mixture Model and Universal Background Model (GMM-UBM) technique. In this type of modeling, some amount of data from each language class is pooled to create a universal background model (UBM), and the model for each language class is then adapted from this UBM. The study finds that the GMM-UBM based LID system using the new feature vectors outperforms the conventional GMM-based LID system using the same feature vectors.
Keywords: Universal Background Model (UBM), Gaussian Mixture Model (GMM), Language Identification (LID)
Scope of the Article: Natural Language Processing
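
The adaptation step described in the abstract (pool data into a UBM, then derive each language model from it) can be illustrated with a minimal sketch. The abstract does not state which adaptation rule the authors use; the code below assumes the commonly used mean-only MAP adaptation with a relevance factor, a diagonal-covariance UBM that has already been trained, and toy one-dimensional frames in place of the paper's new feature vectors. All function names, the relevance value of 16, and the data are hypothetical.

```python
import math

def component_logs(x, weights, means, variances):
    """Log of (weight * diagonal Gaussian density) for each component at frame x."""
    logs = []
    for w, mean, var in zip(weights, means, variances):
        ll = sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
                 for xi, m, v in zip(x, mean, var))
        logs.append(math.log(w) + ll)
    return logs

def posteriors(x, weights, means, variances):
    """Responsibility of each mixture component for frame x."""
    logs = component_logs(x, weights, means, variances)
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation (assumed rule) of UBM means toward one language."""
    k, d = len(means), len(means[0])
    n = [0.0] * k                      # soft frame counts per component
    s = [[0.0] * d for _ in range(k)]  # posterior-weighted sums of frames
    for x in frames:
        for c, p in enumerate(posteriors(x, weights, means, variances)):
            n[c] += p
            for j in range(d):
                s[c][j] += p * x[j]
    adapted = []
    for c in range(k):
        alpha = n[c] / (n[c] + relevance)  # more data -> trust the data more
        ex = [s[c][j] / n[c] for j in range(d)] if n[c] > 0 else means[c]
        adapted.append([alpha * ex[j] + (1.0 - alpha) * means[c][j]
                        for j in range(d)])
    return adapted

def avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames under a GMM."""
    total = 0.0
    for x in frames:
        logs = component_logs(x, weights, means, variances)
        top = max(logs)
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total / len(frames)

# Hypothetical pre-trained UBM: 2 components, 1-dimensional features.
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0], [4.0]]
ubm_vars = [[1.0], [1.0]]

# Toy adaptation data for two hypothetical languages.
train = {
    "A": [[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]],
    "B": [[-1.0], [-1.2], [-0.8], [3.0], [3.2], [2.8]],
}
# Each language model shares the UBM weights and variances; only means differ.
models = {lang: map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars)
          for lang, frames in train.items()}

def identify(frames):
    """Pick the language whose adapted model gives the highest likelihood."""
    return max(models, key=lambda lang: avg_log_likelihood(
        frames, ubm_weights, models[lang], ubm_vars))

print(identify([[0.9], [5.1]]))
```

Because each language model starts from the shared UBM and only shifts its means in proportion to the amount of data seen, a small amount of per-language data can still yield a usable model, which is the data-efficiency argument the abstract makes.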