RiskBERT: A Pre-trained Insurance-Based Language Model for Text Classification
Rida Ghafoor Hussain
Rida Ghafoor Hussain, Researcher, Department of Information Engineering, University of Florence, Siena, Italy.
Manuscript received on 19 April 2025 | First Revised Manuscript received on 27 April 2025 | Second Revised Manuscript received on 16 May 2025 | Manuscript Accepted on 15 June 2025 | Manuscript published on 30 June 2025 | PP: 12-18 | Volume-14 Issue-7, June 2025 | Retrieval Number: 100.1/ijitee.F109714060525 | DOI: 10.35940/ijitee.F1097.14070625
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The rapid growth of insurance-related documents has increased the need for efficient and accurate text classification techniques. Advances in natural language processing (NLP) and deep learning have enabled the extraction of valuable insights from textual data, particularly in specialised domains such as insurance, legal, and scientific documents. While Bidirectional Encoder Representations from Transformers (BERT) models have demonstrated state-of-the-art performance across various NLP tasks, their application to domain-specific corpora often yields suboptimal accuracy due to linguistic and contextual differences. In this study, I propose RiskBERT, a domain-specific language representation model pre-trained on insurance corpora. I further pre-trained LegalBERT on insurance-specific datasets to enhance its understanding of insurance-related texts. The resulting model, RiskBERT, was then evaluated on downstream clause and provision classification tasks using two benchmark datasets: LEDGAR and Unfair ToS. I conducted a comparative analysis against BERT-Base and LegalBERT to assess the impact of domain-specific pre-training on classification performance. The findings demonstrate that pre-training on insurance-specific corpora substantially improves the model's ability to analyse complex insurance texts. The experimental results show that RiskBERT significantly outperforms LegalBERT and BERT-Base, achieving accuracies of 96.8% on LEDGAR and 92.1% on Unfair ToS. These findings highlight the effectiveness of domain-adaptive pre-training and underscore the importance of specialised language models, making RiskBERT a valuable tool for insurance document processing.
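To illustrate the domain-adaptive pre-training step described above, the following is a minimal sketch only, assuming the HuggingFace Transformers and Datasets libraries and the publicly available nlpaueb/legal-bert-base-uncased checkpoint as the LegalBERT starting point; the insurance corpus, output paths, and hyperparameters shown are placeholders, not the paper's actual training setup.

# Minimal sketch of domain-adaptive (continued) masked-language-model pre-training.
# Assumes HuggingFace Transformers/Datasets; the corpus below is a placeholder.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from LegalBERT; RiskBERT is obtained by further pre-training it on insurance text.
checkpoint = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder insurance corpus; in practice, a large collection of policy and claims text.
corpus = Dataset.from_dict({"text": [
    "The insurer shall indemnify the policyholder against losses arising from fire damage.",
    "Coverage under this endorsement excludes claims resulting from wilful misconduct.",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard dynamic masking (15% of tokens) for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="riskbert-pretraining",   # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
model.save_pretrained("riskbert")        # adapted encoder, ready for downstream fine-tuning
tokenizer.save_pretrained("riskbert")

The saved encoder would then be reloaded with AutoModelForSequenceClassification and fine-tuned on the LEDGAR and Unfair ToS clause and provision classification tasks evaluated in the paper.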
Keywords: Clause, Domain-Specific, Insurance, Legal, Pre-Training
Scope of the Article: Information Technology