Multilingual Generative AI Framework for Urdu and Regional Language Understanding Using Large Language Models
DOI:
https://doi.org/10.5281/zenodo.20434861
Keywords:
Artificial Intelligence; Large Language Models; Urdu NLP; AI-Generated Text Detection; Multilingual Transformers; mDeBERTa; XLM-RoBERTa; DistilBERT; Low-Resource Languages; Natural Language Processing
Abstract
The rapid advancement of Large Language Models (LLMs) has greatly improved automated text generation while making it harder to distinguish AI-generated content from human-written text. The problem is especially acute for low-resource languages such as Urdu, for which reliable AI-content detection systems remain scarce. To address this gap, this study proposes a multilingual AI-generated text detection framework designed for Urdu. A balanced benchmark dataset of 1,800 human-authored and 1,800 AI-generated Urdu texts was developed using outputs from GPT-4o mini, Gemini, and Kimi AI. Linguistic and statistical analyses were performed on features such as vocabulary richness, Type-Token Ratio (TTR), character diversity, sentence variability, and n-gram patterns, with statistical significance assessed using t-tests and Mann–Whitney U tests. Three multilingual transformer architectures, namely mDeBERTa-v3-base, DistilBERT-multilingual, and XLM-RoBERTa-base, were fine-tuned and evaluated on the proposed dataset. mDeBERTa-v3-base achieved the best performance, with an F1-score of 91.29% and an accuracy of 91.26% on the test set. The findings indicate that multilingual transformer models are effective for detecting AI-generated Urdu text and highlight their potential for supporting academic integrity, misinformation prevention, and trustworthy NLP applications in underrepresented language communities.
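The lexical features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenization, the definition of character diversity as unique characters over total non-whitespace characters, and the function names `type_token_ratio`, `char_diversity`, and `word_ngrams` are all assumptions made for illustration.

```python
from collections import Counter


def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (whitespace tokenization is assumed)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def char_diversity(text: str) -> float:
    """Unique characters divided by total characters, ignoring whitespace.

    This definition is an assumption; the abstract does not define the feature.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return len(set(chars)) / len(chars)


def word_ngrams(tokens: list[str], n: int) -> Counter:
    """Count contiguous word n-grams, each represented as a tuple of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    # Illustrative Urdu sentence (not taken from the dataset): 8 tokens, 5 distinct.
    sample = "یہ ایک کتاب ہے یہ ایک قلم ہے"
    print(type_token_ratio(sample))
    print(char_diversity(sample))
    print(word_ngrams(sample.split(), 2).most_common(3))
```

Computed per text for the human-authored and AI-generated groups, values like these would form the two samples compared with a t-test or Mann–Whitney U test. A production pipeline would likely replace the whitespace split with an Urdu-aware tokenizer.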
Data Availability Statement
Data are available upon reasonable request from the corresponding author.
License
Copyright (c) 2026 Shaista Jabeen, Maria Jafar, Usman Shafeeq, Palwasha Urooj Ch, Ghulam Muhy Ud Deen Raee, Hamid Ghous, Mubasher Hussain Malik (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
Articles published in the NextGen AI & Computing Journal (NAC) are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits anyone to copy, redistribute, remix, transmit, and adapt the work, even commercially, provided the original work and source are appropriately cited. Under this license, authors retain full copyright of their research while granting the journal the right of first publication.
