Cross-Modal Deep Learning for Real-Time Threat Detection Using CV, NLP, and Cyber Analytics

Muhammad Nadeem; Muhammad Shahid; Maryam Israr; Hamid Ghous; Mubasher Hussain Malik

doi:10.5281/zenodo.20351651

Authors

Muhammad Nadeem, Department of Computer Science & Information Technology, University of South Punjab (USP), Multan, Punjab, Pakistan
Muhammad Shahid, Department of Computer Science & Information Technology, University of South Punjab (USP), Multan, Punjab, Pakistan
Maryam Israr, Department of Computer Science & Information Technology, University of South Punjab (USP), Multan, Punjab, Pakistan
Hamid Ghous, Department of Computer Science & Information Technology, University of South Punjab (USP), Multan, Punjab, Pakistan
Mubasher Hussain Malik, Department of Computer Science & Information Technology, University of South Punjab (USP), Multan, Punjab, Pakistan

Keywords:

Multimodal Deep Learning, Cybersecurity Analytics, Real-Time Threat Detection, Computer Vision, Natural Language Processing, Vision Transformer, Explainable AI, Threat Intelligence, Multimodal Fusion

Abstract

The increasing sophistication of cyber threats within modern digital ecosystems has exposed significant limitations in conventional silo-based cybersecurity systems. Traditional unimodal threat detection mechanisms often fail to correlate heterogeneous data sources such as surveillance imagery, phishing communications, and behavioral logs, leading to delayed response times, elevated false-positive rates, and reduced contextual awareness. To address these limitations, this study proposes a multimodal deep learning framework integrating computer vision, natural language processing (NLP), and structured cybersecurity analytics for enhanced real-time threat detection. The proposed architecture combines a Vision Transformer (ViT) for visual anomaly recognition, a BERT-based transformer for textual threat classification, and a Bi-LSTM network for behavioral log analysis. Outputs from individual modalities are fused using a Gated Multimodal Transformer (GMT) with cross-modal attention mechanisms to improve contextual understanding and threat classification accuracy. Experimental evaluation was conducted using benchmark datasets including UCF-Crime, VIRAT, phishing email corpora, and structured SIEM-generated logs. The multimodal fusion model achieved 92.3% precision, 89.7% recall, 90.9% F1-score, and 91.5% accuracy, significantly outperforming unimodal baseline models. SHAP-based explainability further enhanced model transparency by identifying influential visual, textual, and behavioral threat indicators. The findings demonstrate that multimodal deep learning architectures provide scalable, interpretable, and context-aware solutions for next-generation intelligent cybersecurity systems.
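
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' Gated Multimodal Transformer: the toy vectors stand in for ViT, BERT, and Bi-LSTM embeddings, the function names (`cross_modal_attention`, `gated_fusion`) and the `gate_weights` parameters are hypothetical, and a softmax gate is assumed here as one simple way to weight modalities. Each modality first attends over the other two, and the enriched embeddings are then combined by the gate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(query, contexts):
    """Scaled dot-product attention of one modality over the others."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in contexts])
    attended = [
        sum(w * c[i] for w, c in zip(weights, contexts))
        for i in range(len(query))
    ]
    return attended, weights


def gated_fusion(embeddings, gate_weights):
    """Fuse modality embeddings with a softmax gate (assumed form).

    embeddings:   dict of modality name -> vector
    gate_weights: dict of modality name -> vector (hypothetical learned
                  parameters, fixed by hand in this sketch)
    """
    names = list(embeddings)
    enriched = {}
    for name in names:
        others = [embeddings[n] for n in names if n != name]
        attended, _ = cross_modal_attention(embeddings[name], others)
        # Residual connection: keep the original signal, add cross-modal context.
        enriched[name] = [x + a for x, a in zip(embeddings[name], attended)]
    gates = softmax([dot(gate_weights[n], enriched[n]) for n in names])
    dim = len(enriched[names[0]])
    fused = [
        sum(g * enriched[n][i] for g, n in zip(gates, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, gates))


# Toy 4-dimensional embeddings; real encoders would emit far larger vectors.
embeddings = {
    "vision": [0.2, 0.9, -0.1, 0.4],
    "text": [0.8, 0.1, 0.3, -0.2],
    "logs": [-0.3, 0.5, 0.7, 0.1],
}
gate_weights = {
    "vision": [0.5, 0.5, 0.0, 0.0],
    "text": [0.0, 0.5, 0.5, 0.0],
    "logs": [0.0, 0.0, 0.5, 0.5],
}
fused, gates = gated_fusion(embeddings, gate_weights)
print("gate per modality:", gates)
print("fused embedding:", fused)
```

In a trained model the gate parameters would be learned jointly with the encoders, and the fused vector would feed a classification head. The gate values themselves give a coarse view of which modality dominated a given decision, which complements the SHAP-based attributions mentioned in the abstract.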

