Benchmarking Classical, Deep Learning, and Transformer Models for Hindi Speech Emotion Recognition: A Multimodal Analysis
Abstract
Speech Emotion Recognition (SER) is critical to building intelligent devices and systems that are useful and aware of the user's perspective. While SER has been studied extensively for English and European languages, comparable research does not exist for Hindi, particularly in applying transformer architectures. This paper presents an extensive comparative analysis of classical machine-learning models, deep-learning architectures, and transformer-based networks for Hindi SER within a single evaluation framework. A Hindi emotional speech dataset was created and prepared through pre-processing, acoustic processing, and feature extraction, in both Mel-spectrogram and raw-waveform formats. The following models were trained and evaluated under uniform training-validation-testing configurations: classical machine-learning models (SVM, Random Forest, Gradient Boosting), deep-learning models (Convolutional Neural Network (CNN), CNN-Bi-LSTM, attention-enhanced networks), and transformer models (e.g. Wav2Vec2.0, HuBERT, Vision Transformer (ViT), Swin Transformer (Swin-T)). The experiments show a consistent progression in performance across the model families: the transformer models achieve the highest accuracy (93.4%) and macro-F1 score, followed by the deep-learning models and then the classical models. Error analyses show that transformer-generated embeddings improve the separation of subtle emotions (e.g. sadness and fear). This paper provides a solid empirical and methodological foundation for future Hindi SER research and highlights major opportunities for lightweight deployment of Hindi SER systems and for multimodal systems.
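The abstract reports accuracy and macro-F1 as its headline metrics. As a reminder of what those numbers measure, the following is a minimal sketch in plain Python: macro-F1 is the unweighted mean of the per-class F1 scores, so a rarely occurring emotion counts as much as a frequent one. The function name `accuracy_and_macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper or its dataset.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-F1) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    # Per-class F1 = 2*TP / (2*TP + FP + FN), averaged without class weights.
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_f1


# Toy labels invented for illustration only (not data from the paper).
y_true = ["sad", "sad", "fear", "fear", "happy", "happy"]
y_pred = ["sad", "fear", "fear", "fear", "happy", "sad"]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, macro-F1={macro_f1:.3f}")
```

In this toy case the per-class F1 scores are 0.5 (sad), 0.8 (fear), and about 0.667 (happy), which shows how confusion between similar emotions pulls macro-F1 down even when overall accuracy looks comparable.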
This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Interdisciplinary Journal of AI, Machine Learning & Data Science (IJAIMLDS) are licensed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
This license allows others to share, copy, distribute, and adapt the work, provided that proper credit is given to the original author(s) and the source.
Authors retain copyright and grant Interdisciplinary Journal of AI, Machine Learning & Data Science (IJAIMLDS) the right of first publication.