Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning
Devraj1, Ravindra Nath2, Nikita Singh3, Vibhushit Katiyar4, Amber Srivastava5
1Dr. Devraj, Division of Social Science, ICAR-Indian Institute of Pulses Research, Kanpur (Uttar Pradesh), India.
2Dr. Ravindra Nath, Associate Professor, Department of Computer Science, Babasaheb Bhimrao Ambedkar Central University, Lucknow (Uttar Pradesh), India.
3Nikita Singh, Computer Centre, ICAR-Indian Institute of Pulses Research, Kanpur (Uttar Pradesh), India.
4Vibhushit Katiyar, B.Tech. (CS) Student, Department of Computer Science, Lovely Professional University, Jalandhar (Punjab), India.
5Amber Srivastava, B.Tech. (CS) Student, Department of Computer Science, Lovely Professional University, Jalandhar (Punjab), India.
Manuscript received on 27 January 2026 | First Revised Manuscript received on 06 February 2026 | Second Revised Manuscript received on 10 February 2026 | Manuscript Accepted on 15 February 2026 | Manuscript published on 28 February 2026 | PP: 1-6 | Volume-6 Issue-1, February 2026 | Retrieval Number: 100.1/ijsp.A101706010226 | DOI: 10.54105/ijsp.A1017.06010226
© The Authors. Published by Lattice Science Publication (LSP). This is an open-access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Speech Emotion Recognition (SER) has emerged as a significant research area within Human–Computer Interaction (HCI), enabling intelligent systems to interpret human emotional states from spoken audio. Accurate emotion recognition from speech plays a crucial role in enhancing natural interaction between humans and machines. This paper presents a deep learning–based SER framework that combines Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction with Long Short-Term Memory (LSTM) networks for temporal modelling and emotion classification. MFCC features effectively capture the spectral characteristics of speech signals, whereas LSTM networks are well-suited to modelling long-term temporal dependencies inherent in emotional speech patterns. The proposed model is trained and evaluated on the Toronto Emotional Speech Set (TESS) dataset, which covers multiple emotional categories, including happiness, sadness, anger, fear, and neutrality. Experimental results demonstrate that the proposed MFCC–LSTM approach achieves promising classification accuracy, indicating its effectiveness in recognising emotional states from speech signals. The findings highlight the potential applicability of the proposed system in real-world scenarios, including virtual assistants, call centre analytics, and mental health monitoring systems, thereby contributing to the development of emotion-aware intelligent interfaces.
Keywords: Speech Emotion Recognition, MFCC, LSTM, Deep Learning, TESS Dataset, Human–Computer Interaction, Audio Signal Processing.
Scope of the Article: Audio Signal Processing
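
For readers who want to reproduce the pipeline the abstract describes, the following is a minimal sketch of an MFCC–LSTM workflow using librosa and Keras. The specific hyperparameters (40 MFCC coefficients, a 200-frame padded sequence length, a single 128-unit LSTM layer) and the file-path handling are illustrative assumptions, not values reported in the paper; TESS provides seven emotion categories, which fixes the output dimension.

```python
# Minimal MFCC + LSTM sketch for speech emotion recognition (assumed
# hyperparameters; the paper does not specify this exact architecture).
import numpy as np
import librosa
from tensorflow.keras import layers, models

NUM_MFCC = 40     # assumed number of MFCC coefficients per frame
MAX_FRAMES = 200  # assumed fixed sequence length after padding/truncation
NUM_CLASSES = 7   # TESS defines seven emotion categories

def extract_mfcc(path: str) -> np.ndarray:
    """Load one utterance and return a (MAX_FRAMES, NUM_MFCC) MFCC matrix."""
    signal, sr = librosa.load(path, sr=None)  # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=NUM_MFCC).T  # (frames, coeffs)
    # Pad or truncate so every utterance yields a fixed-length sequence,
    # which the LSTM input layer requires.
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

def build_model() -> models.Model:
    """LSTM classifier over MFCC frame sequences (layer sizes assumed)."""
    model = models.Sequential([
        layers.Input(shape=(MAX_FRAMES, NUM_MFCC)),
        layers.LSTM(128),                 # models temporal dependencies across frames
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),              # regularisation against overfitting
        layers.Dense(NUM_CLASSES, activation="softmax"),  # emotion probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage: stack per-file MFCC matrices into X with shape
# (num_utterances, MAX_FRAMES, NUM_MFCC), encode labels as integers 0..6,
# then call build_model().fit(X, y, validation_split=0.2, epochs=50).
```

The design choice of padding every utterance to a common frame count is one common way to batch variable-length speech; masking layers or bucketing by length are alternatives the paper may or may not have used.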
