Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Sarala Padi; Omid Sadjadi; Ram D. Sriram; Dinesh Manocha

Published

August 5, 2021

Author(s)

Sarala Padi, Omid Sadjadi, Ram D. Sriram, Dinesh Manocha

Abstract

Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a residual network (ResNet) model, including a statistics pooling layer, pre-trained for speaker recognition on large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation, which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation technique that generates additional training samples by applying random time-frequency masks to log-mel spectrograms, in order to mitigate overfitting and improve the generalization of emotion recognition models. We evaluate the effectiveness of our proposed approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that the transfer learning and spectrogram augmentation approaches each improve SER performance and, when combined, achieve state-of-the-art results.
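
The two mechanisms named in the abstract, random time-frequency masking and statistics pooling, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the mask widths, the choice to fill masked cells with the global mean, and the use of a single mask per axis are all assumptions made for illustration. In the paper, statistics pooling operates on frame-level ResNet outputs; here it is applied directly to the spectrogram only to show how inputs of different lengths map to a vector of the same size.

```python
import numpy as np


def mask_spectrogram(log_mel, max_time_width, max_freq_width, rng):
    """Return a copy of log_mel (frames x mel bins) with one random
    time mask and one random frequency mask applied (hypothetical helper)."""
    num_frames, num_bins = log_mel.shape
    masked = log_mel.copy()
    fill = log_mel.mean()  # assumption: masked cells take the global mean

    # Time mask: overwrite a random run of consecutive frames.
    t_width = int(rng.integers(0, min(max_time_width, num_frames) + 1))
    t_start = int(rng.integers(0, num_frames - t_width + 1))
    masked[t_start:t_start + t_width, :] = fill

    # Frequency mask: overwrite a random band of adjacent mel bins.
    f_width = int(rng.integers(0, min(max_freq_width, num_bins) + 1))
    f_start = int(rng.integers(0, num_bins - f_width + 1))
    masked[:, f_start:f_start + f_width] = fill
    return masked


def statistics_pooling(frame_features):
    """Collapse (frames x dims) features into one fixed-length vector by
    concatenating the per-dimension mean and standard deviation over time."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])


rng = np.random.default_rng(0)
short_utt = rng.standard_normal((120, 40))  # stand-in for a 120-frame log-mel spectrogram
long_utt = rng.standard_normal((450, 40))   # stand-in for a 450-frame log-mel spectrogram

augmented = mask_spectrogram(short_utt, max_time_width=20, max_freq_width=8, rng=rng)
pooled_short = statistics_pooling(augmented)
pooled_long = statistics_pooling(long_utt)

# Both utterances pool to the same size regardless of their length.
print(augmented.shape, pooled_short.shape, pooled_long.shape)
```

Because the pooled vector has length 2 × (number of feature dimensions) regardless of how many frames the utterance contains, the layers that follow can use a fixed input size without truncating or padding the utterance.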

Proceedings Title

International Conference on Multimodal Interaction (ICMI ’21)

Conference Dates

October 18-22, 2021

Conference Location

Montreal, CA

Conference Title

ACM International Conference on Multimodal Interaction

Pub Type

Conferences

Keywords

IEMOCAP, speech emotion recognition (SER), transfer learning, spectrogram augmentation, ResNet

Image and signal processing, Human language technology and Artificial intelligence

Citation

Padi, S., Sadjadi, O., Sriram, R. and Manocha, D. (2021), Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation, International Conference on Multimodal Interaction (ICMI ’21), Montreal, CA, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=932172 (Accessed August 8, 2025)

Created August 5, 2021, Updated November 29, 2022
