Publication

Exploring Foundation Model Fusion Effectiveness and Explainability for Stylistic Analysis of Emotional Podcast Data

Arnab Das; Carlos Franzreb; Tim Polzehl; Sebastian Möller

In: Advances in Information and Communication. Future of Information and Communication Conference (FICC-2025), March 4-5, Berlin, Germany, Springer Nature, Switzerland, 2025.

Abstract

Emotion recognition is a crucial research field for advancing affective computing. Automatic prediction with deep learning models performs poorly when predicting valence/polarity for spoken utterances. In this paper, we investigate the effectiveness of emotion representations from a recent weakly supervised, multilingual, large automatic speech recognition (ASR) model, along with two other self-supervised, pre-trained, general-purpose foundation models, for dimensional emotion recognition from speech. We also propose a fusion architecture and demonstrate that the proposed method achieves significantly better results than a state-of-the-art baseline. Moreover, we train our model with an additional pairwise rank loss to further improve prediction reliability. We further attempt to explain the prediction results using post-hoc occlusion methods, demonstrating a strong relationship between the contextual construct of language and valence/polarity. Finally, we perform a comprehensive exploration of the data and labels and identify instances of verbal irony as a cause of individual prediction failures.

Projects

Medinym - AI-based anonymization of personal patient data in clinical text and speech data collections

Further Links

https://link.springer.com/chapter/10.1007/978-3-031-84457-7_23

SAI_PodcastEmotionReco_CameraReady_(1).pdf (pdf, 495 KB)