Publication
Exploring Foundation Model Fusion Effectiveness and Explainability for Stylistic Analysis of Emotional Podcast Data
Arnab Das; Carlos Franzreb; Tim Polzehl; Sebastian Möller
In: Advances in Information and Communication. Future of Information and Communication Conference (FICC-2025), March 4-5, Berlin, Germany, Springer Nature, Switzerland, 2025.
Abstract
Emotion recognition is a crucial research field for advancing affective computing. Deep learning models for automatic prediction still perform poorly when predicting valence/polarity for spoken utterances. In this paper, we investigate the effectiveness of emotion representations from a recent weakly supervised, multilingual, large automatic speech recognition (ASR) model, along with two self-supervised, pre-trained, general-purpose foundation models, for dimensional emotion recognition from speech. We also propose a fusion architecture and demonstrate that it achieves significantly better results than a state-of-the-art baseline. Moreover, we train our model with an additional pairwise rank loss to further improve prediction reliability. We further attempt to explain the prediction results using post-hoc occlusion methods, which demonstrate a strong relationship between the contextual construct of language and valence/polarity. Finally, we perform a comprehensive exploration of the data and labels and identify instances of verbal irony as a cause of individual prediction failures.
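The abstract does not state the exact form of the pairwise rank loss. As a rough illustration only, the sketch below shows one common margin-based (hinge) variant: for every pair of utterances whose labels differ, the prediction for the higher-labelled utterance is penalised if it does not exceed the other by at least a margin. The function name `pairwise_rank_loss`, the `margin` parameter, and the hinge form are assumptions made for this example, not the authors' formulation.

```python
from itertools import combinations


def pairwise_rank_loss(preds, labels, margin=0.0):
    """Mean hinge penalty over all pairs whose labels differ.

    Hypothetical helper for illustration; not taken from the paper.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(preds)), 2):
        if labels[i] == labels[j]:
            continue  # tied labels carry no ordering information
        # sign is +1 when item i should rank above item j, else -1
        sign = 1.0 if labels[i] > labels[j] else -1.0
        total += max(0.0, margin - sign * (preds[i] - preds[j]))
        count += 1
    # With no informative pairs, define the loss as zero.
    return total / count if count else 0.0
```

In such a setup the term would typically be added to a regression objective, so that the model is rewarded for ordering utterances correctly by valence as well as for matching the absolute scores.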