Affective Image Captioning: Extraction and Semantic Arrangement of Image Information with Deep Neural Networks

Tushar Karayil

PhD-Thesis TU Kaiserslautern kluedo 6/2020.


In recent years, the Internet has become a major source of visual information exchange. Popular social platforms have reported an average of 80 million photo uploads a day. These images, are often accompanied with a user provided text one-liner, called an image caption. Deep Learning techniques have made significant advances towards automatic generation of factual image captions. However, captions generated by humans are much more than mere factual image descriptions. This work takes a step towards enhancing a machine's ability to generate image captions with human-like properties. We name this field as Affective Image Captioning, to differentiate it from the other areas of research focused on generating factual descriptions. To deepen our understanding of human generated captions, we first perform a large-scale Crowd-Sourcing study on a subset of Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M). Three thousand random image-caption pairs were evaluated by native English speakers w.r.t different dimensions like focus, intent, emotion, meaning, and visibility. Our findings indicate three important underlying properties of human captions: subjectivity, sentiment, and variability. Based on these results, we develop Deep Learning models to address each of these dimensions. To address the subjectivity dimension, we propose the Focus-Aspect-Value (FAV) model (along with a new task of aspect-detection) to structure the process of capturing subjectivity. We also introduce a novel dataset, aspects-DB, following this way of modeling. To implement the model, we propose a novel architecture called Tensor Fusion. Our experiments show that Tensor Fusion outperforms the state-of-the-art cross residual networks (XResNet) in aspect-detection. Towards the sentiment dimension, we propose two models:Concept & Syntax Transition Network (CAST) and Show & Tell with Emotions (STEM). The CAST model uses a graphical structure to generate sentiment. The STEM model uses a neural network to inject adjectives into a neutral caption. Achieving a high score of 93% with human evaluation, these models were selected as the top-3 at the ACMMM Grand Challenge 2016. To address the last dimension, variability, we take a generative approach called Generative Adversarial Networks (GAN) along with multimodal fusion. Our modified GAN, with two discriminators, is trained using Reinforcement Learning. We also show that it is possible to control the properties of the generated caption-variations with an external signal. Using sentiment as the external signal, we show that we can easily outperform state-of-the-art sentiment caption models.


Weitere Links

Dissertation_Tushar_Karayil_2020.pdf (pdf, 40 MB )

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence