Publication

Self-improving Scene Understanding with Vision-Language Knowledge Integration [Extended Abstract]

Aliki Anagnostopoulou; Hasan Md Tusfiqur Alam; Daniel Sonntag

MIND workshop at the IUI'25 Conference, March 2025.

Abstract

We propose an approach for personalised and contextualised image captioning. As pre-trained vision-language systems fail to capture details about the user’s intent, occasion, and other information related to the image, we envision a system that addresses these limitations. This approach has two key components for which we need to find suitable practical implementations: multimodal RAG and automatic prompt engineering. We outline our idea and review different possibilities to address these tasks.

Projects

No-IDLE - Interactive Deep Learning Enterprise
NoIDLEChatGPT - No-IDLE meets ChatGPT

ssuvlaki_extended_abstract.pdf (pdf, 496 KB)