Publication
Self-improving Scene Understanding with Vision-Language Knowledge Integration [Extended Abstract]
Aliki Anagnostopoulou; Hasan Md Tusfiqur Alam; Daniel Sonntag
MIND workshop at the IUI'25 Conference, 3/2025.
Abstract
We propose an approach for personalised and contextualised image captioning. Because pre-trained vision-language systems fail to capture the user's intent, the occasion, and other contextual information related to the image, we envision a system that addresses these limitations. The approach has two key components, multimodal retrieval-augmented generation (RAG) and automatic prompt engineering, for which suitable practical implementations still need to be found. We outline our idea and review different options for addressing these tasks.
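
To make the two components more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that images and stored context items have already been embedded into a shared vector space; the toy vectors, the context texts, and the helper names `cosine`, `retrieve`, and `build_prompt` are all hypothetical. A fixed template stands in for automatic prompt engineering, which would instead search over or refine such templates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]


def build_prompt(context, user_intent):
    """Fill a fixed captioning template (placeholder for automatic prompt engineering)."""
    lines = [
        "Describe the image for this user.",
        f"User intent: {user_intent}",
        "Relevant context:",
    ]
    lines += [f"- {c}" for c in context]
    return "\n".join(lines)


# Invented toy data: hand-written 3-d vectors instead of real embeddings.
store = [
    {"vec": [1.0, 0.0, 0.0], "text": "Photo album: family birthday"},
    {"vec": [0.0, 1.0, 0.0], "text": "Work trip: conference poster session"},
    {"vec": [0.9, 0.1, 0.0], "text": "User prefers short, warm captions"},
]
image_vec = [1.0, 0.05, 0.0]  # hypothetical embedding of the image to caption

context = retrieve(image_vec, store, k=2)
prompt = build_prompt(context, "share with family")
print(prompt)
```

In a full system, the resulting prompt would be passed to a vision-language model together with the image; the choice of embedding model, vector store, and prompt optimisation method is left open here, as in the abstract.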