Publication
ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models
Ho Minh Duy Nguyen; Nghiem T. Diep; Trung Nguyen; Hoang-Bao Le; Tai Nguyen; Tien Nguyen; TrungTin Nguyen; Nhat Ho; Pengtao Xie; Roger Wattenhofer; James Zou; Daniel Sonntag; Mathias Niepert
In: Advances in Neural Information Processing Systems, The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), December 2-12, 2025, USA.
Abstract
State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLAVA-MED and BIOMEDGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce EXGRA-MED, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, EXGRA-MED matches LLAVA-MED's performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BIOMEDGPT and RADFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.
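Since the abstract only sketches the alignment objective, the following is a minimal illustrative sketch, not the authors' implementation, of the general idea: embeddings of images, instruction responses, and extended captions are pulled together pairwise in a shared latent space via a symmetric contrastive (InfoNCE-style) loss. All function names, dimensions, and the temperature value are illustrative assumptions; the paper's actual multi-graph alignment objective may differ.

```python
# Illustrative sketch of pairwise latent alignment over three modalities.
# This is an assumption-laden toy, not the EXGRA-MED objective itself.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings; matched pairs
    share the same row index, so targets lie on the diagonal."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def multi_modal_alignment_loss(img_z, resp_z, cap_z):
    """Sum pairwise contrastive terms so that image, instruction-response,
    and extended-caption embeddings co-locate in the latent space."""
    return (info_nce(img_z, resp_z)
            + info_nce(img_z, cap_z)
            + info_nce(resp_z, cap_z))


# Toy usage: random features stand in for encoder outputs.
B, D = 8, 256
img_z, resp_z, cap_z = (torch.randn(B, D) for _ in range(3))
print(multi_modal_alignment_loss(img_z, resp_z, cap_z).item())
```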
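The abstract also mentions black-box gradient estimation for end-to-end training. As a hedged sketch of what such an estimator can look like in general, below is a simultaneous-perturbation (SPSA-style) zeroth-order gradient estimate, which needs only two function evaluations per step; the estimator actually used in the paper is not specified here, and all names are illustrative.

```python
# SPSA-style zeroth-order gradient estimate: approximate grad f(theta) for a
# black-box scalar objective f using a random Rademacher perturbation.
# Illustrative only; not claimed to be the EXGRA-MED training scheme.
import torch


def spsa_gradient(f, theta: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Two-point gradient estimate: with Rademacher delta (+/-1 entries),
    (f(theta + eps*delta) - f(theta - eps*delta)) / (2*eps) * delta."""
    delta = (torch.randint(0, 2, theta.shape) * 2 - 1).to(theta.dtype)
    f_plus = f(theta + eps * delta)
    f_minus = f(theta - eps * delta)
    return (f_plus - f_minus) / (2 * eps) * delta


# Toy usage: for f(t) = sum(t^2), the true gradient is 2*t, and the
# estimate matches it in expectation.
f = lambda t: (t ** 2).sum()
theta = torch.randn(5)
print(spsa_gradient(f, theta))
print(2 * theta)
```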
Projects
- MASTER - Mixed reality ecosystem for teaching robotics in manufacturing
- No-IDLE - Interactive Deep Learning Enterprise
- NoIDLEChatGPT - No-IDLE meets ChatGPT
