
Publication

ART: Adaptive Relation Tuning for Generalized Relation Prediction

Gopika Sudhakaran; Hikaru Shindo; Patrick Schramowski; Simone Schaub-Meyer; Kristian Kersting; Stefan Roth
In: Computing Research Repository (CoRR), Vol. abs/2507.23543, Pages 1-17, 2025.

Abstract

Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it relies on handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction-tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification task, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.
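The abstract mentions converting VRD datasets into an instruction-tuning format. A minimal sketch of what such a conversion could look like is shown below; the field names, prompt wording, and function name are illustrative assumptions, not ART's actual implementation.

```python
# Hypothetical sketch of converting one VRD annotation (subject and object
# labels with bounding boxes, plus a ground-truth predicate) into an
# (instruction, response) pair for instruction tuning. All names and the
# prompt template are assumptions for illustration, not taken from ART.

def to_instruction_example(ann):
    """Turn a single relation annotation into an instruction-tuning record."""
    prompt = (
        f"In the image, the subject '{ann['subject']}' is at box "
        f"{ann['subject_box']} and the object '{ann['object']}' is at box "
        f"{ann['object_box']}. What is the relation (predicate) "
        f"between the subject and the object?"
    )
    # The model is trained to answer with the predicate label.
    return {"instruction": prompt, "response": ann["predicate"]}

example = to_instruction_example({
    "subject": "person", "subject_box": [10, 20, 110, 220],
    "object": "horse", "object_box": [90, 40, 300, 260],
    "predicate": "riding",
})
print(example["response"])
```

In the relation-classification setting described in the abstract, the boxes are given as input, so the prompt only asks the VLM for the predicate rather than for detection.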
