
Publication

How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions

Manuel Brack; Sudeep Katakol; Felix Friedrich; Patrick Schramowski; Hareesh Ravi; Kristian Kersting; Ajinkya Kale
In: Computing Research Repository (CoRR), Vol. abs/2506.16679, Pages 1-18, 2025.

Abstract

Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image text are crucial to a model's performance. Given the noisiness and inconsistency in web-scraped datasets, recent works shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
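The randomized-length strategy mentioned above could be sketched as a simple data-preprocessing step. The snippet below is a minimal, hypothetical illustration (the function name and the sentence-level truncation heuristic are assumptions, not the authors' exact method): a dense synthetic caption is split into sentences and a randomly sized prefix is kept, so the model sees captions of varying length during training.

```python
import random


def sample_caption_length(dense_caption: str, rng: random.Random) -> str:
    """Keep a random sentence-prefix of a dense synthetic caption.

    This is an illustrative sketch of 'randomized caption lengths',
    not the exact procedure used in the paper.
    """
    # Naive sentence split on periods; a real pipeline would use a
    # proper sentence segmenter.
    sentences = [s.strip() for s in dense_caption.split(".") if s.strip()]
    # Choose how many sentences to keep, from 1 up to all of them.
    k = rng.randint(1, len(sentences))
    return ". ".join(sentences[:k]) + "."


if __name__ == "__main__":
    caption = (
        "A golden retriever runs on a beach. The sky is overcast. "
        "Waves crash in the background. The dog wears a red collar."
    )
    rng = random.Random(0)
    print(sample_caption_length(caption, rng))
```

Applied per sample each epoch, such a scheme exposes the model to both short and dense captions, which is the distributional variety the abstract credits with balancing aesthetics and alignment.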
