Publication
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
Manuel Brack; Sudeep Katakol; Felix Friedrich; Patrick Schramowski; Hareesh Ravi; Kristian Kersting; Ajinkya Kale
In: Computing Research Repository (CoRR), Vol. abs/2506.16679, Pages 1-18, 2025.
Abstract
Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image captions are crucial to a model's performance. Given the noisiness and inconsistency of web-scraped datasets, recent works have shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, the current literature offers no insights into its design choices. This study closes that gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
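The randomized-length strategy mentioned above could be realized in several ways; the paper's exact procedure is not specified here. One minimal sketch, under the assumption that a dense synthetic caption is truncated to a random number of leading sentences at training time, might look like this (the function name and sentence-splitting heuristic are hypothetical):

```python
import random

def sample_caption(dense_caption: str, min_sents: int = 1) -> str:
    """Hypothetical sketch: truncate a dense synthetic caption
    to a randomized number of leading sentences, so the model
    sees captions of varying length during training."""
    # Naive sentence split on periods; a real pipeline would
    # likely use a proper sentence segmenter.
    sentences = [s.strip() for s in dense_caption.split(".") if s.strip()]
    k = random.randint(min_sents, len(sentences))
    return ". ".join(sentences[:k]) + "."

caption = ("A red bicycle leans against a brick wall. "
           "Morning light casts long shadows. "
           "A cat sits on the windowsill above.")
print(sample_caption(caption))
```

Each training step would then pair the image with a differently truncated caption, exposing the model to both short and dense descriptions of the same content.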
