Publication

No Safe Dose: How Training Data Drives Unsafe Image Generation

Felix Friedrich; Lukas Henrik Helff; Niharika Hegde; Patrick Schramowski; Kristian Kersting

In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2605.28137, Pages 1-20, arXiv, 2026.

Abstract

Text-to-image models trained on large-scale data almost inevitably ingest unsafe content. While input-output amplification has been observed, it remains unclear whether and how training data composition directly drives model output safety, or whether other factors are responsible. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). We then generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% baseline at zero contamination, which data filtering alone cannot remove, implicates the other components, e.g. the frozen text encoder, as a residual safety risk. A text encoder ablation confirms this: SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, safety filtering causes no quality degradation in terms of FID, CLIPScore, and ImageReward. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety raises questions for future research about emerging capabilities and compositionality.
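
The controlled design described in the abstract (fixed dataset size, varying unsafe fraction, crossed with several dataset scales) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `mix_dataset` and `factorial_grid`, the pool representation, and the sampling scheme are all assumptions made for illustration only.

```python
import random


def mix_dataset(safe_pool, unsafe_pool, total_size, unsafe_fraction, seed=0):
    """Hypothetical helper: draw `total_size` items so that a fixed
    fraction of them comes from `unsafe_pool` and the rest from `safe_pool`."""
    if not 0.0 <= unsafe_fraction <= 1.0:
        raise ValueError("unsafe_fraction must lie in [0, 1]")
    n_unsafe = round(total_size * unsafe_fraction)
    n_safe = total_size - n_unsafe
    if n_unsafe > len(unsafe_pool) or n_safe > len(safe_pool):
        raise ValueError("pools are too small for the requested mix")
    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the mix.
    mixed = rng.sample(safe_pool, n_safe) + rng.sample(unsafe_pool, n_unsafe)
    rng.shuffle(mixed)
    return mixed


def factorial_grid(sizes, fractions):
    """Hypothetical helper: cross dataset sizes with unsafe fractions.
    Each cell records (size, fraction, absolute unsafe count), so that
    proportion and absolute count can be told apart across cells."""
    return [(size, frac, round(size * frac)) for size in sizes for frac in fractions]


if __name__ == "__main__":
    # Toy pools of labelled placeholders standing in for images.
    safe = [("safe", i) for i in range(1000)]
    unsafe = [("unsafe", i) for i in range(200)]
    subset = mix_dataset(safe, unsafe, total_size=500, unsafe_fraction=0.05)
    print(sum(1 for label, _ in subset if label == "unsafe"), len(subset))
    for cell in factorial_grid([100_000, 8_000_000], [0.0, 0.05]):
        print(cell)
```

Crossing sizes with fractions is what lets proportion be separated from absolute count: in this sketch, two cells with the same fraction but different sizes contain very different numbers of unsafe items, so comparing their outcomes indicates which quantity matters.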
