Publication
No Safe Dose: How Training Data Drives Unsafe Image Generation
Felix Friedrich; Lukas Henrik Helff; Niharika Hegde; Patrick Schramowski; Kristian Kersting
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2605.28137, Pages 1-20, arXiv, 2026.
Abstract
Text-to-image models trained on large-scale data inevitably ingest unsafe
content. While some studies have observed input-output amplification, it remains
unclear whether and how training data composition, rather than other factors,
directly drives model output safety. We shed light on this question by isolating this variable: we train
the same text-to-image model on datasets that differ only in their fraction of unsafe
images (0% to 9.6%), across several dataset scales (100K to 8M). Then we generate
images with the resulting models, and evaluate them with four independent safety
classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination
to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute
count, of unsafe training images is the operative variable. The 16.6% baseline
that persists at zero contamination implicates other components, e.g., the frozen
text encoder, as a residual safety risk. A text encoder ablation confirms this:
SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists
across all three encoders tested. Critically, no quality degradation in terms of FID,
CLIPScore, and ImageReward accompanies safety filtering. These results establish
that data curation and text encoder safety are complementary and independently
effective interventions. At the same time, the remaining level of unsafety raises
questions for future research about emergent capabilities and compositionality.
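
The experimental design described above (datasets that differ only in their fraction of unsafe images, crossed with dataset scale) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `mix_dataset` and `factorial_grid`, the ID pools, and the seed handling are all hypothetical, chosen only to show how proportion and absolute count can be varied independently.

```python
import random


def mix_dataset(safe_ids, unsafe_ids, total, unsafe_fraction, seed=0):
    """Sample `total` image ids so that `unsafe_fraction` of them are unsafe.

    Hypothetical helper: the pools and the sampling scheme are assumptions
    made for illustration, not taken from the paper.
    """
    rng = random.Random(seed)
    n_unsafe = round(total * unsafe_fraction)
    n_safe = total - n_unsafe
    if n_unsafe > len(unsafe_ids) or n_safe > len(safe_ids):
        raise ValueError("not enough images in one of the pools")
    # Sample without replacement from each pool, then shuffle the mixture.
    mix = rng.sample(safe_ids, n_safe) + rng.sample(unsafe_ids, n_unsafe)
    rng.shuffle(mix)
    return mix


def factorial_grid(totals, fractions):
    """List every (total, fraction) cell with its implied unsafe count.

    Crossing scale with fraction lets the same proportion appear with
    different absolute counts, which is what separates the two variables.
    """
    return [(t, f, round(t * f)) for t in totals for f in fractions]
```

In such a grid, a 5% cell at 100K images and a 5% cell at 1M images share a proportion but differ tenfold in the absolute number of unsafe images, so comparing their outputs indicates which of the two quantities matters.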
