Publication
SLR: Automated Synthesis for Scalable Logical Reasoning
Lukas Henrik Helff; Ahmad Omar; Felix Friedrich; Antonia Wüst; Hikaru Shindo; Rupert Mitchell; Tim Woydt; Patrick Schramowski; Wolfgang Stammer; Kristian Kersting
In: Maria Liakata; Viviane P. Moreira; Jiajun Zhang; David Jurgens (Eds.). Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2026, San Diego, California, United States, July 2-7, 2026. Computational Linguistics Applications (CLA), Pages 402-426, Association for Computational Linguistics, 2026.
Abstract
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-BENCH, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-BENCH, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
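To illustrate what a validation program that yields verifiable rewards might look like, here is a minimal Python sketch. It is not the SLR implementation: the abstract does not say how rules or examples are represented, so the predicate-based `Rule` type, the `validate` function, the feature names `cars` and `load`, and the example rule are all hypothetical choices made for this illustration. The sketch scores a candidate rule by the fraction of labeled examples it classifies in agreement with a latent ground-truth rule.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: an example is a dict of integer features, and a rule is a
# Python predicate over such a dict. SLR's actual representation may differ.
Example = Dict[str, int]
Rule = Callable[[Example], bool]


def validate(candidate: Rule, examples: List[Tuple[Example, bool]]) -> float:
    """Return the fraction of labeled examples the candidate gets right."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = 0
    for features, label in examples:
        try:
            prediction = bool(candidate(features))
        except Exception:
            prediction = None  # a rule that crashes counts as wrong
        if prediction == label:
            correct += 1
    return correct / len(examples)


# Hypothetical latent ground-truth rule, used only to label the examples.
def ground_truth(ex: Example) -> bool:
    return ex["cars"] > 2 and ex["load"] % 2 == 0


examples = [
    (ex, ground_truth(ex))
    for ex in (
        {"cars": 3, "load": 4},
        {"cars": 1, "load": 4},
        {"cars": 5, "load": 3},
        {"cars": 4, "load": 6},
    )
]


# A candidate that captures only part of the latent rule.
def partial_rule(ex: Example) -> bool:
    return ex["cars"] > 2


reward_exact = validate(ground_truth, examples)
reward_partial = validate(partial_rule, examples)
```

Under these assumptions the reward depends only on the candidate's behavior on the labeled examples, so it can be checked automatically without human annotation. A rule that raises an error on an example is simply scored as wrong on that example.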
