Publication

SLR: Automated Synthesis for Scalable Logical Reasoning

Lukas Henrik Helff; Ahmad Omar; Felix Friedrich; Antonia Wüst; Hikaru Shindo; Rupert Mitchell; Tim Woydt; Patrick Schramowski; Wolfgang Stammer; Kristian Kersting
In: Maria Liakata; Viviane P. Moreira; Jiajun Zhang; David Jurgens (Eds.). Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2026, San Diego, California, United States, July 2-7, 2026. Computational Linguistics Applications (CLA), Pages 402-426, Association for Computational Linguistics, 2026.

Abstract

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-BENCH, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-BENCH, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
