
Publication

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Hikaru Shindo; Hanzhao Lin; Lukas Henrik Helff; Patrick Schramowski; Kristian Kersting
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2604.16022, Pages 1-26, arXiv, 2026.

Abstract

As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B (OpenAI, 2025)) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
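For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard two-player Elo update. It is an illustration only: the abstract does not state how SocialGrid computes its ratings, so the function names, the K-factor of 32, and the pairwise formulation are all assumptions rather than the authors' method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k = 32 is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical agents start at the same rating; agent A wins.
    print(update_elo(1500.0, 1500.0, 1.0))
```

With equal starting ratings the expected score is 0.5, so the winner gains half the K-factor and the loser gives up the same amount; the total rating is conserved.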

Further Links