Publication
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Hikaru Shindo; Hanzhao Lin; Lukas Henrik Helff; Patrick Schramowski; Kristian Kersting
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2604.16022, Pages 1-26, arXiv, 2026.
Abstract
As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B (OpenAI, 2025)) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
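
For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard two-player Elo update. It is not taken from the paper: the abstract does not specify how SocialGrid computes its ratings, so the function names, the K-factor of 32, the 400-point scale, and the pairwise formulation are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
    k (assumed to be 32 here) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical agents start at the same rating; A wins one game.
    print(update_elo(1500.0, 1500.0, 1.0))
```

With equal starting ratings the expected score is 0.5 for each side, so a win shifts half of K from the loser to the winner and the total rating is conserved. A team-based game such as the one described would need some rule for mapping match outcomes to pairwise results, which this sketch leaves out.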
