MASEval: Extending Multi-Agent Evaluation from Models to Systems

Abstract

Existing benchmarks for LLM-based agentic systems are model-centric: they fix the agentic setup and therefore cannot compare system-level components such as harness engineering choices. MASEval is a framework-agnostic evaluation library that treats the entire agent system, rather than the model alone, as the unit of analysis. Through the first systematic system-level comparison across three benchmarks, three models, and three frameworks, we find that framework choice affects performance comparably to model choice within a capability tier.

Publication
arXiv 2026