
Existing benchmarks for LLM-based agentic systems are model-centric: they fix the agentic setup and do not compare system-level components such as harness engineering choices. We present MASEval, a framework-agnostic evaluation library that treats the entire agent system as the unit of analysis. Through the first systematic system-level comparison spanning 3 benchmarks, 3 models, and 3 agent frameworks, we find that the choice of framework affects performance as much as the choice of model within a capability tier.