Agentic AI Is Moving From Bigger Models to Bigger Systems

A new set of papers points to the next frontier for AI agents: not only stronger foundation models, but better orchestration harnesses, verifiable mobile environments, persistent context, and evaluation systems that can measure real-world task execution.

A new cluster of AI research points to a practical shift in agent development: the next gains may come less from simply making foundation models larger, and more from scaling the systems that surround them.

For the last several years, AI progress has often been described through model size, benchmark scores, and raw reasoning ability. But agentic AI is exposing a different bottleneck. Once a model is expected to use tools, remember prior work, act through apps, verify outcomes, and recover from mistakes, the surrounding execution layer becomes just as important as the model itself.

That is the central argument of the new arXiv paper “From Model Scaling to System Scaling: Scaling the Harness in Agentic AI.” The paper frames the “agent harness” (memory, context construction, tool use, skill routing, orchestration, verification, and governance) as a first-class object of design rather than a secondary implementation detail.

Why it matters: For builders, this reframes agent progress around engineering discipline: reliable workflows, auditable state, testable tool calls, safe memory, and measurable execution quality. A stronger model can help, but a weak harness can still turn capability into unreliable behavior.

From model-centric benchmarks to harness-level systems

The system-scaling paper argues that agent performance emerges from the interaction between the foundation model and the layers around it. In practical terms, the model is only one component inside a larger operating system for action. The agent must decide what context to load, what memories to trust, which skill or tool to route to, how to check the result, and when to stop.
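The decisions listed above can be made concrete with a toy control loop. What follows is a minimal sketch, not the paper's design: the `Harness` class, its fields, and the `policy` callable that stands in for the foundation model are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str
    argument: str
    output: str
    verified: bool


@dataclass
class Harness:
    # Hypothetical interfaces; none of these names come from the paper.
    policy: Callable[[str], tuple[str, str]]  # stand-in for the model: context -> (tool, argument)
    tools: dict[str, Callable[[str], str]]    # tool registry
    verify: Callable[[str, str], bool]        # (task, output) -> did it work?
    memory: list[str] = field(default_factory=list)
    max_steps: int = 5
    memory_window: int = 3

    def build_context(self, task: str) -> str:
        # Context construction: load only the most recent memories.
        recent = self.memory[-self.memory_window:]
        return "\n".join([*recent, f"TASK: {task}"])

    def run(self, task: str) -> list[Step]:
        trace: list[Step] = []
        for _ in range(self.max_steps):
            tool_name, argument = self.policy(self.build_context(task))
            if tool_name not in self.tools:
                break  # unknown tool, treated here as a stop signal
            output = self.tools[tool_name](argument)
            ok = self.verify(task, output)
            trace.append(Step(tool_name, argument, output, ok))
            if ok:
                # Only verified results are written back to memory.
                self.memory.append(f"{task} -> {output}")
                break
        return trace


def add_tool(argument: str) -> str:
    return str(sum(int(x) for x in argument.split()))


harness = Harness(
    policy=lambda context: ("add", "2 3"),
    tools={"add": add_tool},
    verify=lambda task, output: output == "5",
)
trace = harness.run("add 2 and 3")
print(trace[0].verified, harness.memory)
```

Even in this toy form, the loop returns a full trace rather than a single answer, and memory is updated only after verification. Both choices are assumptions of the sketch, but they show why the harness, not the model alone, determines what gets remembered and when the agent stops.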

That changes what should be measured. Traditional agent evaluations often compress a long interaction into a final success or failure. The new framing calls for benchmarks that also measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. Those are the metrics that matter when agents move from demos into products and internal business workflows.

The paper also introduces CheetahClaws, a Python-native reference harness, and compares its design direction with systems such as Claude Code and OpenClaw. The broader point is not that one harness wins today, but that the harness itself is becoming a competitive surface.

MobileGym turns mobile agents into a more testable problem

A second paper, “MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research,” shows why this shift is happening. Mobile app agents cannot be evaluated only by whether they produce plausible text. They need to click, scroll, type, navigate, and complete tasks inside interface states that can be checked reliably.

MobileGym proposes a browser-hosted mobile environment with structured JSON state, deterministic state-based judging, and scalable parallel rollouts. According to the paper, a single server can host hundreds of parallel instances, with roughly 400 MB of memory per instance and about a three-second cold start. Its benchmark includes 416 parameterized task templates across 28 apps, with deterministic judges and an AnswerSheet protocol designed to avoid brittle free-text matching.
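To illustrate what deterministic state-based judging over structured JSON state could look like, here is a minimal sketch. The state schema, the dotted-path convention, and the `judge` function are assumptions made for this example; they are not MobileGym's actual format or API.

```python
import json
from typing import Any


def get_path(state: Any, path: str) -> Any:
    """Follow a dotted path such as 'clock.alarms.0.time' through dicts and lists."""
    node = state
    for key in path.split("."):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


def judge(final_state_json: str, expected: dict[str, Any]) -> bool:
    """Pass only if every expected path holds the expected value in the final state."""
    state = json.loads(final_state_json)
    try:
        return all(get_path(state, path) == value for path, value in expected.items())
    except (KeyError, IndexError, ValueError, TypeError):
        # A missing or malformed field counts as a failed task, not a crash.
        return False


# Hypothetical final state after a "set an alarm for 07:30" task.
final_state = '{"clock": {"alarms": [{"time": "07:30", "enabled": true}]}}'

passed = judge(final_state, {"clock.alarms.0.time": "07:30", "clock.alarms.0.enabled": True})
failed = judge(final_state, {"clock.alarms.1.time": "07:30"})
print(passed, failed)  # True False
```

The judge never inspects what the agent said. It only compares fields in the resulting state, which is the property that makes this style of check repeatable across many parallel rollouts.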

This is important because reinforcement learning and automated evaluation need high-volume, verifiable environments. If an agent is trained to operate a mobile app, researchers need to know not just that the agent sounded correct, but that the task state actually changed in the intended way.

Always-on assistants raise the bar even further

The third related signal comes from “Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User’s Digital World,” highlighted on Hugging Face Papers and available on arXiv. The benchmark targets personal assistants that have access to broader user context: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interactions across multiple devices.

That broader setting makes the problem much harder. The paper reports that GPT-5.5 reaches only 34.5% pass@1 (success on the first attempt) on the benchmark, substantially below the scores reported on prior benchmarks. The result is a useful reality check: agents may feel impressive in narrow tasks, but persistent assistants must reason across noisy histories, conflicting signals, privacy boundaries, and changing user goals.
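For readers unfamiliar with the metric, pass@1 is the probability that a single attempt at a task succeeds, averaged over tasks. The sketch below uses the widely cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number of successes. The article does not say how Claw-Anything computes its score, so treat this as general background; the sample counts are invented for illustration and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average per-task pass@1 over (attempts, successes) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Illustrative counts only: three tasks, four attempts each.
illustrative = [(4, 4), (4, 0), (4, 2)]
score = benchmark_pass_at_1(illustrative)
print(score)  # 0.5
```

With k = 1 the estimator reduces to c / n per task, so a benchmark-level pass@1 is simply the mean first-attempt success rate.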

Claw-Anything also evaluates proactive assistance — cases where an agent must anticipate what the user needs instead of waiting for a direct command. That capability will be valuable in real products, but it also increases the need for permissions, transparency, and safe interruption policies.

The emerging stack: model, harness, environment, evaluator

Taken together, these papers suggest that agentic AI is becoming a systems problem with four tightly connected layers:

  • The model: reasoning, language understanding, planning, and multimodal perception.
  • The harness: memory, context selection, tools, skill routing, orchestration loops, and governance.
  • The environment: apps, mobile interfaces, files, CLIs, APIs, user histories, and backend services.
  • The evaluator: deterministic checks, outcome verification, trajectory analysis, and safety constraints.

This is a healthier way to think about AI agents. It moves the industry away from “the model will figure it out” and toward a more inspectable architecture. In that architecture, every layer can be improved, tested, logged, and governed.

What builders should watch next

For startups and enterprise teams, the message is practical. Agent quality will increasingly depend on durable infrastructure: persistent memory stores, permission-aware context pipelines, robust tool APIs, simulation environments, replayable traces, and evaluation suites that can catch regressions before users do.
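As one example of what a replayable trace could buy a team, the sketch below re-runs recorded tool calls against the current tool implementations and reports any whose output has changed. The `RecordedCall` structure and the `replay` function are hypothetical, and the approach assumes the tools are deterministic.

```python
from typing import Callable, NamedTuple


class RecordedCall(NamedTuple):
    tool: str
    argument: str
    output: str  # output observed when the trace was recorded


def replay(
    trace: list[RecordedCall],
    tools: dict[str, Callable[[str], str]],
) -> list[RecordedCall]:
    """Re-run each recorded call and return those whose output drifted or whose tool is gone."""
    drifted = []
    for call in trace:
        tool = tools.get(call.tool)
        if tool is None or tool(call.argument) != call.output:
            drifted.append(call)
    return drifted


# Hypothetical recorded trace and current tool registry.
recorded = [
    RecordedCall("upper", "ok", "OK"),    # still matches
    RecordedCall("upper", "hi", "Hi"),    # behavior changed since recording
    RecordedCall("missing", "x", "y"),    # tool was removed
]
current_tools = {"upper": str.upper}

regressions = replay(recorded, current_tools)
print([call.tool for call in regressions])  # ['upper', 'missing']
```

Run in continuous integration, a check like this flags tool-level regressions before any model is involved, which keeps harness failures separate from model failures.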

The benchmark race will also change. Instead of asking whether an agent can solve a one-off task, the more valuable question will be whether it can improve safely across repeated workflows, maintain clean state, and prove that it completed real actions. That is especially true for mobile GUI agents and always-on assistants, where the environment is messy and user trust is fragile.

The caution is equally important. These papers do not mean fully autonomous personal agents are solved. They show the opposite: today’s systems still struggle when the scope expands. But they also outline a clearer path forward. The next wave of agent progress may come from better systems around models — not just bigger models inside them.
