TL;DR: Bigger context windows don’t mean smarter agents—models lose the middle of long inputs, and usable context is smaller than advertised. The fix is engineering, not scale: curate high-signal context, design tools agents can actually call, and treat flawed evals as directional signal while you invest in scaffolding.
The Context You Don’t Use Is Hurting You: This Week in Agent Engineering
Introduction
This week’s threads all point at the same uncomfortable truth: the bottleneck in agent performance isn’t the model anymore—it’s everything around it. Larger context windows quietly degrade reasoning, tool interfaces silently sabotage agents that can’t call them, and the benchmarks you trust often measure the wrong thing. Each problem looks different, but the fix rhymes: stop front-loading raw capacity and start engineering the system. Curate the tokens, shape the tools, and read your evals as signal rather than scripture. If you build agents for a living, these three findings should change how you spend your next sprint.
Why Bigger Context Windows Make Agents Dumber
Surfaced by Qodo’s talk on the U-curve attention problem, this one is counterintuitive: more context can make your agent worse.
The foundational evidence is Lost in the Middle (Liu et al., TACL 2024), which found a U-shaped performance curve—models exploit information at the start (primacy) and end (recency) of context but lose accuracy when key facts sit in the middle. Multi-document QA accuracy dropped by more than 30% when the answer moved from the edges to the middle of a 20-document context, and the effect held across six model families.
That was 2024. Chroma’s 2025 Context Rot report extended the finding across 18 frontier models including GPT-4.1, Claude 4, and Gemini 2.5, showing that “models do not use their context uniformly”: reliability decays as input grows, even on trivial retrieval. One surprise: models often did better on randomly shuffled haystacks than logically structured ones, suggesting that document structure can mislead attention.
The practical antidote comes from Anthropic’s context engineering guidance: treat context as a finite attention budget and find “the smallest possible set of high-signal tokens.” Use just-in-time retrieval, compaction, structured note-taking, and sub-agents that explore with tens of thousands of tokens but return a 1,000–2,000 token summary.
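One way to read "finite attention budget" is as a selection problem: keep only the highest-scoring chunks that fit under a token cap. The sketch below is a minimal illustration, not Anthropic's method. The `Chunk` type, the `score` field (assumed to come from whatever retriever you use), and the four-characters-per-token estimate are all assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # assumed relevance score from a retriever; higher is better


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return max(1, len(text) // 4)


def select_within_budget(chunks: list[Chunk], budget_tokens: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit in the token budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    return selected
```

In practice you would swap the character heuristic for your model's real tokenizer. The greedy pass is only a starting point: it skips a large chunk that does not fit and keeps going, which favors several small high-signal chunks over one big one.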
graph LR
A[Retrieved Docs] --> B{Rank by relevance}
B --> C[High-value to edges]
B --> D[Compact mid-context]
C --> E[Curated Window]
D --> E
E --> F[Sharper Agent]
The takeaway: assume your usable window is smaller than the spec sheet claims, and rank retrieved passages so the best ones land at the edges.
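Placing the best passages at the edges can be sketched as a simple reordering step applied after ranking. This is an illustrative helper (the name `place_best_at_edges` is made up here), and it assumes the input list is already sorted best-first.

```python
def place_best_at_edges(passages_ranked: list[str]) -> list[str]:
    """Reorder passages (best first) so the strongest sit at the start and end.

    Ranked passages alternate between the front and the back of the window,
    which pushes the weakest ones toward the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, passage in enumerate(passages_ranked):
        if i % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    # Reverse the back half so the second-best passage lands in the last slot.
    return front + back[::-1]


ordered = place_best_at_edges(["p1", "p2", "p3", "p4", "p5"])
# ordered == ["p1", "p3", "p5", "p4", "p2"]
```

With five ranked passages, the top two end up first and last, and the lowest-ranked one sits in the middle, where the U-curve says it is least likely to be used.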
Designing Agent Interfaces: Lessons from Chrome DevTools MCP
Google’s “we built it wrong three times” story surfaced this topic, and the lessons generalize to any MCP tool you ship.
The Chrome DevTools MCP server gives coding agents “eyes” into the browser—solving the problem that agents otherwise, as the Chrome team puts it, are “programming with a blindfold on.” What makes its design notable isn’t the feature list; it’s the discipline. Tools are task-oriented: performance_start_trace starts Chrome, opens a page, and records a trace from a single call. The server collects tool invocation success rates and latency by default, treating observability as first-class. And it exposes panel data incrementally rather than dumping the entire DevTools surface at once.
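Treating observability as first-class can start small. The sketch below is not how the Chrome DevTools MCP server implements its metrics; it is a hypothetical decorator that records call counts, failures, and latency per tool in an in-memory dictionary, which stands in for a real metrics backend.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-tool counters; an in-memory stand-in for a real metrics backend.
METRICS = defaultdict(lambda: {"calls": 0, "failures": 0, "total_seconds": 0.0})


def instrumented(tool_name):
    """Wrap a tool function so every call updates success and latency stats."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            stats = METRICS[tool_name]
            stats["calls"] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["failures"] += 1
                raise  # the caller still sees the original error
            finally:
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


def success_rate(tool_name):
    """Fraction of calls that did not raise, or None if the tool was never called."""
    stats = METRICS[tool_name]
    if stats["calls"] == 0:
        return None
    return 1 - stats["failures"] / stats["calls"]
```

A tool whose success rate sits well below its peers is a good first candidate for a clearer description or a simpler signature.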
Anthropic’s guide to writing tools for agents supplies the backbone checklist: design tools for agents, not as reused developer APIs; namespace overlapping tools (asana_search, jira_search); return actionable error messages instead of opaque tracebacks; and manage token budgets with truncation, filtering, and pagination. “Agents are only as effective as the tools we give them.”
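Two items from that checklist, actionable errors and pagination, are easy to show together. The tool below is hypothetical (the `search_tickets` name, the ticket shape, and the response fields are assumptions for illustration), and it searches an in-memory list rather than a real tracker.

```python
def search_tickets(tickets: list[dict], query: str,
                   page: int = 1, page_size: int = 5) -> dict:
    """Hypothetical agent-facing search with pagination and guided errors."""
    # Errors tell the agent what to do next instead of surfacing a traceback.
    if not query.strip():
        return {"ok": False,
                "error": "query is empty. Pass a keyword such as 'login failure'."}
    if page < 1:
        return {"ok": False,
                "error": f"page must be >= 1, got {page}. Start with page=1."}

    matches = [t for t in tickets if query.lower() in t["title"].lower()]
    start = (page - 1) * page_size
    items = matches[start:start + page_size]
    has_more = start + page_size < len(matches)
    return {
        "ok": True,
        "items": items,            # at most page_size results per call
        "total": len(matches),     # lets the agent decide whether to narrow the query
        "next_page": page + 1 if has_more else None,
    }
```

Returning `total` and `next_page` keeps each response small while still telling the agent how much more exists, so it can refine the query instead of paging blindly.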
The dominant anti-pattern, called out sharply in this AWS Heroes piece, is the thin REST wrapper that forces agents through silent multi-step plumbing—get_customer_by_email → list_orders → get_status—instead of one outcome-oriented tool that returns a tracking link and ETA. The five-minute litmus test: “Can an LLM discover this tool and call it correctly on the first try?” If not, you’ve built an API, not an agent interface.
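The difference is easiest to see in code. In this sketch, three in-memory dictionaries stand in for the three REST endpoints, and a single hypothetical tool, `track_latest_order`, does the plumbing internally. All names, data, and the example.com URL are invented for illustration.

```python
# Hypothetical in-memory data standing in for three REST endpoints.
CUSTOMERS = {"ada@example.com": {"id": 1}}
ORDERS = {1: [{"order_id": "A100", "placed": "2025-01-02"},
              {"order_id": "A101", "placed": "2025-02-03"}]}
STATUS = {"A101": {"tracking_url": "https://example.com/track/A101",
                   "eta": "2025-02-10"}}


def track_latest_order(email: str) -> dict:
    """One outcome-oriented call: email in, tracking link and ETA out."""
    customer = CUSTOMERS.get(email)
    if customer is None:
        return {"ok": False,
                "error": f"No customer with email {email!r}. "
                         "Ask the user to confirm the address on the account."}
    orders = ORDERS.get(customer["id"], [])
    if not orders:
        return {"ok": False, "error": "This customer has no orders yet."}
    # ISO dates sort correctly as strings, so max() finds the newest order.
    latest = max(orders, key=lambda o: o["placed"])
    status = STATUS.get(latest["order_id"])
    if status is None:
        return {"ok": False,
                "error": f"Order {latest['order_id']} has no shipping status yet."}
    return {"ok": True, "order_id": latest["order_id"],
            "tracking_url": status["tracking_url"], "eta": status["eta"]}
```

The agent makes one call and gets either the answer or a message it can act on; the lookup chain and its failure modes stay inside the tool.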
Evals Are Broken—Use Them Anyway
Cline’s case study of pushing Terminal Bench up through harness fixes rather than model swaps frames the tension perfectly.
First, the broken part. A systematic review of 445 LLM benchmarks by 29 expert reviewers found recurring construct-validity failures—benchmarks frequently “don’t measure what they claim,” undermining the headline numbers everyone chases. Treat benchmark deltas as noisy directional signal, not ground truth.
Now the “use them anyway” case. Anthropic argues evals are a forcing function that makes teams specify what success means and resolve spec ambiguity. The concrete payoff: teams with evals can validate a new model and upgrade “in days” while competitors spend weeks testing—and once a suite exists, you get regression baselines for latency, cost, and error rates for free.
The decisive lever, though, is scaffolding. The Confucius Code Agent paper showed on SWE-Bench-Pro that a weaker model with strong scaffolding (Claude 4.5 Sonnet at 52.7%) beat a stronger model on a proprietary scaffold (Opus at 52.0%). The same scaffold lifted scores to 54.3% with Opus and 59.0% with GPT-5.2, exceeding vendor-reported numbers—gains attributed “purely to stronger agentic scaffolding,” not the backbone model. Use your imperfect evals to measure whether harness changes move the needle, and stop reflexively chasing the next flagship.
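To treat an eval delta as directional signal rather than ground truth, it helps to put an interval around it. The sketch below uses a normal approximation for the difference of two pass rates. It assumes the two runs are independent samples, which is a simplification when both harnesses run the same task set, and the function name is illustrative.

```python
import math


def pass_rate_delta(passed_a: int, n_a: int, passed_b: int, n_b: int):
    """Pass-rate difference (B minus A) with an approximate 95% interval.

    Normal approximation for two independent proportions (an assumption;
    a paired test is more appropriate when both runs share the same tasks).
    """
    p_a = passed_a / n_a
    p_b = passed_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    delta = p_b - p_a
    return delta, delta - 1.96 * se, delta + 1.96 * se


# Old harness: 50/100 tasks pass. New harness: 60/100 tasks pass.
delta, low, high = pass_rate_delta(50, 100, 60, 100)
```

Here the 10-point gain comes with an interval that still includes zero, so on a 100-task suite it is a hint to keep investigating, not proof that the harness change worked.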
Key Takeaways
- Assume your usable context is smaller than advertised. Rank retrieved passages so the highest-value ones sit at the start and end of the window—the middle is the model’s blind spot (Liu et al.).
- Apply this today: run the five-minute litmus test on your MCP tools—can an LLM discover and call each correctly on the first try? Fix the ones that fail.
- Build outcome-oriented tools that return high-signal results and fail loudly with guidance, not thin REST wrappers that hide multi-step plumbing.
- Invest in scaffolding over model upgrades—orchestration and context management can outperform a stronger model.
- Keep flawed evals anyway. They’re directional signal and the difference between upgrading in days versus weeks.
Further Reading
- Lost in the Middle (Liu et al., TACL 2024) — the foundational study on U-shaped context attention.
- Context Rot (Chroma, 2025) — degradation across 18 frontier models, even on trivial tasks.
- Effective Context Engineering for AI Agents (Anthropic) — compaction, note-taking, and sub-agent patterns.
- Writing Tools for AI Agents (Anthropic) — the authoritative tool-design checklist.
- Demystifying Evals for AI Agents (Anthropic) — why imperfect evals still pay off.