TL;DR: This week we dig into three counterintuitive truths about building AI agents: bigger memory can make them worse, fewer-but-sharper tools beat one clever do-everything tool, and no single AI model should do every job. If you’re new to this: an AI agent is a program that uses a large language model to plan and act on its own, and how you feed it information matters more than how much you feed it.

Smarter, Not Bigger: Three Lessons in Building Agents That Actually Work

Who is this for? Developers starting to build LLM-powered apps, and senior engineers looking to sharpen their agent architecture.

Introduction

There’s a seductive idea in AI engineering: if a little context helps, more must help more. Give the model the whole codebase, the entire chat history, every document — and watch it get smarter. This week’s research says the opposite. Three threads converge on a single discipline: curation beats accumulation. Whether you’re managing what an agent reads (context), what an agent can do (tools), or which brain handles each step (model routing), the winning move is restraint. Below, we unpack why stuffing the window backfires, why granular tools outperform clever ones, and why the best workflows hand different jobs to different models.

Why More Context Makes Your Agent Dumber

Imagine a colleague who reads a 50-page brief but only remembers the first page and the last. Everything in the middle? Gone. That’s not a hypothetical — it’s how LLMs actually behave.

The foundational evidence comes from “Lost in the Middle” (Liu et al., TACL 2024), which found a U-shaped performance curve: models reliably use information at the start and end of a prompt but lose it in the center. When the answer document moved from first position to the middle of a 20-document context window, accuracy dropped over 30%.

You might assume newer, bigger models fixed this. They didn’t. Chroma’s Context Rot study tested 18 modern models including GPT-4.1, Claude 4, and Gemini 2.5, and found that “performance grows increasingly unreliable as input length grows” — even on trivial tasks. The popular Needle-in-a-Haystack test (hiding one fact in a long document) hides the problem, because real reasoning is harder than spotting a single fact.

Andrej Karpathy's analogy, cited by LangChain: the LLM is the CPU and its context window is RAM — finite working memory you must curate, not flood. (LangChain)

The fix is context engineering“filling the context window with just the right information at each step.” Anthropic recommends three tactics for long tasks: compaction (summarize and restart the window), structured note-taking, and multi-agent splitting. LangChain organizes the toolkit into four buckets: Write, Select, Compress, Isolate.

graph LR
  A[Raw context dump] --> B[Select: retrieve + rerank]
  B --> C[Compress: summarize]
  C --> D[Place key facts at edges]
  D --> E[Sharper agent answer]
Quick win: when a conversation nears the window limit, summarize it and start fresh from the summary. You keep the thread without dragging the lossy middle along.

This topic was surfaced by a sharp YouTube breakdown on the U-curve problem.

Designing Tools Agents Can Actually Use

A coding agent without browser access is “programming with a blindfold on” — it writes code but can’t see what that code does when it runs. That framing comes from Google’s Chrome team, and their fix is a useful case study in tool design.

First, jargon: an MCP server is how an agent reaches out to the real world — to a browser, a database, an API. The official Chrome DevTools MCP gives agents eyes into a live browser.

The key lesson is granularity. Chrome exposes many small, verb-named tools — performance_start_trace, navigate_page, list_console_messages — instead of one giant “debug this” command. The agent composes them into a workflow: start Chrome, open the page, record a trace, then analyze it.

Think of tools like a kitchen. A monolithic "make dinner" button hides everything from the cook. A drawer of labeled utensils — whisk, peeler, ladle — lets the cook improvise any dish. Agents are non-deterministic cooks; they need utensils, not one magic button.

Anthropic argues we must “design tools for agents,” not model them on traditional APIs. Too many overlapping tools confuse the agent about which to call. Their fix: namespace related tools (asana_search, asana_projects_search), and the prefix-versus-suffix choice measurably affects tool-use scores.

A practitioner deep-dive reinforces it: “granular tools beat monolithic tools.” Replace one smart_search(query, auto_filter) with three explicit tools — search_by_person, search_by_date_range, search_semantic — and return structured data, not formatted strings, so the agent can reason over results. One production server ran 13 specialized tools at under 300ms latency.

Avoid the "smart tool" trap: a single tool that parses intent, filters, sorts, and formats is impossible to debug and hides logic from the agent. Split it.

This came from a candid Google postmortem video on getting MCP granularity wrong.

Routing Each Job to the Right Model

You wouldn’t ask a Michelin chef to wash the dishes. Yet most agents use one expensive flagship model for everything — planning, designing, and the mechanical grunt work. Multi-model routing fixes that mismatch.

Anthropic defines routing as classifying an input and directing it to a specialized follow-up task: “separation of concerns, and building more specialized prompts.” Pair it with prompt chaining — breaking work into sequential steps with programmatic “gate” checks between them — and you get a clean plan → design → build pipeline where each phase uses a different model.

The economics are striking. LMSYS’s RouteLLM trains a router on ~80,000 human preference comparisons to decide, per query, between a strong/expensive and weak/cheap model. Their best router hit 95% of GPT-4 quality using only ~26% GPT-4 calls — cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K.

graph LR
  A[Task] --> B{Phase?}
  B -->|Plan| C[Strong reasoning model]
  B -->|Design| D[Design-strength model]
  B -->|Build| E[Cheap fast model]
An eval is how you'd verify your routing actually saves money without hurting quality — test it on real tasks before trusting it.

The open-source RouteLLM framework exposes a single calculate_strong_win_rate method plus a tunable threshold — your cost/quality dial. Route a strong reasoning model to planning, a design-strength model to UI work, and cheaper models to routine code generation.

Start simple: pick two models — one premium, one cheap — and route only your hardest step to the premium one. Measure the cost drop.

Surfaced via a YouTube discussion on the shift from single-model loyalty to multi-model composition.

Key Takeaways

Further Reading