AGENTS.md Makes Your AI Coding Agent Worse - and Now There's Research to Prove It
3 March 2026 · AI
Earlier this year I burnt through 5x my usual token budget because an AI influencer on X promised his custom instruction file would "supercharge" my coding agent. It didn't. The agent ran the same grep commands three times in a row, opened files it had already read, and produced code that ignored half the rules it was supposedly following. I spent more time babysitting the agent than I would have spent writing the code myself.
I stripped the instructions back to almost nothing - and everything improved. Fewer tokens, faster responses, better code. I wrote about the mechanics of why this happens in my guide to custom instructions. But at the time, my evidence was anecdotal. I knew it worked, but I couldn't point to a controlled study.
Then ETH Zurich published one.
The Promise of Context Files
Every major AI coding tool now supports some form of project-level instruction file. GitHub Copilot reads .github/copilot-instructions.md. Cursor reads .cursorrules. Claude Code reads CLAUDE.md. Google's Gemini CLI reads GEMINI.md. OpenAI's Codex reads AGENTS.md. The idea is the same across all of them: a markdown file checked into your repository that tells the AI agent how to behave in this specific project.
The community loves them. Repositories of "awesome" instruction files circulate on X and GitHub. Developers share their configurations like dotfiles. The implicit promise is straightforward - give the agent more context about your project and it will write better code.
It sounds logical. It's also wrong.
What Happened When I Tried a Viral Instruction Set
The instruction file I copied from X was about 2,000 tokens. It covered everything: coding style, architecture patterns, testing requirements, documentation standards, error handling philosophy, commit message format, and a dozen "always do X, never do Y" rules.
Here's what actually happened during a typical coding session:
Token spend exploded. The instructions were sent with every single API call. Over a 50-message session, that's 100,000 extra input tokens just for the instructions - before any actual code.
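That overhead is simple arithmetic. A quick sketch (the per-token price below is a hypothetical placeholder, not any provider's real rate):

```python
# Back-of-envelope cost of resending a static instruction file.
# Token figures are the ones from the session described above;
# the price is an assumed placeholder for illustration.

INSTRUCTION_TOKENS = 2_000   # size of the copied instruction file
MESSAGES = 50                # API calls in one coding session
PRICE_PER_MILLION = 3.00     # assumed input price in USD (hypothetical)

overhead_tokens = INSTRUCTION_TOKENS * MESSAGES
overhead_cost = overhead_tokens / 1_000_000 * PRICE_PER_MILLION

print(overhead_tokens)          # 100000 extra input tokens per session
print(f"${overhead_cost:.2f}")  # $0.30 per session before any code
```

Prompt caching can discount repeated prefixes on some providers, but the instruction tokens are still part of every request's context window.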
The agent became indecisive. With 40+ rules to satisfy, it would start implementing something, then circle back to check whether it was following a rule, then restructure the code, then circle back again. A task that should have taken 3 tool calls took 12.
Redundant operations multiplied. The instructions told the agent to "always review existing code before making changes" and "verify your changes work correctly after every edit." These sound reasonable in isolation. In practice, the agent would grep the entire codebase before a one-line fix, then run the full test suite after changing a translation key. Every instruction that says "always" triggers behaviour on every task, regardless of whether it's needed.
Rule compliance actually dropped. This is the counterintuitive part. With 40 rules, the agent followed maybe 60% of them consistently. When I cut it down to 8, compliance went above 90%. More rules meant worse adherence to each individual rule.
My Journey to Lean Instructions
I spent a few weeks experimenting. I'd remove a rule, run a coding session, and compare the output. The pattern was consistent: removing rules almost never made the output worse, and often made it better.
What survived the cuts:
- Stack and versions - PHP 8.4, Symfony 8, Bootstrap 5 - because models default to older versions without this
- Code style - British English spelling, LF line endings, hyphen instead of em dash - things specific to this project that contradict common defaults
- Language features to prefer - `static fn`, `match` over `switch`, `array_find()`, `readonly class`, typed constants - because these are recent PHP additions the model won't default to
- SEO and template rules - breadcrumb labels under 3 words, semantic HTML hierarchy, schema.org through a specific factory class - conventions unique to this codebase
- Verification policy - when to run tests and when to skip - because without this, the agent verifies everything
- Search efficiency - prefer broad searches over sequential narrow ones - because the default agent behaviour is to grep five times when one search would do
That's it. Six categories. Under 200 tokens total. Each one earns its place because the model would genuinely do something different without it.
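For illustration, here is a hypothetical sketch of what a file built from those six categories could look like - the headings and exact wording are invented for this example, not copied from my actual file:

```markdown
## Stack
PHP 8.4, Symfony 8, Bootstrap 5.

## Style
British English spelling. LF line endings. Hyphens, not em dashes.

## Language features
Prefer `static fn`, `match` over `switch`, `array_find()`,
`readonly` classes, typed constants.

## Templates and SEO
Breadcrumb labels under 3 words. Semantic HTML hierarchy.

## Verification
Run tests only for new routing, complex DI wiring, or reported bugs.
Skip for template edits, config values, copy changes.

## Search
Prefer one broad search over several narrow sequential ones.
```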
What I dropped:
- Generic best practices ("write clean code", "use SOLID principles", "use DDD")
- Things any competent model already does ("use meaningful variable names", "keep methods short")
- Framework defaults ("use dependency injection in Symfony" - it's the only way to wire services)
- Obvious constraints ("don't introduce security vulnerabilities", "never trust external data")
The result was a set of instructions under 200 tokens. My token spend dropped back to normal. The agent was faster, more focused, and paradoxically followed the remaining rules more reliably than it had ever followed the full set.
I thought this was just my experience. Then ETH Zurich ran the experiment properly.
Then ETH Zurich Confirmed It
In February 2026, researchers at ETH Zurich published "Evaluating AGENTS.md: Do Instruction Files Help AI Coding Agents?" - the first rigorous study of whether project context files actually improve AI coding agent performance.
Their setup was solid. They tested multiple frontier models (Claude 3.5 Sonnet, GPT-4o, and others) on the SWE-bench Verified benchmark - 500 real GitHub issues from popular open-source projects. They compared three conditions: no instructions, default AGENTS.md generated from repository documentation, and enhanced AGENTS.md with detailed coding guidelines.
The results were clear:
Context files reduced success rates. Across every model tested, adding an AGENTS.md file either made no difference or actively hurt task completion. The best-performing configuration was consistently no instructions at all.
Costs increased by 20% or more. Agents with context files consumed more tokens, made more tool calls, and took longer to complete tasks. The instructions didn't just fail to help - they added measurable overhead.
Exploration became broader but less targeted. Agents with instructions explored more files and directories but were less efficient at finding the relevant code. They cast a wider net and caught less. The instructions were pushing the agents to do more work, not better work.
Rule compliance was inconsistent. Even when agents appeared to follow instructions, they did so selectively. Some rules were followed reliably, others were ignored entirely, and the pattern varied between models and tasks.
The Nuance the Paper Doesn't Cover
The ETH Zurich study is valuable, but it measures a specific thing: isolated bug fixes on open-source repositories. SWE-bench tasks are self-contained - an agent picks up an issue, fixes it, and moves on. It never returns to the same codebase. It doesn't need to maintain consistency across sessions.
Real-world development is different in ways that matter:
Multi-session consistency. When I work with an AI agent on the same project for weeks, I need it to use static fn every time, not just when it feels like it. Without instructions, the agent might use arrow functions in one session and traditional closures in the next. For isolated bug fixes this doesn't matter. For a maintained codebase it creates inconsistency that accumulates into technical debt.
Project-specific conventions that save time. "Skip verification for template edits" is a rule that saves me 30 seconds per template change. Over a week of development, that compounds. SWE-bench doesn't measure this kind of efficiency gain because each task is independent.
Stack version pinning. Without being told PHP 8.4, models will write PHP 8.1-compatible code. Without Symfony 8, they'll suggest patterns from Symfony 5. These aren't best practices - they're facts about the project that the model genuinely can't infer from the task description alone.
The research confirms that most instructions are harmful. It doesn't prove that all instructions are useless. The distinction matters.
Why Context Files Backfire
The ETH Zurich findings align with a mechanism I described in my custom instructions guide: instructions don't just add tokens - they change behaviour.
Rule dilution. Attention is finite. In a transformer, every token competes with every other token for attention weight. A 50-rule instruction set means each rule gets roughly 2% of the model's focus. A 5-rule set means each rule gets 20%. Fewer rules means stronger adherence to each one.
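As a back-of-envelope illustration - real attention weights are learned, not split evenly, so treat this as intuition rather than a model:

```python
# Illustrative only: transformer attention does not divide evenly
# across rules, but the dilution intuition works out like this.

def attention_share_per_rule(rule_count: int) -> float:
    """Naive equal-split share of focus each rule receives."""
    return 1 / rule_count

print(f"{attention_share_per_rule(50):.0%}")  # 2% with a 50-rule file
print(f"{attention_share_per_rule(5):.0%}")   # 20% with a 5-rule file
```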
Conflicting signals. Long instruction sets almost always contain contradictions. "Keep code simple" + "always add comprehensive error handling" + "write detailed docstrings" = three rules pulling in different directions. The model has to resolve the conflict, and it resolves it inconsistently.
Behavioural overhead. Instructions that say "always" or "after every change" trigger actions on every task. "Always review existing code first" turns a 2-tool-call fix into a 6-tool-call fix. "Run tests after every change" adds a test execution after changing a CSS class. These rules don't scale with task complexity - they apply uniformly, which means they add the most overhead to the simplest tasks.
Redundant guidance. Most instructions tell the model to do things it would already do. "Use meaningful variable names" is not a rule - it's a default behaviour of every modern LLM. Including it wastes tokens and dilutes the rules that actually matter.
What Actually Works
Based on my experience and now validated by the research, effective instructions share three properties: they are minimal, specific, and conditional.
Minimal means including only things the model can't infer from context. If a senior developer in your stack would do it by default, don't include it.
Specific means concrete, testable rules rather than abstract principles. Not "write clean code" but "use match instead of switch for single-value returns." Not "follow best practices" but "PHP 8.4, Symfony 8."
Conditional means rules that apply selectively rather than universally. Not "always run tests" but "run tests only when there's genuine uncertainty - new routing, complex DI wiring, debugging a reported error. Skip for template edits, config values, copy changes."
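To make the contrast concrete, here is a hypothetical before/after rewrite of a single rule - the wording is invented for illustration:

```markdown
<!-- Before: universal, fires on every task -->
Always run the test suite after every change.

<!-- After: conditional, fires only when it earns its cost -->
Run tests only when there is genuine uncertainty: new routing,
complex DI wiring, or debugging a reported error.
Skip tests for template edits, config values, and copy changes.
```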
For the detailed mechanics of how instructions affect token costs, prompt caching, and tool call multiplication, see my full guide on writing custom instructions.
Comparison: Three Approaches
| | Viral / bloated instructions | Lean project-specific instructions | No instructions |
|---|---|---|---|
| Token overhead per request | ~800-2,000 | ~150-200 | 0 |
| Task success rate (SWE-bench) | Lower than baseline | Similar to baseline | Baseline |
| Cost per session | 20-50% higher | 5-10% higher | Baseline |
| Rule compliance | Low (too many rules) | High (few, clear rules) | N/A |
| Multi-session consistency | Poor (rules ignored selectively) | Good (rules followed reliably) | Poor (no conventions enforced) |
| Unnecessary tool calls | Many ("always verify", "always review") | Few (conditional rules) | Few |
| Best for | Nothing | Maintained projects with specific conventions | One-off tasks, isolated bug fixes |
Conclusion
The instinct to give AI agents more context is understandable. More information should mean better results. But LLMs are not people reading a briefing document - they are statistical models where every additional token competes for attention, triggers behaviour, and costs money.
The ETH Zurich research confirms what many practitioners have discovered independently: the default instruction files circulating in the community make agents worse, not better. The agents explore more, spend more, and accomplish less.
The answer is not to abandon instructions entirely. It's to treat them like code - every line should earn its place. If a rule doesn't change the model's behaviour in a way you can observe, delete it. If a rule applies to every task equally, make it conditional. If a rule describes something any competent model already does, it's wasting tokens.
Less is more. The research now proves it.