Here’s what surprised me most about building with AI for three years: the more instructions I gave it, the worse it performed.
Not a little worse. Measurably, consistently, infuriatingly worse.
By March 2026, my CLAUDE.md file — the system prompt that Claude Code reads at the start of every session — had grown to 40,000 tokens. Every time Claude forgot a preference, I added a rule. Every time it ignored a convention, I added a section. Every time it made a mistake I’d corrected before, I added a warning in bold.
Forty thousand tokens of accumulated frustration, formatted as instructions.
And Claude was performing worse than it had six months earlier with a fraction of those rules.
The Context Window Is Not a Filing Cabinet
The assumption I was operating under — the assumption most developers operate under — is straightforward: more context equals better output. If the model doesn’t know your preferences, tell it. If it keeps making mistakes, document them. If it forgets your architecture, paste it in.
This assumption is wrong.
Research from multiple teams has shown that language models degrade significantly when context exceeds roughly 32,000 tokens. As one Hacker News user put it: “Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can’t correctly find/transcribe pieces of code.” Another confirmed it bluntly: “Anything above 32k tokens fails to have acceptable recall, across GPT-4o, Sonnet, and Google’s Gemini Flash.”
My 40,000-token system prompt wasn’t helping Claude understand me better. It was drowning the model in context it couldn’t effectively process. The instructions at the bottom of the file — which often contained the most recent and most important rules — were the ones most likely to be ignored.
I was poisoning my own context window. And I didn’t realize it until the evidence was undeniable.
The Breaking Point
The frustration was specific and repeatable. By the fourth or fifth interaction in a session, Claude would start ignoring rules I’d explicitly written. One developer described this pattern perfectly: “By the fourth or fifth interaction, Claude Code starts ignoring your rules. It stops asking for confirmation. It forgets your workflow preferences. It’s like your CLAUDE.md instructions never existed.”
That was exactly my experience. I’d spend the first 10-30 minutes of every session rebuilding context that should have persisted. Over a work week, that’s several hours of lost productivity — not building, not shipping, just re-explaining things my AI assistant should already know.
The deeper problem wasn’t memory. I’d already built a memory system — a knowledge graph with vector search, MCP tools for Claude Code integration, semantic retrieval. I could store everything. I could retrieve anything.
But storing and retrieving isn’t thinking. You can have perfect recall and zero understanding. And that distinction changed everything.
The Discovery That Changed the Architecture
In early March 2026, I came across Daniel Miessler’s PAI (Personal AI Infrastructure) and reconnected with his Fabric project, which I’d first discovered through Network Chuck’s YouTube channel. After a day of working with PAI, the insight clicked: don’t try to cram everything into the context window. Classify first, then route only what’s needed.
The routing patterns from PAI and Fabric showed me how to turn a memory system into an intelligence pipeline. The key architectural shift: pre-classify every request using a trained classification model before the expensive model ever sees it. Figure out what the user needs, what effort level it requires, what capabilities are relevant — then construct a targeted intelligence packet with only the context that matters for this specific request.
Not “here’s everything I know about you.” Instead: “here’s exactly what you need for this task.”
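The classify-then-route idea can be sketched in a few lines. Everything here is illustrative: the category names, effort scale, and keyword heuristic are stand-ins I invented, not the actual gramatr taxonomy, and the real pipeline uses a trained local model rather than keyword matching.

```python
from dataclasses import dataclass

@dataclass
class Route:
    category: str            # what kind of request this is
    effort: int              # 1 = instant reply ... 7 = comprehensive analysis
    context_keys: list[str]  # which memory slices to load for this request

def classify(request: str) -> Route:
    """Stand-in for the local classification model: a keyword heuristic.
    The real router would be a trained classifier, not string matching."""
    text = request.lower()
    if any(w in text for w in ("deploy", "kubernetes", "rollout")):
        return Route("deployment", effort=5, context_keys=["infra", "services"])
    if any(w in text for w in ("refactor", "bug", "function", "test")):
        return Route("coding", effort=4, context_keys=["conventions", "project_state"])
    if any(w in text for w in ("compare", "survey", "research")):
        return Route("research", effort=6, context_keys=["sources", "prior_findings"])
    return Route("chat", effort=1, context_keys=["identity"])

route = classify("Fix the failing test in the auth module")
# Only the slices named in route.context_keys ever reach the expensive model.
```

The cheap classification happens before any tokens are spent on the large model, which is what makes the economics work.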
One Week: March 21-28
I built the routing engine in seven days. The scope:
- A decision router using a local model for pre-classification
- Seven effort levels, from instant responses to comprehensive analysis
- Intelligence packets that replaced the monolithic system prompt
- A dynamic skill registry that grows from usage
- A feedback loop where classification accuracy improves over time
The result was measurable on day one. CLAUDE.md collapsed from 40,000 tokens to approximately 1,200 tokens. The intelligence packet that replaced it contains only what’s relevant to the current request — user identity, project state, behavioral directives for this specific task type, and pre-computed analysis that the router already handled.
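An intelligence packet of that shape might look like the sketch below. The field names and contents are hypothetical examples of the four components described above, and the four-characters-per-token estimate is a rough common heuristic, not an exact count.

```python
from dataclasses import dataclass

def approx_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (a common heuristic)."""
    return max(1, len(text) // 4)

@dataclass
class IntelligencePacket:
    identity: str        # who the user is, stable preferences
    project_state: str   # what's live, what's in flight
    directives: str      # behavioral rules for THIS task type only
    precomputed: str     # analysis the router already performed

    def render(self) -> str:
        return "\n\n".join(
            [self.identity, self.project_state, self.directives, self.precomputed]
        )

packet = IntelligencePacket(
    identity="User prefers terse diffs and conventional commits.",
    project_state="gramatr: routing engine live; NEXT90 site in progress.",
    directives="Task type: coding. Run tests before proposing changes.",
    precomputed="Router classification: coding / effort 4.",
)

prompt = packet.render()
assert approx_tokens(prompt) < 1200  # the whole packet fits the budget
```

The budget assertion is the point: the packet is built to a size target, rather than accreting rules forever the way a monolithic system prompt does.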
Performance was immediately and noticeably better at 1,200 tokens than it had been at 40,000.
Let that sink in. Ninety-seven percent less context. Better results.
Why This Isn’t Just Compression
Someone will read this and think: “So you summarized your system prompt. Anyone can do that.”
No. Summarization loses signal. If you take 40,000 tokens of instructions and compress them to 1,200, you’ve thrown away 97% of the information. The model won’t have what it needs.
What changed isn’t the amount of information — it’s when and how information gets delivered. The intelligence pipeline classifies the incoming request, determines what’s relevant, retrieves only the context that matters, and delivers a targeted packet. A coding request gets coding context. A research request gets research context. A deployment request gets infrastructure context.
The 40,000 tokens of knowledge didn’t disappear. They got reorganized into a system that delivers the right 1,200 tokens for each specific situation. The knowledge lives in the pipeline — in entity memories, in learned patterns, in classification models that improve with every interaction. It just doesn’t all get dumped into the context window at once.
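That reorganization can be made concrete with a toy example. The store contents and keys below are invented for illustration — the real knowledge lives in entity memories and a vector store, not a dict — but the shape is the same: the full knowledge base persists outside the context window, and each request receives only its slice.

```python
# Invented knowledge slices standing in for the real memory system.
KNOWLEDGE = {
    "conventions":    "Use conventional commits; 2-space indent; pytest.",
    "project_state":  "gramatr main branch green; v0.4 tagged.",
    "infra":          "k8s cluster 'prod'; deploys via ArgoCD.",
    "services":       "api, router, memory, web.",
    "sources":        "Prior research notes on context-window recall.",
    "prior_findings": "Recall degrades past ~32k tokens.",
    "identity":       "User prefers terse answers.",
}

def build_context(context_keys: list[str]) -> str:
    """Deliver only the slices the router selected; the rest stays in storage."""
    return "\n".join(KNOWLEDGE[k] for k in context_keys if k in KNOWLEDGE)

coding_context = build_context(["conventions", "project_state"])
# A coding request never sees the infra notes; a deployment
# request never sees the pytest conventions. Nothing was deleted --
# it just isn't all delivered at once.
```

Nothing in `KNOWLEDGE` was thrown away, which is why this differs from summarization: compression discards information, routing defers it.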
This is the difference between context engineering and prompt engineering. Prompt engineering optimizes what you say to the model. Context engineering optimizes the entire pipeline of what reaches the model, when, and why.
The Numbers After One Week
The week after the routing engine went live (March 21-28), development velocity across the gramatr platform jumped from 3.3 commits per day to 22.7 — a 7x increase, verifiable in the git log. I went from working on one project at a time to running three simultaneously: evolving gramatr itself, building a complete website for NEXT90, and building the gramatr.com website.
Four hundred twenty-two commits across five projects. Two complete websites built and deployed. A CRM stood up. New Kubernetes services launched. Training pipelines running.
All git-verifiable at github.com/bhandrigan.
The product was proving itself by powering its own creation.
The Lesson
The problem was never storage. Every AI memory tool on the market — Mem0, Zep, Letta, LangMem — solves storage and retrieval. That’s a solved problem. You can build a vector database in an afternoon.
The unsolved problem is intelligence. Not “how do I store what I know about this user,” but “how do I figure out what this user needs right now, and deliver exactly that.”
I killed my 40,000-token system prompt because it was the wrong solution to the right problem. The right solution is context engineering — the discipline of getting precisely the right information to the model at precisely the right time.
The system prompt was trying to make the model remember everything at once. The intelligence pipeline makes it understand what matters right now.
That’s not a small distinction. That’s the entire product.