Three quotes. Three different corners of the tech industry. The same insight.

Andrej Karpathy (former Tesla AI Director, OpenAI co-founder): “Context engineering is the delicate art and science of filling the context window with just the right information for the next step.”

Tobi Lutke (CEO, Shopify): “I really like the term ‘context engineering’ over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”

Anthropic (September 2025): “Building with language models is becoming less about finding the right words and more about engineering the right context.”

When a co-founder of OpenAI, the CEO of a $100B+ company, and the organization behind Claude all converge on the same term — independently, within months of each other — it’s not a branding exercise. It’s a recognition that the industry has been solving the wrong problem.

Why Prompt Engineering Failed

For three years, the dominant approach was prompt engineering: the craft of writing better instructions to language models. Write clearer prompts. Add examples. Specify the output format. Use chain-of-thought. The entire discipline assumed that the bottleneck was how you talked to the model.

It wasn’t.

The bottleneck was what the model knew when you talked to it.

The difference is fundamental. Prompt engineering optimizes the question. Context engineering optimizes the knowledge available to answer it. You can write the most perfectly structured prompt in history, and if the model doesn’t have the relevant context — your codebase architecture, your design decisions, your project history, your preferences — the response will be generic at best and wrong at worst.

This is what Stack Overflow’s 2025 Developer Survey surfaced in hard numbers: 66% of developers say the biggest frustration with AI tools is solutions that are “almost right, but not quite.” Not wrong. Not useless. Almost right. Which often means the model understood the question perfectly but lacked the context to answer it correctly.

Sixty-five percent of developers report that AI assistants “miss relevant context” when performing refactoring, according to Qodo’s 2025 State of AI Code Quality research. The model can refactor code. It just doesn’t know your conventions, your patterns, your reasons for structuring things the way you did.

Prompt engineering can’t fix this. No matter how well you write the prompt, if the context isn’t there, the answer will be “almost right, but not quite.”

What Context Engineering Actually Means

Karpathy’s definition is precise and worth unpacking: “the delicate art and science of filling the context window with just the right information for the next step.”

Three words matter: “just the right.”

Not all information. Not a knowledge dump. Not “here’s everything I know about this user.” The right information. For this step. Right now.

This is an engineering problem with five distinct stages:

Classification. Before you can deliver the right context, you need to understand what the request actually needs. A coding question about your test suite needs different context than a deployment request for the same project. A quick “what’s the status?” question needs a fraction of the context that a “redesign this architecture” request needs. Classification determines what kind of context is relevant.

Retrieval. Once you know what’s needed, you pull the relevant context from wherever it lives — knowledge graphs, project files, conversation history, learned preferences. This is the part the “AI memory” market focuses on. It matters. It’s also only one stage of five.

Assembly. Retrieved context needs to be structured for the model. Raw data dumps degrade performance. The context needs to be organized — what’s most relevant goes first, project state is summarized rather than enumerated, behavioral directives are clear and specific. This is where the engineering in “context engineering” happens.

Delivery. The assembled context needs to fit within the model’s effective processing range. Research and practical experience consistently show that model performance degrades above roughly 32,000 tokens. An intelligence pipeline that assembles 50,000 tokens of “relevant context” has defeated its own purpose. Delivery means keeping the payload targeted — typically under 5,000 tokens for most requests — while maintaining completeness.

Feedback. Did the context delivery work? Was the classification accurate? Did the model have what it needed? A real context engineering system captures this signal and uses it to improve classification accuracy over time. Without feedback, the system is static. With it, the system learns.
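The five stages above can be sketched as a minimal pipeline. Everything below is illustrative: the stage names come from this section, but the function signatures, the `ContextPacket` type, and the keyword-based classifier are assumptions, stand-ins for real trained components.

```python
from dataclasses import dataclass, field

TOKEN_BUDGET = 5_000  # "typically under 5,000 tokens for most requests"

@dataclass
class ContextPacket:
    """Illustrative container for the context delivered with one request."""
    intent: str                                  # e.g. "coding", "status"
    sections: list[str] = field(default_factory=list)
    token_count: int = 0

def classify(request: str) -> str:
    # Stage 1: decide what kind of context this request needs.
    text = request.lower()
    if "deploy" in text:
        return "deployment"
    if "status" in text:
        return "status"
    return "coding"

def retrieve(intent: str, stores: dict[str, list[str]]) -> list[str]:
    # Stage 2: pull candidate context from wherever it lives.
    return stores.get(intent, [])

def assemble(intent: str, chunks: list[str]) -> ContextPacket:
    # Stage 3: structure the context, most relevant first.
    ordered = sorted(chunks, key=len)  # stand-in for real relevance ranking
    return ContextPacket(intent=intent, sections=ordered)

def deliver(packet: ContextPacket) -> ContextPacket:
    # Stage 4: trim to the model's effective processing range.
    kept, total = [], 0
    for section in packet.sections:
        cost = len(section) // 4  # rough chars-per-token estimate
        if total + cost > TOKEN_BUDGET:
            break
        kept.append(section)
        total += cost
    return ContextPacket(packet.intent, kept, total)

def feedback(packet: ContextPacket, worked: bool, log: list) -> None:
    # Stage 5: record the outcome so classification can improve.
    log.append((packet.intent, worked))
```

In a real system, `classify` would be a trained model and `assemble` would use learned relevance ranking; the point is the shape of the pipeline, not the heuristics.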

Most tools in the AI ecosystem address one of these stages. RAG systems handle retrieval. Prompt templates handle assembly. Token counters monitor delivery. Nobody — until now — has built a system that handles all five as an integrated pipeline.

Why Prompt Engineering Persists

If context engineering is the real discipline, why is the industry still focused on prompts?

Because prompt engineering is visible. You can see the prompt. You can iterate on it. You can share it. There’s a satisfying craftsmanship to writing a good prompt — the same satisfaction as writing a good SQL query or a clean function signature.

Context engineering is invisible. When it works, the model just gives better answers. The user doesn’t see the classification that happened before their request was processed. They don’t see the selective retrieval that pulled project-specific context instead of generic knowledge. They don’t see the effort calibration that decided this question needed a 200-token answer, not a 2,000-token analysis.

The best infrastructure is invisible. That’s the nature of the job.

There’s also a tooling gap. Prompt engineering tools are straightforward to build — text editors with model integration. Context engineering platforms require classification models, knowledge graphs, routing logic, feedback loops, effort calibration systems, and delivery optimization. The engineering complexity is an order of magnitude higher.

How gramatr Implements Context Engineering

This is where I’ll get specific about what gramatr actually does, because “context engineering platform” is meaningless without concrete architecture.

Every request is pre-classified. When a request comes in, a trained classification model determines the intent type (coding, research, deployment, content, analysis, and others), the effort level (seven levels from instant to comprehensive), and which of twenty-five registered capabilities are relevant. This classification happens locally, before the expensive model is invoked.
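gramatr’s classifier internals aren’t public, so here is a hypothetical sketch of what a pre-classification result could look like. The intent names and seven-level effort range come from the paragraph above; the `Classification` type, the capability names, and the keyword heuristics are invented for illustration.

```python
from dataclasses import dataclass

INTENTS = {"coding", "research", "deployment", "content", "analysis"}
EFFORT_LEVELS = range(1, 8)  # seven levels, instant (1) to comprehensive (7)

@dataclass(frozen=True)
class Classification:
    intent: str
    effort: int                    # 1..7
    capabilities: frozenset[str]   # subset of the registered capabilities

    def __post_init__(self):
        assert self.intent in INTENTS
        assert self.effort in EFFORT_LEVELS

def classify_locally(request: str) -> Classification:
    # Stand-in for a trained local model: a few keyword heuristics.
    text = request.lower()
    if "deploy" in text:
        return Classification("deployment", 4, frozenset({"ci", "infra"}))
    if "refactor" in text or "redesign" in text:
        return Classification("coding", 6, frozenset({"codebase", "tests"}))
    return Classification("coding", 2, frozenset({"codebase"}))
```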

Context is assembled per-classification. The intelligence pipeline constructs a context package based on the classification — not based on keyword matching or vector similarity alone. A coding request at effort level 3 gets a different context assembly than a coding request at effort level 6. The system understands that “quick fix” and “architecture redesign” are both coding tasks but need fundamentally different context.
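One way to picture per-classification assembly, under the assumption (not gramatr’s actual schema) that each effort level maps to a plan of context sections:

```python
# Hypothetical mapping from effort level to assembled context sections.
# Level 3 ≈ "quick fix", level 6 ≈ "architecture redesign" in this sketch.
ASSEMBLY_PLAN = {
    3: ["task_summary", "touched_files"],
    6: ["task_summary", "touched_files", "architecture_notes",
        "design_decisions", "project_history"],
}

def plan_assembly(intent: str, effort: int) -> list[str]:
    """Pick context sections from the classification, not keyword matching."""
    if intent != "coding":
        return ["task_summary"]
    floor = min(ASSEMBLY_PLAN)  # lowest planned level
    # Fall back to the nearest planned effort level at or below the request's.
    level = max(k for k in ASSEMBLY_PLAN if k <= max(effort, floor))
    return ASSEMBLY_PLAN[level]
```

Both calls below are coding tasks, yet they assemble fundamentally different context, which is the point of classifying before retrieving.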

Delivery is optimized for model performance. The assembled context targets roughly 5,000 tokens per request. Compare this to static approaches: a 40,000-token system prompt, or a RAG system that retrieves every relevant chunk regardless of the request’s actual complexity. More context is not better context. Right context is better context.

Every interaction feeds the feedback loop. When the classification is wrong — when a request gets routed to the wrong effort level or the wrong capabilities are selected — that signal trains the classifier. Over 901 evaluations and 4,189 queries, the classification accuracy has improved measurably. The pipeline gets smarter from usage, not from manual tuning.
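A feedback loop of this kind can be as simple as tracking per-intent accuracy from correction signals. This is a toy illustration, not gramatr’s implementation:

```python
from collections import defaultdict

class FeedbackLog:
    """Track per-intent classification accuracy from usage signals."""

    def __init__(self):
        # intent -> [correct predictions, total predictions]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, predicted: str, actual: str) -> None:
        # A mismatch is the correction signal that would retrain the classifier.
        correct, total = self.counts[predicted]
        self.counts[predicted] = [correct + (predicted == actual), total + 1]

    def accuracy(self, intent: str) -> float:
        correct, total = self.counts[intent]
        return correct / total if total else 0.0
```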

The pipeline produces intelligence packets, not augmented prompts. The output of gramatr’s context engineering pipeline isn’t “your prompt plus some retrieved context.” It’s a structured intelligence packet: pre-computed analysis, behavioral directives specific to this request type, relevant memory context, project state, and capability audit — all scoped to exactly what this request needs.

The difference matters. An augmented prompt says “here’s your question plus some context.” An intelligence packet says “here’s what we already know about this request, here’s what the model should focus on, here’s what’s relevant from your history, and here’s how to handle this type of task.”
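The packet components listed above can be pictured as a structured type. The field names mirror the description in this section, but the exact schema is an assumption:

```python
from dataclasses import dataclass

@dataclass
class IntelligencePacket:
    """Illustrative shape of an intelligence packet; not gramatr's actual schema."""
    precomputed_analysis: str         # what we already know about this request
    behavioral_directives: list[str]  # how to handle this type of task
    memory_context: list[str]         # what's relevant from your history
    project_state: str                # summarized, not enumerated
    capability_audit: list[str]       # which capabilities were selected
```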

The Industry Convergence

What Karpathy, Lutke, and Anthropic are recognizing — and what the developer frustration data confirms — is that we’ve reached the end of what prompt engineering alone can achieve.

The models are plateauing. IEEE Spectrum reported in January 2026 that “over the course of 2025, most of the core models reached a quality plateau, and more recently, seem to be in decline.” Developer trust in AI output is dropping — from above 70% in 2023-2024 to 60% in 2025.

The next productivity leap isn’t coming from bigger models. It’s coming from better context delivery. From systems that understand what the model needs for each specific task and deliver exactly that — no more, no less.

This is what Karpathy means by “the delicate art and science.” It’s not about brute force. It’s not about cramming everything into the context window. It’s about precision: the right information for the next step.

The industry is converging on this realization. The tooling is still catching up.

What Happens Next

Context engineering is currently a manual discipline. Developers manage their own context — curating system prompts, assembling project-specific context, manually selecting what to include and exclude. It works, but it doesn’t scale. It doesn’t learn. And it puts the burden on the human instead of the system.

The next step is automation. Systems that perform context engineering without human intervention — classifying requests, selecting relevant context, calibrating effort levels, and improving from feedback. Not because automation is always better, but because humans shouldn’t spend their cognitive budget on context management when they could spend it on the actual work.

Over the week of March 21-28, the gramatr intelligence pipeline processed 4,189 requests across five projects, saving an estimated 20.9 million tokens through intelligent routing. That’s 20.9 million tokens that didn’t need to be sent to the model because the pipeline already knew what was relevant.

Twenty million tokens of context, managed automatically. That’s context engineering at scale.

And the pipeline was better at it on day seven than day one. Because it learns.

That’s the difference between a discipline and a platform. A discipline tells you what to do. A platform does it for you — and gets better every time.