LLM-based agents have made code generation roughly a million times cheaper than it has ever been. Yet, looking around, we don’t see a million times more useful software being written. We don’t see AI labs laying off their software engineers — no, they are all expanding!
To understand why we aren’t shipping code at inference speed, we must understand the fundamental properties of language models that prevent this. Then, we can try to maximize our productivity within these constraints.
Part I: Three Fundamental Properties of LLM-Based Agents
Property 1: Finite Context Window
The hype behind LLMs is driven entirely by their ability to learn in-context. As a conversation progresses, the language model acquires new skills and knowledge without being trained. This is truly a magical property! Pre-LLMs, learning a new skill was synonymous with training a new neural network. As it turns out, humans also learn in-context. If we think of our genome as our “base model”, tuned via evolutionary learning over billions of years, then the knowledge and skills we acquire over our lifetime can be thought of as in-context learning, where the context window spans a lifetime of sensory experience. The analogy breaks down when we realize how tiny an LLM’s context window is compared to ours. Humans ingest ~8.75 megabits of visual data per second. Streamed to a language model, this would exhaust a 200k-token limit in less than half a second![^1]
Even if we ignore the nightmare that is serving long-context inference, we currently don’t know how to achieve infinite-context ICL. Every stage of LLM training — pretraining, midtraining, and RL — assumes a maximum context length. Solving this is AGI-complete.
Property 2: Token-Native, Not World-Native
Humans perceive and reason through visual input. Language itself is something we developed to transmit this information to other humans efficiently. We have to spend significant effort compressing our vision-over-time into language, and a ton of information is lost in the process.
LLMs, on the other hand, are trained only on this compressed representation of reality that humans have developed. Such a system has some superhuman strengths, e.g. in competitive programming, while making laughable mistakes, like failing to maintain a consistent model of physics. [Example here][^2]
So, we need to differentiate the two kinds of data: tokens and everything else. LLMs can only comprehend tokens. Everything else is foreign. Even LLM image understanding is bootstrapped from image captioning!
Another key limitation, which is closely tied to multimodal understanding, is that LLMs do not have any notion of time. But humans do. And (for now) LLMs exist to serve humans. So we need to take care that LLMs are not wasting time waiting on tasks, since waiting 3 hours feels the same to them as waiting 3 seconds. They can only feel their context window lengthening.
The big idea here is that there is a fundamental mismatch between how the world presents information, and how agents perceive information. This must be bridged with infrastructure.
Property 3: Two Types of Knowledge — Intuition vs. Reasoning
LLMs have two modes of “knowing”. One is parametric knowledge, accrued during pretraining: as they learn to predict the next token, they are forced to learn a consistent representation of language, and of the world. The other is interaction and in-context reasoning. This is what makes LLMs magical — they can learn new skills without being retrained. The information they learn this way is stored and processed in the KV cache — a heavy set of hidden states that grows with context length. This knowledge is distinct from parametric knowledge due to a few properties:
- One token in context is allocated more bits than a token seen during pretraining. Practically, this means the model has a high-resolution view of its context but only a fuzzy view of its pretraining knowledge.[^3]
- There is a tradeoff between how far outside the pretraining distribution a task is and how many tokens in context are required to solve it.[^4]
If you want to maximize how useful your agent can be, you must keep your tasks in distribution, or else you will run out of context quickly as it self-corrects and explores to fill in its world model.
Part II: Consequences
Everything below follows directly from the three properties above.
Consequence 1: Humans Plan, Agents Execute
Because LLMs have finite context, they cannot effectively do long-horizon planning, which involves reasoning over context spanning months or years. What they can do, however, is execute bounded sub-tasks with strictly defined success criteria — and they can do this with unrelenting effort. As a human, you must still plan, define the sub-tasks, and define success correctly.
Consequence 2: The Bottleneck Has Shifted from Generation to Verification
Follows from: Property 1 + Property 2 (agents generate code trivially, but can’t verify against the real world)
- Before AI: Writing code was the bottleneck. Our entire ecosystem — frameworks, abstractions, DX tooling — was built to minimize the cost of getting a feature shipped. Lines of code were expensive.
- After AI: Generating code is nearly free. The new bottlenecks are planning and verification.
- Planning is unsolved (Property 1). But verification is addressable — and it’s where the most valuable engineering work now happens.
- The most valuable code you write isn’t application code — it’s verification infrastructure that lets agents check their own work.
Consequence 3: Bridge the Verification Gap with Token-Efficient APIs
Follows from: Property 2 (token-native, not world-native)
- Humans verify by looking at things: watching a robot, observing a UI, checking physical behavior. This is high-context, low-density, multimodal — the exact opposite of what LLMs are good at.
- To close the loop, build software that bridges real-world state into token-efficient representations: scripts or services that tell the agent, in a limited number of tokens, whether it’s on track and what went wrong if not.
- Example — hardware simulators: If the agent writes code that controls hardware (robots, actuators), verifying correctness requires observing physical behavior in real time. Build an accurate simulator of the hardware’s API instead. The agent runs against the simulator at compute speed and debugs in a tight loop, no human needed. The simulator doesn’t need to be perfect — just accurate enough that verified code works on real hardware. A minimal sketch of this pattern follows below.
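To make this concrete, here is a minimal Go sketch of the simulator-as-bridge idea. Everything in it (the `Gripper` interface, `SimGripper`, `Report`) is a hypothetical name, not any real robotics API; the point is that the agent’s control code runs against the same interface the real driver implements, and gets back a compact, token-efficient verdict instead of a video feed.

```go
package sim

import "fmt"

// Gripper is the interface the agent's control code is written against.
// The real hardware driver and the simulator both implement it.
// (Gripper, SimGripper, and Report are hypothetical names for illustration.)
type Gripper interface {
	MoveTo(x, y, z float64) error
	Close() error
	Open() error
}

// SimGripper simulates the hardware just faithfully enough to catch the
// common failure modes: out-of-workspace moves, closing an already-closed
// gripper, and so on.
type SimGripper struct {
	x, y, z float64
	closed  bool
	issues  []string
}

var _ Gripper = (*SimGripper)(nil) // compile-time check that the interfaces match

func (s *SimGripper) MoveTo(x, y, z float64) error {
	// Illustrative workspace limits; a real simulator would load these
	// from the hardware spec.
	if x < 0 || x > 1 || y < 0 || y > 1 || z < 0 || z > 0.5 {
		s.issues = append(s.issues, fmt.Sprintf("MoveTo(%.2f, %.2f, %.2f) is outside the workspace", x, y, z))
	}
	s.x, s.y, s.z = x, y, z
	return nil
}

func (s *SimGripper) Close() error {
	if s.closed {
		s.issues = append(s.issues, "Close() called while already closed")
	}
	s.closed = true
	return nil
}

func (s *SimGripper) Open() error {
	s.closed = false
	return nil
}

// Report compresses an entire simulated run into a few tokens: exactly what
// the agent needs to decide whether it is on track, and nothing more.
func (s *SimGripper) Report() string {
	if len(s.issues) == 0 {
		return fmt.Sprintf("OK: ended at (%.2f, %.2f, %.2f), all commands within limits", s.x, s.y, s.z)
	}
	return fmt.Sprintf("FAIL: %d issue(s): %v", len(s.issues), s.issues)
}
```

The agent’s code takes a `Gripper`, so the same program can be pointed at the simulator during the debug loop and at the real driver once `Report` comes back clean.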
Consequence 4: E2E Tests Over Unit Tests
Follows from: Property 1 (coherence breaks down across context boundaries) + the nature of RL training
- Force the agent into test-driven development. Before implementing a feature, have it write tests that will pass assuming correct implementation. This constrains the solution space and prevents the agent from dumping code and declaring “done.”
- But which tests? LLMs are RL’d heavily on isolated, pure-function problems (Codeforces, LeetCode). Writing small functions that pass unit tests is literally what they’re superhuman at. Unit tests check the layer where agents almost never fail.
- Where agents do fail is maintaining coherence across a large codebase — components interacting incorrectly, state flowing wrong across boundaries, system-level behavior diverging from intent. These are cross-context errors, exactly the kind that arise when a model can’t hold the full system in its window.
- E2E tests (e.g., Playwright) catch the actual failure mode. They should be simple: click this button, expect this UI state, check this value. Not complicated to write, but they verify what you actually care about — does the app work? A sketch of this kind of test follows this list.
- Match your verification to the failure mode. Unit tests check isolation (where agents are superhuman). E2E tests check integration (where agents actually break).
- Side note: unit tests are great for human developers, because humans do make mistakes on small isolated logic. The testing strategy should differ based on who’s writing the code.
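As a concrete sketch (in Go rather than Playwright, to match the rest of this post), here is what such a test can look like. The module path `example.com/app`, the `NewServer` constructor, and the `/todos` route are all hypothetical; the point is that the test drives the whole app through its HTTP surface and asserts on what a user would actually see.

```go
package app_test

import (
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"testing"

	"example.com/app" // hypothetical module path for the app under test
)

func TestCreatedTodoAppearsOnHomepage(t *testing.T) {
	// Boot the entire application in-process: router, templates, state.
	srv := httptest.NewServer(app.NewServer())
	defer srv.Close()

	// Drive it the way a user would: submit the form...
	if _, err := http.PostForm(srv.URL+"/todos", url.Values{"title": {"buy milk"}}); err != nil {
		t.Fatalf("POST /todos: %v", err)
	}

	// ...then load the homepage and check the visible result.
	resp, err := http.Get(srv.URL + "/")
	if err != nil {
		t.Fatalf("GET /: %v", err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if !strings.Contains(string(body), "buy milk") {
		t.Errorf("homepage does not show the new todo; body:\n%s", body)
	}
}
```

One test like this covers routing, rendering, and state flowing across component boundaries — exactly the layer where agents break.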
Consequence 5: Optimize for Verification Loop Speed
Follows from: Property 2 (agents don’t experience time, but you do) + verification being the bottleneck
- Since verification is the bottleneck and agents iterate by running code repeatedly, the speed of compile → run → check → fix matters enormously.
- LLMs have no sense of time. They only experience tokens. They’ll wait 2 hours for a Rust compilation without complaint. But you experience time. You’re the one sitting there.
- Humans offset slow compiles by doing other work while waiting. Agents don’t multitask that way (yet).
- Optimize for fast everything: fast compiles, fast test runs, fast feedback loops. A sketch of a single terse verification command follows below.
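One simple way to get there is a single command that runs the whole check and reports tersely. This is a minimal sketch under assumed choices (the step list and the `tools/verify` layout are illustrative), but it shows the shape: one invocation, a timed one-line verdict per step, full output only on failure.

```go
// tools/verify/main.go — run the whole check with one command: `go run ./tools/verify`.
// Each step prints a short, timed pass/fail line; full output appears only on failure,
// which keeps the feedback loop fast for you and token-efficient for the agent.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	steps := []struct {
		name string
		args []string
	}{
		{"build", []string{"go", "build", "./..."}},
		{"vet", []string{"go", "vet", "./..."}},
		{"test", []string{"go", "test", "./..."}},
	}

	for _, s := range steps {
		start := time.Now()
		out, err := exec.Command(s.args[0], s.args[1:]...).CombinedOutput()
		elapsed := time.Since(start).Round(time.Millisecond)
		if err != nil {
			// On failure, the output itself is the signal the agent needs.
			fmt.Printf("FAIL %-5s (%s)\n%s\n", s.name, elapsed, out)
			os.Exit(1)
		}
		fmt.Printf("ok   %-5s (%s)\n", s.name, elapsed)
	}
}
```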
Consequence 6: Choose In-Distribution, Simple, Fast Tools
Follows from: Property 3 (intuition is free, reasoning costs context) + Consequence 5 (loop speed)
- In-distribution tooling: Standard, popular languages and frameworks let the agent rely on parametric knowledge (intuition) — fast, free, no context cost. Esoteric or novel tools force the agent into chain-of-thought reasoning about its environment, which consumes the same finite context it needs for the actual task. This isn’t just slower — it fundamentally reduces how much the agent can accomplish per context window. If you write your web server in APL, even the most powerful model might manage one feature before running out of context.
- Simplicity over cleverness: Frameworks with lots of magic, implicit behavior, and complex build pipelines create verification friction. You want the simplest, most transparent toolchain possible.
- Concrete example — why Go works well (a minimal sketch follows this list):
  - Fast compile times (the agent is rebuilding constantly)
  - Fast run times (verification loops stay quick)
  - Simple, readable code (easy for the agent to reason about, easy for you to review)
  - Server + frontend in one codebase (less context-switching)
  - Template-based rendering (easy to debug)
  - Go code is “fluffy” (verbose, not dense) — bad for token efficiency, but great for comprehension
- You might never choose Go for yourself (the type system complaints are valid). But Go optimizes for the things that matter in agent-driven workflows.
- What to avoid: Frameworks like Next.js that are optimized for developer ergonomics — a bottleneck that no longer exists. Languages like Rust where compile times create painful iteration loops for the human observer, even though the agent doesn’t care.
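To ground this, here is roughly what that setup can look like: a Go server that is also the frontend, rendering a standard-library template. The route and page content are placeholders; the point is that there is nothing here the agent has to reason about from scratch, since it is all squarely in-distribution Go.

```go
// main.go — a minimal sketch of "server + frontend in one codebase" with
// template-based rendering. No build pipeline, no implicit framework magic.
package main

import (
	"html/template"
	"log"
	"net/http"
)

// The entire "frontend" is a template rendered by the server.
var page = template.Must(template.New("index").Parse(`<!DOCTYPE html>
<html>
  <body>
    <h1>{{.Title}}</h1>
    <ul>{{range .Items}}<li>{{.}}</li>{{end}}</ul>
  </body>
</html>`))

func index(w http.ResponseWriter, r *http.Request) {
	data := struct {
		Title string
		Items []string
	}{
		Title: "Todos",
		Items: []string{"write the verifier", "wire up e2e tests"},
	}
	if err := page.Execute(w, data); err != nil {
		log.Printf("render: %v", err)
	}
}

func main() {
	http.HandleFunc("/", index)
	log.Println("listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Nothing about this is exciting, which is exactly the point: the agent can hold the whole stack in its context, and so can you.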
Consequence 7: Parallelize Independent Work Across Agent Sessions
Follows from: Property 1 (each agent has its own context window)
- When features are independent, spin up separate Claude Code sessions on separate git branches
- Each agent works in isolation on its own feature
- Natural parallelism: one human architect, multiple agent builders, multiple features progressing simultaneously
Conclusion / Looking Forward
- The value of a developer is shifting from code production to system design + verification infrastructure
- Frameworks and tooling built for the pre-AI era are over-optimized for the wrong bottleneck
- New tools and languages may emerge that are optimized for agent-driven development rather than human-driven development
- These principles hold as long as we’re using transformer-based, finite-context, token-native architectures. When that changes, the game changes. Until then, work with the architecture, not against it.
[^1]: Assuming a 128k vocab size, one token can encode up to $\log_2(128{,}000) \approx 17$ bits, so a 200k-token context window holds about $200{,}000 \times 17 \approx 3.4 \times 10^6$ bits and would take roughly $3.4 \times 10^6 / (8.75 \times 10^6) \approx 0.39$ seconds to fill. An expected objection to this naive calculation is that humans neither use nor are conscious of the whole 8.75 Mbps of visual data — most of it is immediately forgotten. I argue it doesn’t matter. A language model’s context is its perceptive field — all the information it does ICL over. A human does ICL over that much data in about 0.39 seconds.