
How I Vibe Code Infra That Works (Draft)

LLM-based agents have made code generation roughly a million times cheaper than it has ever been. But looking around, we don’t see a million times more useful software being written. We don’t see AI labs laying off their software engineers — no, they are all expanding!

To understand why we aren’t shipping code at inference speed, we must understand the fundamental properties of language models that prevent this. Then, we can try to maximize our productivity within these constraints.

Part I: Three Fundamental Properties of LLM-Based Agents

Property 1: Finite Context Window

The hype behind LLMs is entirely driven by their ability to learn in-context. As a conversation progresses, the language model acquires new skills and knowledge without being trained. This is truly a magical property! Before LLMs, learning a new skill was synonymous with training a new neural network. As it turns out, humans also learn in-context. If we think of our genome as our “base model”, tuned via evolutionary learning over billions of years, then the knowledge and skills we acquire over our lifetime can be thought of as in-context learning, where our context window spans a lifetime of sensory experience. The analogy breaks down when we realize how tiny an LLM’s context window is compared to ours. Humans ingest ~8.75 megabits of sensory data per second. If streamed to a language model, this would exhaust a 200k-token limit in less than half a second![^1]

Even if we ignore the nightmare that is serving long-context inference, we currently don’t know how to achieve infinite-context ICL. Every stage of LLM training — pretraining, midtraining, and RL — assumes a maximum context length. Solving this is AGI-complete.

Property 2: Token-Native, Not World-Native

Diagram showing the gap between world-native and token-native perception

Humans perceive and reason through visual input. Language itself is something we developed to transmit this information to other humans in an efficient manner. We have to spend significant effort compressing our vision-over-time into language, and in that process a ton of information is lost.

LLMs, on the other hand, are trained only on this compressed representation of reality that humans have developed. Such a system has superhuman strengths (e.g., in competitive programming) while making laughable mistakes, like failing to keep a consistent model of physics. [Example here][^2]

So, we need to differentiate the two kinds of data: tokens and everything else. LLMs can only comprehend tokens. Everything else is foreign. Even LLM image understanding is bootstrapped from image captioning!

Another key limitation, which is closely tied to multimodal understanding, is that LLMs do not have any notion of time. But humans do. And (for now) LLMs exist to serve humans. So we need to take care that LLMs are not wasting time waiting on tasks, since waiting 3 hours feels the same to them as waiting 3 seconds. They can only feel their context window lengthening.

The big idea here is that there is a fundamental mismatch between how the world presents information, and how agents perceive information. This must be bridged with infrastructure.

Property 3: Two Types of Knowledge — Intuition vs. Reasoning

LLMs have two modes of “knowing”. One is parametric knowledge, accrued during pretraining: as they learn to predict the next token, they are forced to learn a consistent representation of language, and of the world. The other is interaction and in-context reasoning. This is what makes LLMs magical — they can learn new skills without being retrained. The information they learn is stored and processed in the KV cache, a heavy set of hidden state that grows with context length. This knowledge is distinct from parametric knowledge in a few ways:

  1. One token in context is allocated more bits than one token seen during pretraining. Practically, this means the model has a high-resolution view of its context but only a fuzzy view of pretraining knowledge. [^3]

  2. There is a tradeoff between how out-of-pretraining-distribution a task is and how many tokens in context are required to solve it. [^4]

If you want to maximize how useful your agent can be, you must keep your tasks in distribution, or else you will run out of context quickly as it self-corrects and explores to fill in its world model.

Part II: Consequences

Everything below follows directly from the three properties above.

Consequence 1: Humans Plan, Agents Execute

Because LLMs have finite context, they cannot effectively do long-horizon planning, which involves reasoning over context spanning months or years. This doesn’t mean they cannot assist with architectural planning — it just means they cannot continually incorporate requirements and context over time to inform these decisions. What they can do (extremely well), however, are bounded sub-tasks with strictly defined success criteria — and they do this with unrelenting effort. As a human, you must still plan, define sub-tasks, and define success correctly. This is now the primary purpose of a software engineer.

Consequence 2: The Bottleneck Has Shifted from Generation to Verification

Before we had agents, the entire developer ecosystem was optimized on the assumption that the bottleneck was code generation. Billion-dollar SaaS companies exist just to reduce the cognitive overhead of writing code. This is no longer true.

Now, the bottleneck is verification, and the common wisdom must be revised to account for this. The question now is: how can we automate verification? Given our constraints, we need to optimize our infrastructure and tooling to be LLM-friendly. This means the most valuable code we write is verification infrastructure that lets agents check their own work.

Consequence 3: Bridge the Verification Gap with Token-Efficient APIs

The behavior of a lot of software is verified only by looking at things: watching a robot, observing a UI, etc. This is high-context, low-density, multimodal data — exactly what LLMs cannot ingest. If you’re working with hardware or embedded systems, it’s probably worth writing an accurate software simulator for the system, so that LLMs can debug and verify its interfacing code in parallel.

Consequence 4: E2E Tests Over Unit Tests

Agents are heavily trained with RLVR for coding. They’re put in isolated software engineering environments, given a problem and a suite of test cases, and trained to make them all green. But most real-world software is difficult to express this way. Most of the time, your unit tests don’t guarantee end-to-end correctness. And if you ask a model to write unit tests to verify its own work, it tends to test only logical units rather than integrations, which is where most bugs usually live.

To get around this, always require your agents to write end-to-end tests: tests that, when green, guarantee the code is correct as a whole. They’re usually far more expensive to run than unit tests, as they require simulating or running the entire system. But they are necessary. Dedicate resources and infrastructure to deterministic e2e tests.

Consequence 5: Optimize for Verification Loop Speed

There is a mismatch in incentives between humans and models, mainly with regard to time, since models do not feel time passing.

Follows from: Property 2 (agents don’t experience time, but you do) + verification being the bottleneck

  • Since verification is the bottleneck and agents iterate by running code repeatedly, the speed of compile → run → check → fix matters enormously.
  • LLMs have no sense of time. They only experience tokens. They’ll wait 2 hours for a Rust compilation without complaint. But you experience time. You’re the one sitting there.
  • Humans offset slow compiles by doing other work while waiting. Agents don’t multitask that way (yet).
  • Optimize for fast everything: fast compiles, fast test runs, fast feedback loops.

Consequence 6: Choose In-Distribution, Simple, Fast Tools

Follows from: Property 3 (intuition is free, reasoning costs context) + Consequence 5 (loop speed)

  • In-distribution tooling: Standard, popular languages and frameworks let the agent rely on parametric knowledge (intuition) — fast, free, no context cost. Esoteric or novel tools force the agent into chain-of-thought reasoning about its environment, which consumes the same finite context it needs for the actual task. This isn’t just slower — it fundamentally reduces how much the agent can accomplish per context window. If you write your web server in APL, even the most powerful model might manage one feature before running out of context.
  • Simplicity over cleverness: Frameworks with lots of magic, implicit behavior, and complex build pipelines create verification friction. You want the simplest, most transparent toolchain possible.
  • Concrete example — why Go works well:
    • Fast compile times (the agent is rebuilding constantly)
    • Fast run times (verification loops stay quick)
    • Simple, readable code (easy for agent to reason about, easy for you to review)
    • Server + frontend in one codebase (less context-switching)
    • Template-based rendering (easy to debug)
    • Go code is “fluffy” (verbose, not dense) — bad for token efficiency, but great for comprehension
    • You might never choose Go for yourself (the type system complaints are valid). But Go optimizes for the things that matter in agent-driven workflows.
  • What to avoid: Frameworks like Next.js that are optimized for developer ergonomics — a bottleneck that no longer exists. Languages like Rust where compile times create painful iteration loops for the human observer, even though the agent doesn’t care.

Consequence 7: Parallelize Independent Work Across Agent Sessions

Follows from: Property 1 (each agent has its own context window)

  • When features are independent, spin up separate Claude Code sessions on separate git branches
  • Each agent works in isolation on its own feature
  • Natural parallelism: one human architect, multiple agent builders, multiple features progressing simultaneously

Conclusion / Looking Forward

  • The value of a developer is shifting from code production to system design + verification infrastructure
  • Frameworks and tooling built for the pre-AI era are over-optimized for the wrong bottleneck
  • New tools and languages may emerge that are optimized for agent-driven development rather than human-driven development
  • These principles hold as long as we’re using transformer-based, finite-context, token-native architectures. When that changes, the game changes. Until then, work with the architecture, not against it.

  1. Assuming a 128k vocab size, one token can encode up to log₂(131,072) = 17 bits. So a 200k-token context window holds at most 200,000 × 17 = 3.4 megabits, which at 8.75 Mbps would take 3.4 / 8.75 ≈ 0.39 seconds to fill up. An expected objection to this naive calculation is that humans don’t use, nor are conscious of, the whole 8.75 Mbps of visual data — most of it is immediately forgotten. I argue it doesn’t matter. A language model’s context is its perceptive field: all information that it does ICL over. A human does ICL over that much data in 0.39 seconds. ↩︎

✦ No LLMs were used in the ideation, research, writing, or editing of this article.