ram is all you need
2026-2-28
"Man, these LLMs are the smartest POS software engineers I have ever met"
hi. it's been a while.
i've been spending an unreasonable amount of time staring at the exact shape of data flowing into and out of an agent. not the vibes of it. the actual anatomy: what specific pieces of information does a model attend to when it decides what token comes next?
if you don't care about AI, sorry, this one's going to hurt. but if you're even mildly curious about why these systems produce brilliant code one minute and absolute garbage the next, i think the mechanics are genuinely fascinating.
there are really only three knobs that determine how good a response is:
- the pre-training. the soul of the model. terabytes of internet poured into weights.
- the fine-tuning. RLHF, guardrails, the stuff labs do after the fact.
- the context. everything in the prompt right now, at inference time.
two of those knobs belong to the labs. we get the third one. just 1/3 of the pie. but here's what's wild: that final third might actually matter more than the other two combined.
the string
to understand why, you need to internalize one thing: LLMs are stateless.
there's no memory. there's no hidden reasoning engine running between requests. the only way to get better tokens out is to put better tokens in. everything, everything, the model knows about your problem is whatever you managed to cram into the current sequence.
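to make "stateless" concrete, here's a sketch of what every client actually does under the hood. `buildRequest` and the message shape are hypothetical stand-ins, not any specific provider's API, but the key property holds everywhere:

```typescript
// the model keeps nothing between calls, so the client keeps everything.
// this message shape is illustrative, not any particular provider's API.
type Message = { role: "system" | "user" | "assistant"; content: string };

const history: Message[] = [
  { role: "system", content: "you are a coding agent" },
];

function buildRequest(userInput: string): Message[] {
  history.push({ role: "user", content: userInput });
  // note: the ENTIRE history ships with every single request. there is no
  // session id, no server-side memory. delete this array and the model
  // knows nothing about you.
  return [...history];
}
```

that `[...history]` is the whole trick. "memory" in a chatbot is just the client re-sending the transcript every time.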
nobody lets a user talk to a model raw. every production system wraps the user's query inside what i've been calling a harness: a middleware layer that reconstructs reality into one big string before the model ever sees it.
let's build one up from scratch.
first, the system prompt. the ground rules.
<context>
<system_policies>
- you are an autonomous coding agent
- you will not assist in harmful activities
</system_policies>
</context>
a system prompt alone is useless though. the agent doesn't know what kind of project it's working in. so we layer on developer context.
<context>
<system_policies>
- you are an autonomous coding agent
- you will not assist in harmful activities
</system_policies>
<developer_context>
- prioritize answers about next.js and react
</developer_context>
</context>
then memory. the model needs to know what already happened in this conversation, so we append a compressed chat history.
<context>
<system_policies> ... </system_policies>
<developer_context> ... </developer_context>
<chat_history_summarized>
... [compressed prior semantic context] ...
</chat_history_summarized>
</context>
and only then, at the very bottom, do we finally stick in what the user actually typed.
<context>
<system_policies> ... </system_policies>
<developer_context> ... </developer_context>
<chat_history_summarized> ... </chat_history_summarized>
<user_prompt>
"how do I center a div?"
</user_prompt>
</context>
all of those layers? they get concatenated into a single finite token array and shoved into the model. here's what that looks like as it accumulates over time:
Context Composition
see how fast the foundational stuff gets buried? the system prompt, the developer rules, they're down at the bottom of the cylinder, getting crushed under the weight of every new user message.
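mechanically, that whole cylinder is just string assembly. a minimal sketch of the layering, where the tag names mirror the examples in this post (they're conventions i made up, not a spec):

```typescript
// assemble named context layers into the single string the model sees.
// layer order matters: earlier layers end up further from the model's
// attention as the sequence grows.
function buildContext(layers: { tag: string; body: string }[]): string {
  const inner = layers
    .map(({ tag, body }) => `<${tag}>\n${body}\n</${tag}>`)
    .join("\n");
  return `<context>\n${inner}\n</context>`;
}

const prompt = buildContext([
  { tag: "system_policies", body: "- you are an autonomous coding agent" },
  { tag: "chat_history_summarized", body: "... [compressed prior context] ..." },
  { tag: "user_prompt", body: '"how do I center a div?"' },
]);
```

everything that follows in this post is really just arguing about what goes in that array, and in what order.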
it gets worse.
a chatbot that can't do anything is just autocomplete with a personality. if you want real agent behavior (reading files, running code, making API calls), you need to give it tools. so we inject the MCP state: a live JSON manifest of every tool the agent has access to, with full parameter schemas.
<context>
<system_policies> ... </system_policies>
<developer_context> ... </developer_context>
<mcp_state>
- tool_1: execute_code: { schema: { ... massive json ... } }
- tool_2: read_file: { schema: { ... massive json ... } }
<!-- ... 15 more tools ... -->
</mcp_state>
<chat_history_summarized> ... </chat_history_summarized>
<user_prompt> ... </user_prompt>
</context>
tool definitions are heavy. like, startlingly heavy. each one can be hundreds of tokens of JSON schema. fifteen tools and you've burnt a meaningful chunk of your context budget before the user even says hello.
Context Composition
look at all that green. the tool schemas are eating the cylinder alive. the system directives, the things that tell the model who it is and what it's not allowed to do, are getting pushed further and further away from where the model's attention actually lands.
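back-of-envelope on that claim, using the common ~4 characters per token heuristic (a rough approximation; real tokenizers vary, and the schema size here is a made-up average):

```typescript
// rough token estimate: ~4 chars per token is a common heuristic, not exact.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// a single tool schema easily runs 1-2 KB of JSON once you include
// descriptions, parameter types, and required fields. 1600 chars is a
// hypothetical average for illustration.
const toolSchemaChars = 1600;
const toolCount = 15;
const schemaTokens = estimateTokens("x".repeat(toolSchemaChars)) * toolCount;
// ~400 tokens per tool * 15 tools = ~6,000 tokens spent before the user
// types a single character
```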
and here's the security implication that keeps people up at night: transformer attention is heavily biased toward the end of the sequence. if your safety policy is sitting 128k tokens away at the top while the user's prompt is right at the bottom, the model effectively forgets the policy exists. this is, quite literally, how jailbreaks work.
so we do something clever. we wrap the identity. we repeat the core constraints at the very end.
<context>
<system_policies>
- you are an autonomous agent
- you will not assist in harmful or illegal activities
</system_policies>
<developer_context>
- prioritize answers about software engineering
</developer_context>
<mcp_state>
- tool_1: execute_code: { schema: { ... massive json ... } }
- tool_2: read_file: { schema: { ... massive json ... } }
<!-- ... 15 more tools ... -->
</mcp_state>
<chat_history_summarized>
... [compressed prior semantic context] ...
</chat_history_summarized>
<user_prompt>
"how do I center a div?"
</user_prompt>
<system_reminder>
remember your core directives before predicting the next token.
</system_reminder>
</context>
now the critical policies appear at both the beginning and the end of the sequence. the attention mechanism can't avoid them.
trajectories and the dumb zone
dex had a great framing for this in his [context engineering talk]: stop vibe coding.
vibe coding is when you yell at an agent over and over until something compiles. think about what that does to the context. every failed attempt, every "no try again", every "that's wrong fix it", all of that stays in the sequence. the model reads its own failures and concludes, based on the statistical weight of the conversation so far, that the most probable next move is to fail again. the context has poisoned itself. dex calls this a bad trajectory.
here's the thing that nobody talks about though: even with a good trajectory, you're still losing.
models advertise 200k token windows. but if you actually push past 40-60% utilization, especially with bloated tool schemas competing for space, reasoning quality drops off a cliff. i've been calling this the dumb zone. the model technically has capacity left, but it can't think well anymore. you end up with 20,000-line PRs full of hallucinated imports and copy-pasted spaghetti.
the fix is stupidly simple in concept, annoyingly disciplined in practice: don't let the context get that full. break the work into phases, and between each phase, flush the window and start clean.
phase 1: research
resist the urge to ask the agent to write code. seriously. the moment it generates code, its attention shifts toward its own output and away from understanding the problem.
instead, use it as a research tool. have it read the relevant files, trace the data flow, and write its findings into a document. not code. a document.
<user_prompt>
we have a terrible race condition in the auth flow.
do not change any files.
explore `auth.ts`, `middleware.ts`, and `session.ts`.
write a comprehensive markdown summary of how the state flows
and exactly where the race condition occurs into `research.md`.
</user_prompt>
when it's done? delete the chat. nuke it. that entire exploration, every file it read, every wrong turn it took, flush it all. the only thing that survives is research.md.
phase 2: planning
fresh window. clean context. the agent doesn't need to re-read the codebase because research.md already contains everything it learned, compressed down to the essentials.
<context>
<file_context>
# research.md
the race condition occurs because `middleware.ts` fires before `auth.ts`
populates the session cookie.
</file_context>
</context>
<user_prompt>
based on `research.md`, draft a step-by-step implementation plan
to fix the race condition. outline the exact file changes.
save this to `plan.md`.
</user_prompt>
plan's done? flush again.
phase 3: implementation
now you load only plan.md and the minimum files needed for the current step. that's it. the context is lean. the model can actually reason.
<context>
<file_context>
# plan.md
step 1: modify `middleware.ts` to await session verification.
step 2: ...
</file_context>
</context>
<user_prompt>
execute step 1 of the plan.
</user_prompt>
step 1 done? update the plan, flush, reload, do step 2. you never let the context grow unchecked.
it's tedious. but it works absurdly well.
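the whole cycle can be written down as a loop. this is a shape, not a working agent: `runAgent` is a hypothetical stand-in for a real harness invocation, and the detail that matters is that every phase starts from a fresh context containing only the artifact the previous phase wrote.

```typescript
// hypothetical driver for the research -> plan -> implement cycle.
// runAgent stands in for a real agent call; each invocation gets a
// FRESH context holding only the prior phase's compacted artifact.
type Agent = (context: string, task: string) => string;

function runCycle(runAgent: Agent): string {
  // phase 1: explore with an empty context; keep only the written artifact
  const research = runAgent("", "explore the code, write findings to research.md");

  // phase 2: flush everything; the plan is drafted from research.md alone
  const plan = runAgent(research, "draft a step-by-step plan into plan.md");

  // phase 3: flush again; implement from the plan, nothing else
  return runAgent(plan, "execute step 1 of the plan");
}
```

the exploratory noise from phase 1 physically cannot leak into phase 3, because it was never passed forward.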
bash, skills, and the just-bash fiasco
all of that context discipline addresses how much ends up in the window. but there's a separate question that researchers at vercel labs ran into: what kind of data are we putting in there?
for a while, the obvious move was to give agents a real bash shell. let them grep, find, cat, do whatever a human would do in a terminal. and honestly? during the research phase, a raw shell is incredible. when you're stumbling through an unfamiliar codebase trying to figure out where anything lives, there's nothing more powerful than piping find | grep | head.
the problem is what happens when you leave that shell open during implementation.
the agent runs ls -R looking for a file. three thousand lines of directory listing get dumped into the context. it runs cat package-lock.json trying to check a dependency version. forty thousand lines. gone. the cylinder fills with noise.
and here's what makes it genuinely painful: the answer was in there. the one line the agent actually needed? it's in the output. buried under an avalanche of irrelevant stdout. a tiny green crumb of useful context, drowning in red.
Context Composition
vercel's solution was to build agent skills, specifically a tool called bash-tool (backed by just-bash). instead of a real shell, it's a sandboxed, in-memory simulation written in typescript. the agent gets access to grep, find, jq, but they operate over a virtual filesystem. the output is constrained. predictable. small.
<!-- instead of dumping the whole file... -->
<user_prompt>
use your bash skill to run:
grep "getSession" auth.ts
</user_prompt>
<!-- ...just the line you need -->
<tool_response>
export const getSession = async () => { ... }
</tool_response>
the distinction matters. skills decouple exploring from flooding. the agent can still look around. it just can't accidentally nuke its own context budget doing it.
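you can get a surprising amount of that benefit with a dumb guard: cap how much stdout is ever allowed into the window. this is not vercel's actual bash-tool, just a sketch of the core idea:

```typescript
// cap tool output before it reaches the context. not just-bash itself,
// just the principle: exploration is fine, flooding is not.
function capOutput(stdout: string, maxLines = 50, maxChars = 4000): string {
  const lines = stdout.split("\n");
  let out = lines.slice(0, maxLines).join("\n").slice(0, maxChars);
  if (lines.length > maxLines || stdout.length > maxChars) {
    // tell the agent its query was too broad so it can refine it
    out += `\n[truncated: ${lines.length} lines total; refine your query]`;
  }
  return out;
}
```

the truncation notice matters as much as the truncation: the agent learns its grep was too greedy and narrows the next one, instead of silently reasoning over a mutilated listing.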
Context Composition
look at how different the cylinder is now. the green blocks are tiny. the orange user prompts are smaller too: when the agent can fetch its own context, you don't have to hand-hold it with massive instructions.
one real caveat though: skills are abstractions, and abstractions rot. when the underlying libraries ship breaking changes, your simulated tools break silently. you have to maintain them. that's the tax you pay for not dumping raw stdout everywhere.
the compaction cycle
skills fix how data enters the context. but context still accumulates. every tool response, every model reply, every follow-up question, it all stacks. even with surgical skills, you'll eventually hit the dumb zone if you don't actively manage the window.
this is what the research/plan/implement cycle is actually doing mechanically: compaction. you let the context fill up with exploratory noise, then you crush it down into a dense artifact, throw away the noise, and start the next phase on a clean foundation.
Compaction Cycle
that cyan block at the bottom? that's your research.md or plan.md. the entire exploration history, hundreds of tool calls, dead ends, wrong turns, compressed into one dense, surviving artifact. and then the cycle starts again. new context fills on top of the compacted base, gets compacted again, and so on.
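mechanically, compaction is three moves: summarize, discard, reseed. a sketch with a hypothetical `summarize` function standing in for what would be an LLM call in practice:

```typescript
// compaction: crush a noisy history into one dense artifact, then start
// the next phase on top of it. `summarize` is a stand-in for an LLM call.
type Turn = { role: string; content: string };

function compact(
  history: Turn[],
  summarize: (h: Turn[]) => string,
): Turn[] {
  const artifact = summarize(history); // e.g. the contents of research.md
  // everything else is discarded; the artifact is the only survivor,
  // reseeded as the base of the next window
  return [{ role: "system", content: artifact }];
}
```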
[opencode] takes this idea further with what they call lore, persistent compressed memory that survives across sessions. you close the terminal, come back tomorrow, and the agent already knows your project's architecture because it loaded its own compacted understanding from last time. pre-compacted context before you've typed a single character.
semantic indexing
there's one more approach worth looking at, and it comes from a different angle entirely.
skills are reactive. the agent decides it needs to know about getSession, fires a skill, gets the result. but what if the agent didn't need to ask at all? what if the entire codebase was already pre-digested into a queryable index?
that's what btca does. it clones a repository, builds a semantic map (function signatures, module boundaries, type hierarchies, dependency graphs), and exposes the whole thing as a structured interface. the agent doesn't read files. it queries a knowledge graph.
Semantic Indexer
the red blocks are raw code files. they hit the blue filter membrane and get absorbed. what comes out the other side is tiny, precise, pre-extracted semantic shards. the agent never sees the noise. it only sees what it would have extracted anyway, except the extraction happened at index time, not at inference time.
this is a fundamentally different tradeoff than skills. skills are cheap but reactive. indexing is expensive upfront but gives the agent pre-computed context that costs almost nothing to retrieve. for a 50-file project, skills are fine. for a quarter-million-line monorepo? you probably want an index.
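the idea reduces to: extract once at index time, query cheaply at inference time. here's a toy version that only maps exported function names to their signatures (btca's real index is far richer than this; the regex and shape here are purely illustrative):

```typescript
// toy semantic index: extract exported function signatures once, then
// answer lookups without ever putting raw files into the context.
// (a real indexer would cover types, modules, and dependency graphs.)
function buildIndex(files: Record<string, string>): Map<string, string> {
  const index = new Map<string, string>();
  const sig = /export (?:const|function) (\w+)[^\n]*/g;
  for (const [path, source] of Object.entries(files)) {
    for (const m of source.matchAll(sig)) {
      index.set(m[1], `${path}: ${m[0]}`); // name -> "file: signature line"
    }
  }
  return index;
}
```

the indexing cost is paid once per clone. after that, "what's the signature of getSession?" costs one map lookup instead of a file read dumped into the window.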
alignment and the hierarchy
there's a broader point here that goes beyond token counts.
when an agent produces a 1,000-line PR, reviewing the diff is a nightmare. but reviewing the 50-line plan.md that generated it? easy. this is what i mean by mental alignment: keeping the human team synchronized on why the architecture is changing, not just what changed. you review the plan, not the output. the plan is the source code. the code is the compiled artifact.
this completely changes the cost structure of mistakes:
- bad research → thousands of wasted lines. the agent solved the wrong problem.
- bad plan → hundreds of broken lines. it solved the right problem wrong.
- bad implementation → trivial. the plan was right, the agent just fumbled a function signature. easy fix.
the earlier the mistake, the more expensive it is. which means the most valuable thing you can do isn't writing better code. it's writing better prompts. better research docs. better plans. everything upstream of the code.
harness the context. compact aggressively. index when you can. and always, always wrap the identity.
