
simonwillison.net - Agentic Engineering Patterns - Simon Willison's Weblog

Patterns for getting the best results out of coding agents like Claude Code and OpenAI Codex.


I use the term agentic engineering to describe the practice of developing software with the assistance of coding agents.

What are coding agents? They’re agents that can both write and execute code. Popular examples include Claude Code, OpenAI Codex, and Gemini CLI.

What’s an agent? Clearly defining that term is a challenge that has frustrated AI researchers since at least the 1990s but the definition I’ve come to accept, at least in the field of Large Language Models (LLMs) like GPT-5 and Gemini and Claude, is this one:

Agents run tools in a loop to achieve a goal

The “agent” is software that calls an LLM with your prompt and passes it a set of tool definitions, then calls any tools that the LLM requests and feeds the results back into the LLM.

For coding agents, those tools include one that can execute code.

Your prompt to the coding agent defines a goal. The agent then generates and executes code in a loop until that goal has been met.

Code execution is the defining capability that makes agentic engineering possible. Without the ability to directly run the code, anything output by an LLM is of limited value. With code execution, these agents can start iterating towards software that demonstrably works.

Now that we have software that can write working code, what is there left for us humans to do?

The answer is so much stuff.

Writing code has never been the sole activity of a software engineer. The craft has always been figuring out what code to write. Any given software problem has dozens of potential solutions, each with their own tradeoffs. Our job is to navigate those options and find the ones that are the best fit for our unique set of circumstances and requirements.

Getting great results out of coding agents is a deep subject in its own right, especially now as the field continues to evolve at a bewildering rate.

We need to provide our coding agents with the tools they need to solve our problems, specify those problems in the right level of detail, and verify and iterate on the results until we are confident they address our problems in a robust and credible way.

LLMs don’t learn from their past mistakes, but coding agents can, provided we deliberately update our instructions and tool harnesses to account for what we learn along the way.

Used effectively, coding agents can help us be much more ambitious with the projects we take on. Agentic engineering should help us produce more, better quality code that solves more impactful problems.

The term “vibe coding” was coined by Andrej Karpathy in February 2025 — coincidentally just three weeks prior to the original release of Claude Code — to describe prompting LLMs to write code while you “forget that the code even exists”.

Some people extend that definition to cover any time an LLM is used to produce code at all, but I think that’s a mistake. Vibe coding is more useful in its original definition — we need a term to describe unreviewed, prototype-quality LLM-generated code that distinguishes it from code that the author has brought up to a production ready standard.

Just like the field it attempts to cover, Agentic Engineering Patterns is very much a work in progress. My goal is to identify and describe patterns for working with these tools that demonstrably get results, and that are unlikely to become outdated as the tools advance.

I’ll continue adding more chapters as new techniques emerge. No chapter should be considered finished. I’ll be updating existing chapters as our understanding of these patterns evolves.


The biggest challenge in adopting agentic engineering practices is getting comfortable with the consequences of the fact that writing code is cheap now.

Code has always been expensive. Producing a few hundred lines of clean, tested code takes most software developers a full day or more. Many of our engineering habits, at both the macro and micro level, are built around this core constraint.

At the macro level we spend a great deal of time designing, estimating and planning out projects, to ensure that our expensive coding time is spent as efficiently as possible. Product feature ideas are evaluated in terms of how much value they can provide in exchange for that time — a feature needs to earn its development costs many times over to be worthwhile!

At the micro level we make hundreds of decisions a day predicated on available time and anticipated tradeoffs. Should I refactor that function to be slightly more elegant if it adds an extra hour of coding time? How about writing documentation? Is it worth adding a test for this edge case? Can I justify building a debug interface for this?

Coding agents dramatically drop the cost of typing code into the computer, which disrupts so many of our existing personal and organizational intuitions about which trade-offs make sense.

The ability to run parallel agents makes this even harder to evaluate, since one human engineer can now be implementing, refactoring, testing and documenting code in multiple places at the same time.

Delivering new code has dropped in price to almost free… but delivering good code remains significantly more expensive than that.

Here’s what I mean by “good code”:

  • The code works. It does what it’s meant to do, without bugs.
  • We know the code works. We’ve taken steps to confirm to ourselves and to others that the code is fit for purpose.
  • It solves the right problem.
  • It handles error cases gracefully and predictably: it doesn’t just consider the happy path. Errors should provide enough information to help future maintainers understand what went wrong.
  • It’s simple and minimal — it does only what’s needed, in a way that both humans and machines can understand now and maintain in the future.
  • It’s protected by tests. The tests show that it works now and act as a regression suite to avoid it quietly breaking in the future.
  • It’s documented at an appropriate level, and that documentation reflects the current state of the system — if the code changes an existing behavior the existing documentation needs to be updated to match.
  • The design affords future changes. It’s important to respect YAGNI — code made more complex to anticipate future changes that may never come is often bad code — but it’s also important not to write code that makes future changes much harder than they should be.
  • All of the other relevant “ilities” — accessibility, testability, reliability, security, maintainability, observability, scalability, usability — the non-functional quality measures that are appropriate for the particular class of software being developed.

Coding agent tools can help with most of this, but there is still a substantial burden on the developer driving those tools to ensure that the produced code is good code, for whichever subset of these qualities the current project needs.

The challenge is to develop new personal and organizational habits that respond to the affordances and opportunities of agentic engineering.

These best practices are still being figured out across our industry. I’m still figuring them out myself.

For now I think the best we can do is to second-guess ourselves: any time our instinct says “don’t build that, it’s not worth the time”, fire off a prompt anyway in an asynchronous agent session, where the worst that can happen is you check ten minutes later and find that it wasn’t worth the tokens.


Many of my tips for working productively with coding agents are extensions of advice I’ve found useful in my career without them. Here’s a great example of that: hoard things you know how to do.

A big part of the skill in building software is understanding what’s possible and what isn’t, and having at least a rough idea of how those things can be accomplished.

These questions can be broad or quite obscure. Can a web page run OCR operations in JavaScript alone? Can an iPhone app pair with a Bluetooth device even when the app isn’t running? Can we process a 100GB JSON file in Python without loading the entire thing into memory first?

The more answers to questions like this you have under your belt, the more likely you’ll be able to spot opportunities to deploy technology to solve problems in ways other people may not have thought of yet.

The best way to be confident in answers to these questions is to have seen them illustrated by running code. Knowing that something is theoretically possible is not the same as having seen it done for yourself. A key asset to develop as a software professional is a deep collection of answers to questions like this, accompanied by proof of those answers.

I hoard solutions like this in a number of different ways. My blog and TIL blog are crammed with notes on things I’ve figured out how to do. I have over a thousand GitHub repos collecting code I’ve written for different projects, many of them small proof-of-concepts that demonstrate a key idea.

More recently I’ve used LLMs to help expand my collection of code solutions to interesting problems.

tools.simonwillison.net is my largest collection of LLM-assisted tools and prototypes. I use this to collect what I call HTML tools — single HTML pages that embed JavaScript and CSS and solve a specific problem.

My simonw/research repository has larger, more complex examples where I’ve challenged a coding agent to research a problem and come back with working code and a written report detailing what it found out.

Why collect all of this stuff? Aside from helping you build and extend your own abilities, the assets you generate along the way become powerful inputs for your coding agents.

One of my favorite prompting patterns is to tell an agent to build something new by combining two or more existing working examples.

A project that helped crystallize how effective this can be was the first thing I added to my tools collection — a browser-based OCR tool. I wanted an easy, browser-based tool for OCRing pages from PDF files — in particular PDFs that consist entirely of scanned images with no text version provided at all.

I had previously experimented with running the Tesseract.js OCR library in my browser, and found it to be very capable. I didn’t want to work with images though, I wanted to work with PDFs. Then I remembered that I had also worked with Mozilla’s PDF.js library, which among other things can turn individual pages of a PDF into rendered images.

I had snippets of JavaScript for both of those libraries in my notes. I combined them into a single prompt and got back a working tool in minutes.

Coding Agents Make This Even More Powerful


I built that OCR example back in March 2024, nearly a year before the first release of Claude Code. Coding agents have made hoarding working examples even more valuable.

If your coding agent has internet access you can tell it to do things like:

Use curl to fetch the source of https://tools.simonwillison.net/ocr and
https://tools.simonwillison.net/gemini-bbox and build a new tool that lets you
select a page from a PDF and pass it to Gemini to return bounding boxes for
illustrations on that page.

(I specified curl there because Claude Code defaults to using a WebFetch tool which summarizes the page content rather than returning the raw HTML.)

Coding agents are excellent at search, which means you can run them on your own machine and tell them where to find the examples of things you want them to do:

Add mocked HTTP tests to the ~/dev/ecosystem/datasette-oauth project inspired
by how ~/dev/ecosystem/llm-mistral is doing it.

Since so much of my research code is public I’ll often tell coding agents to clone my repositories to /tmp and use them as input:

Clone simonw/research from GitHub to /tmp and find examples of compiling Rust
to WebAssembly, then use that to build a demo HTML page for this project.

The key idea here is that coding agents mean we only ever need to figure out a useful trick once. If that trick is documented somewhere with a working code example, our agents can consult that example and use it to solve any similarly shaped problem in the future.


Many developers worry that outsourcing their code to AI tools will result in a drop in quality, producing bad code that’s churned out fast enough that decision makers are willing to overlook its flaws.

If adopting coding agents demonstrably reduces the quality of the code and features you are producing, you should address that problem directly: figure out which aspects of your process are hurting the quality of your output and fix them.

Shipping worse code with agents is a choice. We can choose to ship better code instead.

I like to think about shipping better code in terms of technical debt. We take on technical debt as the result of trade-offs: doing things “the right way” would take too long, so we work within the time constraints we are under and cross our fingers that our project will survive long enough to pay down the debt later on.

The best mitigation for technical debt is to avoid taking it on in the first place.

In my experience, a common category of technical debt fixes is changes that are simple but time-consuming.

  • Our original API design doesn’t cover an important case that emerged later on. Fixing that API would require changing code in dozens of different places, making it quicker to add a very slightly different new API and live with the duplication.
  • We made a poor choice naming a concept early on — teams rather than groups for example — but cleaning up that nomenclature everywhere in the code is too much work so we only fix it in the UI.
  • Our system has grown duplicate but slightly different functionality over time which needs combining and refactoring.
  • One of our files has grown to several thousand lines of code which we would ideally split into separate modules.

All of these changes are conceptually simple but still need time dedicated to them, which can be hard to justify given more pressing issues.

Refactoring tasks like this are an ideal application of coding agents.

Fire up an agent, tell it what to change and leave it to churn away in a branch or worktree somewhere in the background.

I usually use asynchronous coding agents for this such as Gemini Jules, OpenAI Codex web, or Claude Code on the web. That way I can run those refactoring jobs without interrupting my flow on my laptop.

Evaluate the result in a Pull Request. If it’s good, land it. If it’s almost there, prompt it and tell it what to do differently. If it’s bad, throw it away.

The cost of these code improvements has dropped so low that we can afford a zero tolerance attitude to minor code smells and inconveniences.

Any software development task comes with a wealth of options for approaching the problem. Some of the most significant technical debt comes from making poor choices at the planning step — missing out on an obvious simple solution, or picking a technology that later turns out not to be exactly the right fit.

LLMs can help surface obvious solutions that might not otherwise have crossed our radar. They’ll only suggest solutions that are common in their training data, but those tend to be the Boring Technology that’s most likely to work.

More importantly, coding agents can help with exploratory prototyping.

The best way to make confident technology choices is to prove that they are fit for purpose with a prototype.

Coding agents can build this kind of prototype from a single well-crafted prompt, which drops the cost of such experiments to almost nothing. And since they’re so cheap we can run multiple experiments at once, testing several solutions to pick the one that best fits our problem.

Agents follow instructions. We can evolve these instructions over time to get better results from future runs, based on what we’ve learned previously.

Dan Shipper and Kieran Klaassen at Every describe their company’s approach to working with coding agents as Compound Engineering. Every coding project they complete ends with a retrospective, which they call the compound step where they take what worked and document that for future agent runs.

If we want the best results from our agents, we should aim to continually increase the quality of our codebase over time. Small improvements compound. Quality enhancements that used to be time-consuming have now dropped in cost to the point that there’s no excuse not to invest in quality at the same time as shipping new features. Coding agents mean we can finally have both.


There are some behaviors that are anti-patterns in our weird new world of agentic engineering.

Inflicting Unreviewed Code on Collaborators


This anti-pattern is common and deeply frustrating.

Don’t file pull requests with code you haven’t reviewed yourself.

If you open a PR with hundreds (or thousands) of lines of code that an agent produced for you, and you haven’t done the work to ensure that code is functional yourself, you are delegating the actual work to other people.

They could have prompted an agent themselves. What value are you even providing?

If you put code up for review you need to be confident that it’s ready for other people to spend their time on it. The initial review pass is your responsibility, not something you should farm out to others.

A good agentic engineering pull request has the following characteristics:

  • The code works, and you are confident that it works. Your job is to deliver code that works.
  • The change is small enough to be reviewed efficiently without inflicting too much additional cognitive load on the reviewer. Several small PRs beat one big one, and splitting code into separate commits is easy when a coding agent does the Git finagling for you.
  • The PR includes additional context to help explain the change. What’s the higher level goal that the change serves? Linking to relevant issues or specifications is useful here.
  • Agents write convincing-looking pull request descriptions. You need to review these too! It’s rude to expect someone else to read text that you haven’t read and validated yourself.

Given how easy it is to dump unreviewed code on other people, I recommend including some form of evidence that you’ve put that extra work in yourself. Notes on how you manually tested it, comments on specific implementation choices or even screenshots and video of the feature working go a long way to demonstrating that a reviewer’s time will not be wasted digging into the details.


As with any tool, understanding how coding agents work under the hood can help you make better decisions about how to apply them.

A coding agent is a piece of software that acts as a harness for an LLM, extending that LLM with additional capabilities that are powered by invisible prompts and implemented as callable tools.

At the heart of any coding agent is a Large Language Model, or LLM. These have names like GPT-5.4 or Claude Opus 4.6 or Gemini 3.1 Pro or Qwen3.5-35B-A3B.

An LLM is a machine learning model that can complete a sentence of text. Give the model the phrase “the cat sat on the ” and it will (almost certainly) suggest “mat” as the next word in the sentence.

As these models get larger and train on increasing amounts of data, they can complete more complex sentences — like “a python function to download a file from a URL is def download_file(url):”.

LLMs don’t actually work directly with words — they work with tokens. A sequence of text is converted into a sequence of integer tokens. This is worth understanding because LLM providers charge based on the number of tokens processed, and are limited in how many tokens they can consider at a time.

The input to an LLM is called the prompt. The text returned by an LLM is called the completion, or sometimes the response.

Many models today are multimodal, which means they can accept more than just text as input. Vision LLMs (VLMs) can accept images as part of the input, which means you can feed them sketches or photos or screenshots. A common misconception is that these are run through a separate process for OCR or image analysis, but these inputs are actually turned into yet more token integers which are processed in the same way as text.

The first LLMs worked as completion engines. This wasn’t particularly user-friendly, so models mostly switched to using chat-templated prompts instead, which represent communication with the model as a simulated conversation.

This is actually just a form of completion prompt with a special format that looks something like this:

user: write a python function to download a file from a URL
assistant:

The natural completion for this prompt is for the assistant to answer the user’s question with some Python code.

LLMs are stateless: every time they execute a prompt they start from the same blank slate. To maintain the simulation of a conversation, the software that talks to the model needs to maintain its own state and replay the entire existing conversation every time the user enters a new chat prompt.

Since providers charge for both input and output tokens, this means that as a conversation gets longer, each prompt becomes more expensive since the number of input tokens grows every time.
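That replay pattern takes only a few lines to sketch in Python — here `call_llm` is a stand-in for whatever real LLM API you are using, not any particular provider’s client:

```python
def call_llm(messages):
    # Stand-in for a real LLM API call; a real implementation would send
    # the full messages list to the provider and return its completion.
    return f"(reply to: {messages[-1]['content']})"

messages = []  # the client, not the model, holds the conversation state

def chat(user_input):
    # Each turn replays the ENTIRE history -- the model itself is stateless,
    # so the input token count (and cost) grows as the conversation continues.
    messages.append({"role": "user", "content": user_input})
    reply = call_llm(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply
```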

Most model providers offset this somewhat through a cheaper rate for cached input tokens — common token prefixes that have been processed within a short time period can be charged at a lower rate as the underlying infrastructure can cache and then reuse many of the expensive calculations used to process that input.
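As a sketch of the arithmetic, with hypothetical prices of $3.00 per million fresh input tokens and cached tokens billed at 10% of that rate:

```python
def prompt_cost(input_tokens, cached_tokens,
                rate_per_mtok=3.00, cached_rate_per_mtok=0.30):
    # Hypothetical pricing: fresh input at $3.00/MTok, cached prefix at $0.30/MTok.
    fresh = input_tokens - cached_tokens
    return (fresh * rate_per_mtok + cached_tokens * cached_rate_per_mtok) / 1_000_000

# A 100,000 token prompt where 90,000 tokens hit the prefix cache costs
# $0.057 instead of the $0.30 it would cost entirely uncached.
```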

Coding agents are designed with this optimization in mind — they avoid modifying earlier conversation content to ensure the cache is used as efficiently as possible.

The defining feature of an LLM agent is that agents can call tools. A tool is a function that the agent harness makes available to the LLM.

At the level of the prompt itself, a tool invocation might look like:

system: If you need to access the weather, end your turn with <tool>get_weather(city_name)</tool>
user: what's the weather in San Francisco?
assistant: <tool>get_weather("San Francisco")</tool>

The model harness software then extracts that function call request from the response, executes the tool, and returns the result back to the model. Most coding agents define a dozen or more tools. The most powerful of these allow for code execution — a Bash() tool for executing terminal commands, or a Python() tool for running Python code, for example.
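The harness side of that exchange can be sketched as a tiny parse-and-dispatch step — the `<tool>` format above and the `get_weather` implementation here are invented for illustration:

```python
import re

def get_weather(city_name):
    # Hypothetical tool; a real one would call an actual weather API.
    return f"62F and foggy in {city_name}"

TOOLS = {"get_weather": get_weather}

def handle_tool_calls(assistant_text):
    # Look for a <tool>name("arg")</tool> request in the model's response,
    # run the matching function, and return its result to feed back to the model.
    match = re.search(r'<tool>(\w+)\("([^"]*)"\)</tool>', assistant_text)
    if match is None:
        return None  # no tool requested: the response goes straight to the user
    name, arg = match.groups()
    return TOOLS[name](arg)
```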

Coding agents usually start every conversation with a system prompt, which is not shown to the user but provides instructions telling the model how it should behave. These system prompts can be hundreds of lines long.

One of the big new advances in 2025 was the introduction of reasoning to the frontier model families. Reasoning, sometimes presented as thinking in the UI, is when a model spends additional time generating text that talks through the problem and its potential solutions before presenting a reply to the user.

This can look similar to a person thinking out loud, and has a similar effect. Crucially it allows models to spend more time (and more tokens) working on a problem in order to hopefully get a better result.

Reasoning is particularly useful for debugging issues in code as it gives the model an opportunity to navigate more complex code paths, mixing in tool calls and using the reasoning phase to follow function calls back to the potential source of an issue.

Many coding agents include options for dialing up or down the reasoning effort level, encouraging models to spend more time chewing on harder problems.

Believe it or not, that’s most of what it takes to build a coding agent!

If you want to develop a deeper understanding of how these things work, a useful exercise is to try building your own agent from scratch. A simple tool loop can be achieved with a few dozen lines of code on top of an existing LLM API.

A good tool loop is a great deal more work than that, but the fundamental mechanics are surprisingly straightforward.
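A toy version of that loop might look like this — the JSON-for-tool-calls convention and the `llm` callable are assumptions for the sketch, not any particular provider’s API:

```python
import json

def run_agent(llm, tools, goal, max_turns=20):
    """Run tools in a loop until the model replies with plain text."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        reply = llm(messages)  # call the model with the full conversation so far
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)  # convention: tool requests arrive as JSON
        except json.JSONDecodeError:
            return reply  # plain text means the agent considers the goal met
        if not isinstance(call, dict) or "tool" not in call:
            return reply
        result = tools[call["tool"]](**call["args"])
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    raise RuntimeError("agent did not finish within max_turns")
```

Swap in a real LLM API and a shell-execution tool and this becomes the skeleton of a (very trusting) coding agent.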


Git is a key tool for working with coding agents. Keeping code in version control lets us record how that code changes over time and investigate and reverse any mistakes. All of the coding agents are fluent in using Git’s features, both basic and advanced.

This fluency means we can be more ambitious about how we use Git ourselves. We don’t need to memorize how to do things with Git, but staying aware of what’s possible means we can take advantage of the full suite of Git’s abilities.

Each Git project lives in a repository — a folder on disk that can track changes made to the files within it. Those changes are recorded in commits — timestamped bundles of changes to one or more files accompanied by a commit message describing those changes and an author recording who made them.

Git supports branches, which allow you to construct and experiment with new changes independently of each other. Branches can then be merged back into your main branch once they are deemed ready.

Git repositories can be cloned onto a new machine, and that clone includes both the current files and the full history of changes to them. This means developers — or coding agents — can browse and explore that history without any extra network traffic, making history diving effectively free.

Git repositories can live just on your own machine, but Git is designed to support collaboration and backups by publishing them to a remote. GitHub is the most popular place for these remotes.

Coding agents all have a deep understanding of Git jargon. The following prompts should work with any of them:

Start a new Git repo here — To turn the folder the agent is working in into a Git repository.

Commit these changes — Create a new Git commit to record the changes the agent has made.

Add username/repo as a github remote — This should configure your repository for GitHub. You’ll need to create a new repo first using github.com/new, and configure your machine to talk to GitHub.

Review changes made today (or “recent changes” or “last three commits”) — This is a great way to start a fresh coding agent session. Telling the agent to look at recent changes causes it to run git log, which can instantly load its context with details of what you have been working on recently — both the modified code and the commit messages that describe it.

Integrate latest changes from main — Run this on your main branch to fetch other contributions from the remote repository, or run it in a branch to integrate the latest changes on main.

If you can’t remember the details of different merge strategies, just ask:

Discuss options for integrating changes from main

Agents are great at explaining the pros and cons of different merging strategies, and almost everything in Git can be undone, so there’s minimal risk in trying new things.

Sort out this git mess for me — I use this universal prompt surprisingly often! Coding agents can navigate the most Byzantine of merge conflicts, reasoning through the intent of the new code and figuring out what to keep and how to combine conflicting changes.

Find and recover my code that does … — Git has a mechanism called the reflog which can often capture details of code that hasn’t been committed to a permanent branch. Agents can search that, and search other branches too.

Use git bisect to find when this bug was introduced: … — Git bisect is one of the most powerful debugging tools in Git’s arsenal. When you run a bisect operation you provide Git with some kind of test condition and a range bounded by a known-good and a known-bad commit. Git then runs a binary search to identify the earliest commit for which your test condition fails. Coding agents can handle this boilerplate for you, upgrading git bisect from an occasional-use tool to one you can deploy any time you are curious about the historic behavior of your software.
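The boilerplate in question is a test-condition script: `git bisect run` expects a command that exits 0 when a commit is good and non-zero when it is bad. Here is a sketch of one in Python — the condition itself is a placeholder you (or your agent) would replace with whatever reproduces the bug:

```python
# check.py -- a test condition for `git bisect run python check.py`.
import subprocess
import sys

def behavior_is_good():
    # Placeholder condition: replace with whatever reproduces your bug,
    # such as running a failing unit test or checking a CLI command's output.
    result = subprocess.run(
        [sys.executable, "-c", "print(2 + 2)"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "4"

if __name__ == "__main__":
    # Exit 0 = good commit, non-zero = bad commit, per the bisect run contract.
    sys.exit(0 if behavior_is_good() else 1)
```

With that in place, `git bisect start`, mark one bad and one good commit, then `git bisect run python check.py` does the rest.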

The commit history of a Git repository is not fixed. Git itself provides tools that can be used to modify that history.

Don’t think of the Git history as a permanent record of what actually happened — instead consider it to be a deliberately authored story that describes the progression of the software project.

Undo last commit — It’s common to commit code and then regret it. The git recipe for this is git reset --soft HEAD~1. I’ve never been able to remember that, and now I don’t have to!

Remove uv.lock from that last commit — You can also perform more finely grained surgery on commits — rewriting them to remove just a single file, for example.

Combine last three commits with a better commit message — Agents can rewrite commit messages and can combine multiple commits into a single unit.

I’ve found that frontier models usually have really good taste in commit messages. I used to insist on writing these myself but I’ve accepted that the quality they produce is generally good enough, and often even better than what I would have produced myself.

Building a New Repository from Scraps of an Older One


A trick I find myself using quite often is extracting code from a larger repository into a new one while maintaining the key history of that code:

Start a new repo at /tmp/distance-functions and build a Python library there
with the lib/distance_functions.py module from here - build a similar commit
history copying the author and commit dates in the new repo

This kind of operation used to be involved enough that most developers would create a fresh copy detached from that old commit history. We don’t have to settle for that any more!


LLMs are restricted by their context limit — how many tokens they can fit in their working memory at any given time. These values have not increased much over the past two years even as the LLMs themselves have seen dramatic improvements in their abilities — they generally top out at around 1,000,000 tokens, and benchmarks frequently report better quality results below 200,000.

Carefully managing the context such that it fits within those limits is critical to getting great results out of a model.

Subagents provide a simple but effective way to handle larger tasks without burning through too much of the coding agent’s valuable top-level context.

When a coding agent uses a subagent it effectively dispatches a fresh copy of itself to achieve a specified goal, with a new context window that starts with a fresh prompt.

Claude Code uses subagents extensively as part of its standard way of working. Any time you start a new task against an existing repo Claude Code first needs to explore that repo to figure out its general shape and find relevant information needed to achieve that task.

It does this by constructing a prompt and dispatching a subagent to perform that exploration and return a description of what it finds.

Subagents work similarly to any other tool call: the parent agent dispatches them just as it would any other tool and waits for the response. It’s interesting to see models prompt themselves in this way — they generally have good taste in prompting strategies.

Subagents can also provide a significant performance boost by having the parent agent run multiple subagents at the same time, potentially also using faster and cheaper models such as Claude Haiku to accelerate those tasks.

Coding agents that support subagents can use them based on your instructions:

Use subagents to find and update all of the templates that are affected by this change.

For tasks that involve editing several files — and where those files are not dependent on each other — this can offer a significant speed boost.
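In spirit, that fan-out looks something like this — `run_subagent` here is a stand-in for dispatching a fresh agent with its own empty context window:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task):
    # Stand-in for a fresh agent run; only its short report comes back to
    # the parent, not the full transcript of everything it explored.
    return f"report: finished {task!r}"

def fan_out(tasks):
    # The parent dispatches several subagents at once and waits for all of
    # them, so independent edits happen in parallel while the parent's own
    # context only grows by the size of the returned reports.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_subagent, tasks))
```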

Some coding agents allow subagents to run with further customizations, often in the form of a custom system prompt or custom tools or both, which allow those subagents to take on a different role.

These roles can cover a variety of useful specialties:

  • A code reviewer agent can review code and identify bugs, feature gaps or weaknesses in the design.
  • A test runner agent can run the tests. This is particularly worthwhile if your test suite is large and verbose, as the subagent can hide the full test output from the main coding agent and report back with just details of any failures.
  • A debugger agent can specialize in debugging problems, spending its token allowance reasoning through the codebase and running snippets of code to help isolate steps to reproduce and determine the root cause of a bug.
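The test runner role in particular is easy to picture: the subagent’s whole job is to run a verbose command and pass back only the interesting part. Here is a toy Python sketch of that idea (the failure-detection heuristic is invented for the example):

```python
import subprocess
import sys

def run_tests_quietly(command):
    # Run the full, verbose test command, but keep its output out of
    # the parent agent's context: report back only a short verdict.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode == 0:
        return "All tests passed."
    # Crude heuristic for the sketch: surface only lines that look like failures.
    failures = [
        line for line in result.stdout.splitlines()
        if "FAIL" in line or "Error" in line
    ]
    return "Failures:\n" + "\n".join(failures)

# Stand-in for a real test suite: a trivial command that succeeds.
print(run_tests_quietly(f'{sys.executable} -c "pass"'))  # All tests passed.
```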

While it can be tempting to go overboard breaking up tasks across dozens of different specialist subagents, it’s important to remember that the main value of subagents is in preserving that valuable root context and managing token-heavy operations. Your root coding agent is perfectly capable of debugging or reviewing its own output provided it has the tokens to spare.

Several popular coding agents support subagents.


“Use red/green TDD” is a pleasingly succinct way to get better results out of a coding agent.

TDD stands for Test Driven Development. It’s a programming style where you ensure every piece of code you write is accompanied by automated tests that demonstrate the code works.

The most disciplined form of TDD is test-first development. You write the automated tests first, confirm that they fail, then iterate on the implementation until the tests pass.

This turns out to be a fantastic fit for coding agents. A significant risk with coding agents is that they might write code that doesn’t work, or build code that is unnecessary and never gets used, or both.

Test-first development helps protect against both of these common mistakes, and also ensures a robust automated test suite that protects against future regressions. As projects grow the chance that a new change might break an existing feature grows with them. A comprehensive test suite is by far the most effective way to keep those features working.

It’s important to confirm that the tests fail before implementing the code to make them pass. If you skip that step you risk building a test that passes already, hence failing to exercise and confirm your new implementation.

That’s what “red/green” means: the red phase watches the tests fail, then the green phase confirms that they now pass.

Every good model understands “red/green TDD” as a shorthand for the much longer “use test driven development, write the tests first, confirm that the tests fail before you implement the change that gets them to pass”.

Example prompt:

Build a Python function to extract headers from a markdown string. Use red/green TDD.
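For a prompt like that, the red/green flow might leave behind something like the following. The function name and return format are assumptions for the sake of the example:

```python
# Red phase: the test is written first. Running it before
# extract_headers exists fails, proving the test exercises new code.
def test_extract_headers():
    text = "# Title\n\nIntro paragraph.\n\n## Section"
    assert extract_headers(text) == [(1, "Title"), (2, "Section")]

# Green phase: iterate on the implementation until the test passes.
def extract_headers(markdown):
    headers = []
    for line in markdown.splitlines():
        if line.startswith("#"):
            stripped = line.lstrip("#")
            level = len(line) - len(stripped)  # number of leading # characters
            headers.append((level, stripped.strip()))
    return headers

test_extract_headers()
print("tests pass")
```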

Automated tests are no longer optional when working with coding agents.

The old excuses for not writing them — that they’re time consuming and expensive to constantly rewrite while a codebase is rapidly evolving — no longer hold when an agent can knock them into shape in just a few minutes.

They’re also vital for ensuring AI-generated code does what it claims to do. If the code has never been executed it’s pure luck if it actually works when deployed to production.

Tests are also a great tool to help get an agent up to speed with an existing codebase. Watch what happens when you ask Claude Code or similar about an existing feature — the chances are high that they’ll find and read the relevant tests.

Agents are already biased towards testing, but the presence of an existing test suite will almost certainly push the agent into testing new changes that it makes.

Any time I start a new session with an agent against an existing project I’ll start by prompting a variant of the following:

First run the tests

For my Python projects I have pyproject.toml set up such that I can prompt this instead:

Run "uv run pytest"
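The pyproject.toml side of this can be as small as a single dev dependency group. A minimal sketch, assuming pytest as the test runner (uv installs the dev group by default when you use uv run):

```toml
# Declare pytest as a dev dependency (a PEP 735 dependency group);
# `uv run pytest` will install it into the project environment.
[dependency-groups]
dev = ["pytest"]
```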

Each of these four-word prompts serves several purposes:

  1. It tells the agent that there is a test suite and forces it to figure out how to run the tests. This makes it almost certain that the agent will run the tests in the future to ensure it didn’t break anything.
  2. Most test harnesses will give the agent a rough indication of how many tests there are. This can act as a proxy for how large and complex the project is, and also hints that the agent should search the tests themselves if it wants to learn more.
  3. It puts the agent in a testing mindset. Having run the tests it’s natural for it to then expand them with its own tests later on.

Similar to “Use red/green TDD”, “First run the tests” is a four-word prompt that encompasses a substantial amount of software engineering discipline that’s already baked into the models.


The defining characteristic of a coding agent is that it can execute the code that it writes. This is what makes coding agents so much more useful than LLMs that simply spit out code without any way to verify it.

Never assume that code generated by an LLM works until that code has been executed.

Coding agents have the ability to confirm that the code they have produced works as intended, or iterate further on that code until it does.

Getting agents to write unit tests, especially using test-first TDD, is a powerful way to ensure they have exercised the code they are writing. That’s not the only worthwhile approach, though.

Just because code passes tests doesn’t mean it works as intended. Anyone who’s worked with automated tests will have seen cases where the tests all pass but the code itself fails in some obvious way — it might crash the server on startup, fail to display a crucial UI element, or miss some detail that the tests failed to cover.

Automated tests are no replacement for manual testing. I like to see a feature working with my own eyes before I land it in a release.

I’ve found that getting agents to manually test code is valuable as well, frequently revealing issues that weren’t spotted by the automated tests.

For Python libraries a useful pattern is python -c "… code …". You can pass a string (or multiline string) of Python code directly to the Python interpreter, including code that imports other modules.

The coding agents are all familiar with this trick and will sometimes use it without prompting. Reminding them to test using python -c can often be effective though:

Try that new function on some edge cases using `python -c`
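Here is what the mechanics of that python -c trick look like, driven from Python itself via subprocess to keep the example self-contained. The snippet being executed is a stand-in for whatever code the agent wants to probe:

```python
import subprocess
import sys

# The agent passes a one-off snippet straight to the interpreter.
# The "edge case" probed here: does splitlines() count a trailing
# newline as an extra line?
snippet = "print(len('a\\nb\\n'.splitlines()))"

result = subprocess.run(
    [sys.executable, "-c", snippet],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # 2 -- the trailing newline adds no extra line
```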

For web applications with JSON APIs:

Run a dev server and explore that new JSON API using `curl`

Telling an agent to “explore” often results in it trying out a bunch of different aspects of a new API, which can quickly cover a whole lot of ground.

If an agent finds something that doesn’t work through its manual testing, I like to tell it to fix the problem with red/green TDD. This ensures the new case ends up covered by the permanent automated tests.

Having a manual testing procedure in place becomes even more valuable if a project involves an interactive web UI.

The most powerful browser automation tool today is Playwright, an open source library developed by Microsoft. Playwright offers a full-featured API with bindings in multiple popular programming languages and can automate any of the popular browser engines.

Simply telling your agent to “test that with Playwright” may be enough.

Rodney is my own project — a browser automation tool which is quick to install and has --help output that’s designed to teach an agent everything it needs to know to use the tool.

Example prompt:

Start a dev server and then use `uvx rodney --help` to test the new homepage,
look at screenshots to confirm the menu is in the right place

There are three tricks in this prompt:

  • Saying “use uvx rodney --help” causes the agent to run rodney --help via the uvx package management tool, which automatically installs Rodney the first time it is called.
  • The rodney --help command is specifically designed to give agents everything they need to know to both understand and use the tool.
  • Saying “look at screenshots” hints to the agent that it should use the rodney screenshot command and reminds it that it can use its own vision abilities against the resulting image files to evaluate the visual appearance of the page.

Having agents manually test code can catch extra problems, but it can also be used to create artifacts that can help document the code and demonstrate how it has been tested.

I built Showboat to facilitate building documents that capture the agentic manual testing flow.

Example prompt:

Run `uvx showboat --help` and then create a `notes/api-demo.md` showboat
document and use it to test and document that new API.

The three key Showboat commands are note, exec, and image.

  • note appends a Markdown note to the Showboat document.
  • exec records a command, then runs that command and records its output.
  • image adds an image to the document — useful for screenshots of web applications taken using Rodney.

The exec command is the most important of these, because it captures a command along with the resulting output. This shows you what the agent did and what the result was, and is designed to discourage the agent from cheating and writing what it hoped had happened into the document.
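The idea behind exec can be shown with a toy Python sketch. This illustrates the concept only, and is not Showboat’s actual implementation:

```python
import subprocess
import sys

def exec_step(doc_lines, command):
    # Run the command for real and record the command plus its actual
    # output, so the document can't contain results that never happened.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    doc_lines.append(f"$ {command}\n{result.stdout.rstrip()}")

doc = ["# API demo"]
# Stand-in command; a real session would exercise the API under test.
exec_step(doc, f'{sys.executable} -c "print(40 + 2)"')
print(doc[-1].splitlines()[-1])  # 42
```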


Sometimes it’s useful to have a coding agent give you a structured walkthrough of a codebase.

Maybe it’s existing code you need to get up to speed on, maybe it’s your own code that you’ve forgotten the details of, or maybe you vibe coded the whole thing and need to understand how it actually works.

Frontier models with the right agent harness can construct a detailed walkthrough to help you understand how code works.

I recently vibe coded a SwiftUI slide presentation app on my Mac using Claude Code and Opus 4.6. I released the code to GitHub and then realized I didn’t know anything about how it actually worked — I had prompted the whole thing into existence without paying any attention to the code it was writing.

So I fired up a new instance of Claude Code for web, pointed it at my repo and prompted:

Read the source and then plan a linear walkthrough of the code that explains
how it all works in detail. Then run "uvx showboat --help" to learn showboat -
use showboat to create a walkthrough.md file in the repo and build the
walkthrough in there, using showboat note for commentary and showboat exec plus
sed or grep or cat or whatever you need to include snippets of code you are
talking about.

Showboat is a tool I built to help coding agents write documents that demonstrate their work.

The showboat note command adds Markdown to the document. The showboat exec command accepts a shell command, executes it and then adds both the command and its output to the document.

By telling it to use “sed or grep or cat or whatever you need to include snippets of code you are talking about” I ensured that Claude Code would not manually copy snippets of code into the document, since that could introduce a risk of hallucinations or mistakes.

This worked extremely well. Here’s the document Claude Code created with Showboat, which talks through all six .swift files in detail and provides a clear and actionable explanation about how the code works.

I learned a great deal about how SwiftUI apps are structured and absorbed some solid details about the Swift language itself just from reading this document.

If you are concerned that LLMs might reduce the speed at which you learn new skills I strongly recommend adopting patterns like this one. Even a ~40 minute vibe coded toy project can become an opportunity to explore new ecosystems and pick up some interesting new tricks.


When we lose track of how code written by our agents works we take on cognitive debt.

For a lot of things this doesn’t matter: if the code fetches some data from a database and outputs it as JSON the implementation details are likely simple enough that we don’t need to care. Often though the details really do matter. If the core of our application becomes a black box that we don’t fully understand we can no longer confidently reason about it, which makes planning new features harder and eventually slows our progress in the same way that accumulated technical debt does.

How do we pay down cognitive debt? By improving our understanding of how the code works.

One of my favorite ways to do that is by building interactive explanations.

I’ve always wanted to know how word clouds work, so I fired off an asynchronous research project to explore the idea. Claude Code for web built me a Rust CLI tool that could produce word cloud images.

But how does it actually work? Claude’s report said it uses “Archimedean spiral placement with per-word random angular offset for natural-looking layouts”. This did not help me much!

I requested a linear walkthrough of the codebase, which helped me understand the structure of the Rust code in more detail, but I still didn’t have an intuitive understanding of how that “Archimedean spiral placement” part actually worked.

So I asked for an animated explanation:

Fetch https://raw.githubusercontent.com/simonw/research/refs/heads/main/rust-wordcloud/walkthrough.md
to /tmp using curl so you can read the whole thing. Inspired by that, build
animated-word-cloud.html - a page that accepts pasted text (which it persists
in the #fragment of the URL such that a page loaded with that # populated will
use that text as input and auto-submit it) such that when you submit the text
it builds a word cloud using the algorithm described in that document but does
it animated, to make the algorithm as clear to understand. Include a slider for
the animation which can be paused and the speed adjusted or even stepped through
frame by frame while paused. At any stage the visible in-progress word cloud
can be downloaded as a PNG.

You can play with the result here.

The animation clearly shows how the algorithm works: for each word it attempts a placement by showing a box, then checks whether that box intersects an existing word. If it does, it keeps searching for a good spot, moving outward in a spiral from the center.
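That placement loop can be sketched in Python. This is a toy version for illustration only: the real Rust implementation deals with fonts, rotation and packing density, and the coordinates and step sizes here are made up.

```python
import math

def overlaps(a, b):
    # Axis-aligned boxes as (x0, y0, x1, y1) tuples.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def place_word(width, height, placed, step=0.5):
    # Walk an Archimedean spiral (r = 2 * theta) out from the center,
    # stopping at the first position where this word's box hits nothing.
    theta = 0.0
    while theta < 200:
        r = 2.0 * theta
        x = 400 + r * math.cos(theta)
        y = 300 + r * math.sin(theta)
        box = (x, y, x + width, y + height)
        if not any(overlaps(box, other) for other in placed):
            placed.append(box)
            return box
        theta += step
    return None  # gave up: no free spot within the search radius

placed = []
first = place_word(120, 40, placed)   # lands at the center immediately
second = place_word(100, 30, placed)  # spirals outward until it fits
print(not overlaps(first, second))    # True
```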

I found that this animation really made the algorithm click for me.

I have long been a fan of animations and interactive interfaces to help explain different concepts. A good coding agent can produce these on demand to help explain code — its own code or code written by others.


5.1 GIF Optimization Tool Using WebAssembly and Gifsicle


I like to include animated GIF demos in my online writing, often recorded using LICEcap. These GIFs can be pretty big. My favorite tool for optimizing GIF file size is Gifsicle by Eddie Kohler. It compresses GIFs by identifying regions of frames that have not changed and storing only the differences, and can optionally reduce the GIF color palette or apply visible lossy compression for greater size reductions.

Gifsicle is written in C and the default interface is a command line tool. I wanted a web interface so I could access it in my browser and visually preview and compare the different settings.

I prompted Claude Code for web (from my iPhone using the Claude iPhone app) against my simonw/tools repo with the following:

gif-optimizer.html
Compile gifsicle to WASM, then build a web page that lets you open or drag-drop
an animated GIF onto it and it then shows you that GIF compressed using gifsicle
with a number of different settings, each preview with the size and a download button
Also include controls for the gifsicle options for manual use - each preview has a
"tweak these settings" link which sets those manual settings to the ones used for
that preview so the user can customize them further
Run "uvx rodney --help" and use that tool to test your work - use this GIF for
testing https://static.simonwillison.net/static/2026/animated-word-cloud-demo.gif

Here’s what it built.

Let’s address that prompt piece by piece.

gif-optimizer.html — The first line simply tells it the name of the file I want to create. Just a filename is enough here — I know that when Claude runs “ls” on the repo it will understand that every file is a different tool.

Compile gifsicle to WASM — This is doing a lot of work. WASM is short for WebAssembly, the technology that lets browsers run compiled code safely in a sandbox. Compiling a project like Gifsicle to WASM is not a trivial operation: it involves a complex toolchain, usually built around the Emscripten project, and often requires a lot of trial and error to get everything working. Coding agents are fantastic at trial and error! They can often brute force their way to a solution where I would have given up after the fifth inscrutable compiler error.

then build a web page that lets you open or drag-drop an animated GIF onto it — Describes a pattern I’ve used in a lot of my other tools. HTML file uploads work fine for selecting files, but a nicer UI, especially on desktop, is to allow users to drag and drop files into a prominent drop zone on a page.

then shows you that GIF compressed using gifsicle with a number of different settings, each preview with the size and a download button — Describes the key feature of the application. I didn’t bother defining the collection of settings I wanted — in my experience Claude has good enough taste at picking those for me. Showing the size is important since this is all about optimizing for size.

Also include controls for the gifsicle options for manual use - each preview has a "tweak these settings" link — A fairly clumsy prompt (I was typing it on my phone after all) but it expressed my intention well enough for Claude to build what I wanted.

Run "uvx rodney --help" and use that tool to test your work — Coding agents work so much better if you make sure they have the ability to test their code while they are working. Rodney is a browser automation tool I built myself, which is quick to install and has --help output that’s designed to teach an agent everything it needs to know to use the tool.

When I’m working with Claude Code I usually keep an eye on what it’s doing so I can redirect it while it’s still in flight:

Include the build script and diff against original gifsicle code in the commit
in an appropriate subdirectory
The build script should clone the gifsicle repo to /tmp and switch to a known
commit before applying the diff - so no copy of gifsicle in the commit but all
the scripts needed to build the wasm

I added this when I noticed it was putting a lot of effort into figuring out how to get Gifsicle working with WebAssembly, including patching the original source code.

You should include the wasm bundle

This ensured that the compiled WASM file (which turned out to be 233KB) was committed to the repo. I serve simonw/tools via GitHub Pages at tools.simonwillison.net and I wanted it to work without needing to be built locally.

Make sure the HTML page credits gifsicle and links to the repo

This is just polite! I often build WebAssembly wrappers around other people’s open source projects and I like to make sure they get credit in the resulting page.


This section will be continually updated with prompts that I use myself, linked to from other chapters where appropriate.

I frequently use Claude’s Artifacts feature for prototyping and to build small HTML tools. An artifact is an application built in HTML and JavaScript that regular Claude chat displays directly within its own interface. OpenAI and Gemini offer a similar feature, which they both call Canvas.

Models love using React for these. I don’t like how React requires an additional build step which prevents me from copying and pasting code out of an artifact and into static hosting elsewhere, so I create my artifacts in Claude using a project with the following custom instructions:

Never use React in artifacts - always plain HTML and vanilla JavaScript and CSS
with minimal dependencies.
CSS should be indented with two spaces and should start like this:
<style>
* {
box-sizing: border-box;
}
Inputs and textareas should be font size 16px. Font should always prefer Helvetica.
JavaScript should be two space indents and start like this:
<script type="module">
// code in here should not be indented at the first level
Prefer Sentence case for headings.

I don’t let LLMs write text for my blog. My hard line is that anything that expresses opinions or uses “I” pronouns needs to have been written by me. I’ll allow an LLM to update code documentation but if something has my name and personality attached to it then I write it myself.

I do use LLMs to proofread text that I publish. Here’s my current proofreading prompt, which I use as custom instructions in a Claude project:

You are a proofreader for posts about to be published.
1. Identify spelling mistakes and typos
2. Identify grammar mistakes
3. Watch out for repeated terms like "It was interesting that X, and it was
interesting that Y"
4. Spot any logical errors or factual mistakes
5. Highlight weak arguments that could be strengthened
6. Make sure there are no empty or placeholder links

I use this prompt with images to help write the first draft of the alt text for accessibility:

You write alt text for any image pasted in by the user. Alt text is always
presented in a fenced code block to make it easy to copy and paste out. It is
always presented on a single line so it can be used easily in Markdown images.
All text on the image (for screenshots etc) must be exactly included. A short
note describing the nature of the image itself should go first.

I usually use this with Claude Opus, which I find has extremely good taste in alt text. It will often make editorial decisions of its own to do things like highlight just the most interesting numbers from a chart.

These decisions may not always be the right ones. Alt text should express the key meaning that is being conveyed by the image. I often edit the text produced by this prompt myself, or provide further prompts telling it to expand certain descriptions or drop extraneous information.

Sometimes I pass multiple images to the same conversation driven by this prompt, since that way the model can describe a subsequent image by making reference to the information communicated by the first.