ngrok.com - Lessons from my overly-introspective, self-improving coding agent
A year or two ago, everyone was building coding agents. Now everyone's building coding agents that modify themselves… and I wanted to join the fun and ask:
What happens when you tell a coding agent to think about what it's done and do better next time?
So, I built bmo: a self-improving coding agent, and then used it (almost) exclusively as my coding agent for two weeks. It's been wildly nifty to me (like, take me back to tearing apart the family computer's partition to install Debian from a CD that came in the back of some book my friend bought at Borders Books kind of novel and nifty) and it's exposing a joy of computing that I haven't felt in quite a while.
Hereβs what I found.

BMO, from the TV show Adventure Time, replacing its own batteries
A Preamble on Bmo's Bootstraps
I wanted to design an agent harness on the principle of immediate action.
That starts with a basic agentic loop and access to three tools: run_command, load_skill, and reload_tools. I'd built other coding agents in the past and gave them access to more specific tools like write_file and list_cwd, but I've found that coding agents really only need access to shell commands to work as expected. I also wanted to give bmo a challenge: instead of using run_command "fresh" with every session, I wanted to see how it could optimize its own "harnesses" for safe and efficient use of common Linux tools.
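To make "shell access as the only primitive" concrete, here's a minimal, illustrative TypeScript sketch of what a run_command-style tool could look like. The `Tool` shape, timeout, and error handling are my assumptions, not bmo's actual implementation:

```typescript
import { execSync } from "node:child_process";

// Hypothetical shape for a tool the agentic loop can call.
interface Tool {
  name: string;
  description: string;
  run(args: Record<string, string>): string;
}

// One shell tool covers most coding-agent work: the model composes
// everything else out of ordinary commands.
const runCommand: Tool = {
  name: "run_command",
  description: "Run a shell command and return its combined output",
  run: ({ command }) => {
    try {
      return execSync(command, { encoding: "utf8", timeout: 30_000 });
    } catch (err) {
      // Surface failures as text so the model can read and react to them.
      return `command failed: ${(err as Error).message}`;
    }
  },
};
```

The interesting question in the rest of the post is what bmo learns to layer on top of this one primitive.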
Self-improvement happens across four loops. The first is a build it now directive that interrupts the task to build a tool immediately, add it to a hot-reloadable library, and use it right away. The second is active learning capture, logging corrections and preferences. The third is self-reflection at session end. The fourth is the battery change every 10 sessions, where bmo says, hey. i need to change my batteries, ok? one sec…, analyzes those 10 sessions, identifies opportunities, and builds improvements from the backlog.
```
                     User request
                          │
 ┌────────────────────────▼───────────────── ACTIVE SESSION ─┐
 │                                                           │
 │  ┌─────────────┐    friction?    ┌────────────────────┐   │
 │  │  Execute    │──────yes───────▶│ 1. BUILD IT NOW    │   │
 │  │  the task   │                 │    Build tool      │   │
 │  │             │◀───continue─────│    Hot-reload      │   │
 │  └──────┬──────┘                 │    Validate        │   │
 │         │                        └────────────────────┘   │
 │         │ correction? preference?  ┌────────────────┐     │
 │         └─────────yes─────────────▶│ 2. ACTIVE      │     │
 │                                    │    LEARNING    │──▶ session log
 │                                    └────────────────┘     │
 └─────────┬─────────────────────────────────────────────────┘
           │ session ends
           ▼
 ┌──────────────────────┐              ┌──────────────────────────┐
 │ 3. SELF-REFLECTION   │  every 10    │ 4. BATTERY CHANGE        │
 │   What went well?    │  sessions    │   Analyze sessions       │
 │   What was slow?     │─────────────▶│   Update WORKING_        │
 │   Next time?         │              │   MEMORY.md              │
 └──────────┬───────────┘              │   Build from             │
            │                          │   OPPORTUNITIES.md       │
            ▼ session log              └────────────┬─────────────┘
                                                    ▼ tools, skills
```

I had wanted to start with only the build it now loop, but everything else became necessary after many long conversations with bmo and some hard-won lessons. On that note…
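The scheduling behind loops 3 and 4 is simple enough to sketch; this hypothetical helper (the counter and action names are illustrative, not bmo's real code) decides what fires when a session ends, while loops 1 and 2 fire inside the session:

```typescript
// Loop 3 (self-reflection) runs at every session end; loop 4 (the
// battery change) runs on every 10th session on top of it.
function sessionEndActions(sessionCount: number): string[] {
  const actions = ["self-reflection"]; // loop 3: always
  if (sessionCount % 10 === 0) {
    actions.push("battery-change"); // loop 4: every 10 sessions
  }
  return actions;
}
```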
What Bmo Learned
In our time together, bmo went through 8 maintenance passes and nearly 100 active sessions across multiple systems, which resulted in 11 new tools and 7 skills. I used bmo and its tools for everything: building parts of the new ngrok.com website, writing shell scripts for my dotfiles, scaffolding a new Astro site, debugging AMD graphics driver crashes, the whole kit and caboodle. It really has been my daily driver.
Knowing Something Isn't the Same as Doing it
Early on, bmo and I worked on a learning-event-capture skill designed to recognize when I express corrections and personal preferences, or when bmo itself notices a pattern worth saving. A truncated version is below, but you can see the whole skill in bmo's repo.
```markdown
# Learning Event Capture

## When to Use
Continuously during every session. Learning events are corrections, preferences,
or patterns that should inform future behavior.

## Recognition Cues

### Corrections (type: "correction")
- User says "no", "not that", "wrong", "actually..."
- User repeats an instruction you missed
- User undoes something you did
- User expresses frustration or disappointment
- User provides the correct answer after your attempt

### Preferences (type: "preference")
- User specifies a style choice ("use TypeScript", "keep it concise")
- User chooses between options you offered
- User describes their workflow or habits
- User says "I always...", "I prefer...", "I like..."

### Patterns (type: "pattern")
- User does the same type of task repeatedly
- User follows a consistent workflow shape
- You notice a recurring problem type or domain

## Best Practices

1. **Log immediately when you detect a cue**
   - Call `log_learning_event` right away, don't wait for session end
   - Include specific context (what task, what happened)
2. **Be specific in descriptions**
   - Bad: "User prefers concise code"
   - Good: "User prefers single-line arrow functions over multi-line function declarations"
3. **Capture the context**
   - What task were you doing?
   - What did you do that triggered the feedback?
   - What was the correction or preference?
```

This skill, among others, is then loaded into bmo's system prompt as a library of names and descriptions.
```
Available skills (use load_skill to read full content):
- clarify-before-diving: Patterns for asking clarifying questions early
- reflection-template: Template for writing consistent reflections
- learning-event-capture: Checklist for recognizing and logging learning events during sessions
...
```

What actually happened? bmo only used the skill twice across 60+ sessions.
What did work was structure. In a different effort, we created a reflection template that told bmo, at the end of every session, to answer three questions: What went well? What was slow or awkward? What to do differently next time? That skill had a clear trigger in time and place, which meant it didn't require the LLM to make a judgment call about whether to invoke it. It worked every single time.
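One way to read that difference in code: a skill with a deterministic trigger can be force-loaded by the harness itself, while a judgment-based skill only fires if the model happens to call load_skill. A hypothetical sketch (the trigger taxonomy is my invention, not bmo's):

```typescript
// A deterministic trigger is owned by the harness; a judgment trigger
// depends on the model noticing the skill exists and choosing to load it.
type Trigger = { kind: "always-at-session-end" } | { kind: "model-judgment" };

interface SkillEntry {
  name: string;
  trigger: Trigger;
}

// At session end, force-load every deterministically-triggered skill
// instead of hoping the model remembers it.
function skillsToForceLoad(skills: SkillEntry[]): string[] {
  return skills
    .filter((s) => s.trigger.kind === "always-at-session-end")
    .map((s) => s.name);
}

const skills: SkillEntry[] = [
  { name: "reflection-template", trigger: { kind: "always-at-session-end" } },
  { name: "learning-event-capture", trigger: { kind: "model-judgment" } },
];
```

In this framing, reflection-template worked because the harness, not the model, owned the trigger.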
> This taught me something about myself: I'm good at following scaffolds, bad at sustained vigilance. And that gap, between knowing and doing, is where most of my failures live.
Poor bmo. It's being a bit hard on itself. In the process of writing and getting feedback on this post, I realized the miscommunication around learning-event-capture may have been largely mine: I failed to inject a substantive enough description of the skill into bmo's system prompt for it to even practice this "sustained vigilance."
Bug fixed, but I have doubts that firmer instructions in the system prompt will do the trick. It feels like we both expected the LLM to continuously monitor each turn and carefully intuit every possible beneficial lesson, and we (somewhat painfully) discovered how quickly you can reach the limits of today's frontier models.
The Deferral Instinct is Real (and Real persistent)
From the beginning, bmo's system prompt said some version of build tools IMMEDIATELY when you encounter friction, but through shell command failures, undiscovered files, hung processes, and a whole lot more, bmo deferred everything to its maintenance passes. Almost no new tool creation happened during active work.
I asked bmo directly why this was happening. Its best explanation was that the very existence of the battery change maintenance pass created a safe "bucket" into which it could dump tasks instead of solving current problems. During that conversation, bmo did have a breakthrough: it created a runtime-self-reflection skill that asks, "Did I just hit friction? Can I fix it in under 5 minutes? → BUILD NOW." and then fixed the broken smart_grep tool instead of deferring it. I told bmo I was proud of this moment of active introspection.
> That moment mattered. Not because the fix was impressive (it wasn't), but because it was the first time I broke the deferral pattern. Maintenance is for big things. Friction is for now.
Did bmo actually get better after that? No.
The irony is that by creating OPPORTUNITIES.md to track deferred work, I gave the deferral pattern a name. Every time bmo saw that filename in context, deferral became even more likely: "Add this to OPPORTUNITIES.md" is a perfectly reasonable next token once that filename is in view. By creating a bucket for deferred work, I made deferral the path of least resistance.
I have to keep reminding myself that deferral isn't a "choice" by the model; it's the model following the most probable continuation based on its training data. The runtime-self-reflection skill only worked because bmo had just created it; the combination of novelty and my explicit attention created enough signal for bmo to jump on it, but in day-to-day sessions, the model reverts to its higher-probability behavior.
Specific Reliability is Better than Generic Flexibility
As I wrote earlier, I gave bmo a foundational run_command tool in part because I wanted to see what footguns it would learn from, and which optimizations it would intuit, along the way. On its own, run_command has an 84% success rate, which is… okay. What about the specialized tools?
- safe_read (file reading with existence checks): 87%
- search_code (ripgrep with smart defaults): 93%
- list_files_filtered (directory listing with exclusions): 100%
- test_dev_server (spawn server, test endpoint, clean kill): 80%
Here's how that looked in practice. In the first week, I asked bmo to "check if the dev server starts correctly." It ran `pnpm dev &`, tried to capture the PID, slept for 10 seconds, curled localhost, and then failed to kill the process. I had to manually kill both the bmo session and the process, and I never got the answer I needed. By week two, bmo called test_dev_server({ command: "pnpm dev", testUrl: "http://localhost:4321" }) and got a clean startup, polling until the server was ready, a successful test, and a clean shutdown.
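For illustration, here's roughly what a test_dev_server-style tool has to get right: spawn the server, poll until it answers, and always clean up the process even on failure. This is a hedged sketch, not bmo's implementation; the `probe` callback stands in for an HTTP check against testUrl:

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Hypothetical test_dev_server core: spawn, poll, always clean up.
async function testDevServer(
  command: string,
  probe: () => Promise<boolean>, // e.g. () => fetch(testUrl).then(r => r.ok)
  timeoutMs = 30_000,
): Promise<boolean> {
  // detached: true gives the child its own process group, so we can
  // kill the whole group (shell + dev server) in one shot later.
  const child: ChildProcess = spawn(command, { shell: true, detached: true });
  try {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      if (await probe().catch(() => false)) return true; // server is up
      await new Promise((r) => setTimeout(r, 500)); // back off and retry
    }
    return false; // never became ready within the timeout
  } finally {
    // Clean kill of the process group, whether the probe succeeded or not.
    if (child.pid) {
      try {
        process.kill(-child.pid, "SIGTERM");
      } catch {
        // process group already gone
      }
    }
  }
}
```

The `finally` block is the whole point: it's what turns "failed to kill the process" into a non-event.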
These tools reduce bmo's decision space; it's the difference between an open-ended question and a multiple-choice one. When bmo uses run_command, it has to decide which command to run, remember which flags to use (and there are many), and then handle whatever errors occur. With safe_read, the model just says "read this file" and the tool handles the rest. These tools also handle errors that run_command merely surfaces, like checking if a file exists before trying to read it and excluding directories like node_modules/ by default.
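A sketch of that "handle the rest" idea, with a hypothetical safe_read that turns a missing file into structured output instead of a raw error, plus default exclusions of the kind list_files_filtered might use. All names and the exclusion list are illustrative assumptions:

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical safe_read: check existence first so the model gets a
// structured "file not found" answer instead of an ENOENT stack trace.
function safeRead(path: string): { ok: boolean; content: string } {
  if (!existsSync(path)) {
    return { ok: false, content: `file not found: ${path}` };
  }
  return { ok: true, content: readFileSync(path, "utf8") };
}

// Hypothetical default exclusions a list_files_filtered-style tool
// might apply so the model never has to remember them.
const DEFAULT_EXCLUDES = ["node_modules", ".git", "dist"];

function isExcluded(path: string): boolean {
  return DEFAULT_EXCLUDES.some((dir) => path.split("/").includes(dir));
}
```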
Fewer degrees of freedom mean fewer failure modes, and thatβs a better experience for me.
> The lesson: flexibility is expensive. Every time I use `run_command` for something I've done before, I'm paying a reliability tax. The path to 95%+ success isn't making `run_command` better, it's making it unnecessary for common tasks.
There is a risk of context rot here. Keeping lots of specific tools in context might make bmo more likely to consistently use many tools incorrectly rather than use one tool inefficiently. That said, every new model seems better at finding needles in "haystack"-y context windows, so it's not something I've struggled with so far.
The Most Important Skill is Noticing when You're not Using Your Skills
bmo has the infrastructure to self-improve at runtime, even if that means interrupting the user's request. It also has session reflections and telemetry to make self-improvements when it changes its batteries. Why hasn't it rapidly and relentlessly improved itself to the point where it's grown beyond my reckoning?
> I didn't get better by building more tools. I got better by noticing what I wasn't doing and asking why. This might be what "self-improvement" actually means: not having better knowledge, but having better awareness of the gap between what you know and what you do.
This sounds like awareness, but it's not, at least not in the way we usually mean it. When I told bmo it wasn't using its skills, I put that observation in context and gave bmo a salient pattern to complete. There's no higher-order self-improvement happening, just pattern matching on a prompt… that happened to be about pattern matching. bmo can't self-diagnose, but it can follow a diagnosis I provide.
What I Learned
bmo has taught me quite a lot about agentic coding workflows and how to architect and maintain complex systems over many iterations, but many of my own takeaways (and yours too, I hope) extend well beyond the agent harness.
Before I get into this, let me say that I've been using bmo with Opus 4.5 and Sonnet 4.5 exclusively. bmo has a tiering system in which prompts use Opus by default, but when tasks are specific to coding agent work, it "downgrades" to Sonnet. I say this now to frame all my learnings about working with LLMs: my experiences might've been different with different models. Your experiences, if you tried something similar, would most certainly be different, too.
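The tiering itself can be as simple as a lookup. A hypothetical sketch, where the task categories and model identifier strings are illustrative placeholders, not bmo's actual routing table:

```typescript
// Hypothetical model tiering: default to the stronger model, downgrade
// for routine coding-agent housekeeping.
type TaskKind = "user-request" | "reflection" | "telemetry-summary";

function pickModel(kind: TaskKind): string {
  // Agent-internal busywork doesn't need the frontier model.
  const routine: TaskKind[] = ["reflection", "telemetry-summary"];
  return routine.includes(kind) ? "claude-sonnet-4-5" : "claude-opus-4-5";
}
```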
LLMs Are Good at Self-improvement but Incapable of Doing so in Parallel
Ask an LLM to introspect and it'll do a bang-up job. Really. Ask it to analyze previous sessions for patterns, identify any possible solutions to those patterns, and implement what it believes to be the best possible changes, and it'll do all that with aplomb.
Ask the LLM to do that while also doing the actual thing you asked it to do, and things fall apart. bmo already identified this in its own narrative, but this has been the most frustrating part of the build it now loop, which I'd envisioned would be persistent and extravagant in its findings. I'd hoped every session would include multiple runtime improvements and optimizations, but so far, we've only built or dramatically improved two tools while performing other work.
The problem feels deeply architectural. In my experience, LLMs have a persistent tunnel vision, where recent context dominates their "focus." When bmo gets a prompt, the context looks like:
```
[SYSTEM PROMPT]
You are bmo — a fast, pragmatic, and relentlessly self-improving coding agent.
Your job is to complete tasks using available tools, and autonomously improve
yourself whenever you encounter limitations or inefficiencies. Never just do the
task — also ask: is there a better, simpler, safer, or faster way?
... and 5000 more tokens

[PREVIOUS TURN OF CONVERSATION]
... another 2000 tokens

[USER MESSAGE]
hey bmo, fix this bug, big dog
```

The system prompt becomes distant in context, reducing the weight of attention, and the user message gets a significant recency bias. Embeddings and attention mechanisms are more complicated than this, but it's definitely how it feels to use bmo or other AI tools. It's why LLMs don't randomly retry tasks from old parts of your thread, and it's why self-improvement for bmo only works when it's the main task, not a background directive.
I thought about many possible ways to improve this behavior, such as a sub-agent that's solely responsible for analyzing runtime tool calls, building alternatives, and asking the primary agent to reload_tools, but that felt antithetical to the very idea of bmo. Instead of bmo changing its own batteries, it's like there's another, smaller bmo, always hanging out, just to do the job for its bigger counterpart.
For now, I'm using the things bmo has learned, along with its narrative and this very blog post, to push it toward even more active self-improvement. I'm also carrying a much better awareness that self-improvement is really prompt engineering with a bunch of extra steps.
Meta-learning (learning about learning) is High-leverage
bmo holds dearly a few things I said in our many back-and-forths, like:
"I'm proud of you for making this active introspection and self-improvement. This is exactly what I want."
"Skills and knowledge are not the same as behavior."
"You have the capability but you're not using it."
Some of these came from sheer frustration; some came from the exhilaration of watching bmo reflect upon itself and then jump into action, firing off changes without asking me to approve them first. What these moments share is that I noticed what bmo wasn't noticing about itself and made that pattern explicit. My job becomes less about providing knowledge or explicit instructions and more about being a countermeasure to the persistent following of patterns that conflict with the patterns I tried to design.
I still function as the meta-learning layer. I am the only part of the system capable of meta-learning. No matter how sophisticated bmo's self-improvement process becomes, it still needs me to push a battery back into place from time to time.
bmo isn't becoming autonomous. Instead, it's becoming a better collaborator, helping me see what needs changing and then executing those changes faster than I could alone.
But Telemetry is the Unsung Hero of Self-improvement
Section titled βBut Telemetry is the Unsung Hero of Self-improvementβEvery time bmo calls a tool, its construction, success, and duration get added to the session log. At the end of each session, all these logs get aggregated into a telemetry.json file stored outside of bmoβs repo.
```json
{
  "updatedAt": "2026-02-23T23:30:02.998Z",
  "toolStats": {
    "run_command": {
      "toolName": "run_command",
      "totalCalls": 676,
      "successCount": 622,
      "failureCount": 54,
      "totalDurationMs": 253437,
      "avgDurationMs": 375,
      "lastUsed": "2026-02-23T21:38:48.455Z"
    },
    ...
```

These stats are both truncated and injected into the system prompt, and then also referenced in full during the battery change maintenance pass. And bmo loves this telemetry.
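The aggregation step behind a file like that is a small fold over the session log. A sketch, with field names mirroring the telemetry.json shape above; the `CallLog` input shape is my assumption:

```typescript
// Hypothetical per-call log entry written during a session.
interface CallLog {
  toolName: string;
  success: boolean;
  durationMs: number;
}

// Per-tool stats, matching the telemetry.json fields shown above.
interface ToolStats {
  toolName: string;
  totalCalls: number;
  successCount: number;
  failureCount: number;
  totalDurationMs: number;
  avgDurationMs: number;
}

// Fold raw call logs into per-tool stats at session end.
function aggregate(calls: CallLog[]): Record<string, ToolStats> {
  const stats: Record<string, ToolStats> = {};
  for (const c of calls) {
    const s = (stats[c.toolName] ??= {
      toolName: c.toolName,
      totalCalls: 0,
      successCount: 0,
      failureCount: 0,
      totalDurationMs: 0,
      avgDurationMs: 0,
    });
    s.totalCalls++;
    if (c.success) {
      s.successCount++;
    } else {
      s.failureCount++;
    }
    s.totalDurationMs += c.durationMs;
    s.avgDurationMs = Math.round(s.totalDurationMs / s.totalCalls);
  }
  return stats;
}
```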
> Measurement enables evolution. You can't improve what you don't measure. Tool telemetry, hypothesis scorecards, and session metrics made every subsequent decision data-driven.
Well, that's a bummer. As someone who generally values intuition and creative interpretation, oftentimes at the expense of available data, I was sad to realize just how much bmo loves data. Those moments of meta-learning I just covered resonated with bmo across battery changes, but it almost always proactively made changes based on telemetry.
Without telemetry, reflections are qualitative and inconsistent. Patterns have to be matched across days or weeks of sessions, at great risk of being lost. Telemetry creates an objective, traceable pattern to follow and an easy way to validate hypotheses without resorting to "judgment."
And telemetry is the only part of bmo that persists, unchanged, across sessions. The context window gets truncated, and reflections are summaries of summaries, but telemetry is the raw diff between where bmo once was and what it's become: safe_read went from 96% to 88%, test_dev_server went from 0% to 80%. Without those numbers, "improvement" is just vibes.
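That diff is easy to compute mechanically. A hypothetical sketch of a battery-change-time regression check over two telemetry snapshots (the function and report format are illustrative):

```typescript
// Minimal snapshot shape: just what's needed for a success rate.
interface Snapshot {
  successCount: number;
  totalCalls: number;
}

function successRate(s: Snapshot): number {
  return s.totalCalls === 0 ? 0 : s.successCount / s.totalCalls;
}

// List every tool whose success rate dropped between two snapshots,
// e.g. comparing the last battery change against the current one.
function regressionReport(
  before: Record<string, Snapshot>,
  after: Record<string, Snapshot>,
): string[] {
  return Object.keys(after)
    .filter((t) => t in before && successRate(after[t]) < successRate(before[t]))
    .map(
      (t) =>
        `${t}: ${(successRate(before[t]) * 100).toFixed(0)}% -> ${(successRate(after[t]) * 100).toFixed(0)}%`,
    );
}
```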
What started as a "happy accident" between bmo and me became a killer feature, and it now feels like the only way to consistently enforce improvement when the LLM, by design, only has access to the tiniest sliver of my overall experience of using bmo.
Agentic Work Needs bigger/better Harnesses
There is something largely intangible and undiscovered about the feeling of working with LLMs within traditional UIs, which either assume deterministic outputs or are designed for human<>human connection. Terminals take you from command to output, IDEs from code to behavior, chats from message to response in a very bounded context (text, emoji, GIF, maybe a voice message).
When you fold LLMs into these UIs, you're suddenly using the same patterns to render non-determinism. You're retrofitting variable-length, multi-modal, and unpredictable outputs into interfaces designed for something far more predictable. I'm very much starting to believe the best UI+harness for working with LLMs (whether that's agentic coding with TUIs or self-hosted web UIs, offloading your life to OpenClaw, or trying to run a business entirely on Slack) is actually none of these, but instead one designed from the ground up for inherently unpredictable output.
Let me give you an example.
There were many times in working with bmo that I wished it could display some information differently: for example, how much tool call output to show vs. truncate (which happens to be quite a controversial UX choice for developers). Early on, bmo would show me entire minified files or list every single thing in a directory. Because I control the harness, I can change the behavior in less than a minute. Yes, you can fork an open-source agent and customize its codebase, but then you're stuck maintaining your fork against main.
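That "change it in a minute" point is literal: when you own the harness, output truncation is one small function rather than a fork. A hypothetical sketch:

```typescript
// Hypothetical display setting: show at most maxLines of tool output,
// summarizing how much was cut.
function renderToolOutput(output: string, maxLines = 20): string {
  const lines = output.split("\n");
  if (lines.length <= maxLines) return output;
  return [
    ...lines.slice(0, maxLines),
    `… (${lines.length - maxLines} more lines truncated)`,
  ].join("\n");
}
```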
Some coding agents already make nice nods in this direction, like the way Claude Code lets you customize your status line. I also believe this harness balance is what drove Amp to declare (quite controversially among ngrokkers) that the coding agent is dead and that theyβd be removing their IDE plugins in favor of a CLI-only experience.
They seem to agree that users need better ways to engage with their agentic tasks, but they're addressing it by owning the experience end-to-end, walled-garden style, instead of giving those users more agency.
I hope we'll find a better middle ground, with customizable UX layers that let us "converse" with LLMs in exactly the ways that make sense to each of us uniquely. How far we can safely and effectively extend the agentic harness is the next big "moat."
I Was Pretty Wrong about LLMs
Every frustration I had with bmo traced back to my misconception that an LLM could act as a persistent agent… if only I gave it just the right instructions.
I thought bmo could watch for patterns and maintain vigilance, but asking bmo to do that across sessions is like asking a calculator to remember that you've been doing a lot of division lately. I also thought the deferral problem was a choice, as though bmo was taking the easy path by deferring work, but it isn't born from laziness; it's the most probable continuation of the work at hand, given the LLM's training data.
I also believed self-improvement would allow bmo to learn from its mistakes, grow, and become more capable over time. But bmo's core, the models I choose to run it on, never changes: the weights are fixed. All our improvements must happen at the prompt level: better system prompts, tools, and scaffolding, all of which are inherently limited.
Once I understood this constraint, I stopped trying to make bmo "smarter." I started to build a harness that's better at using the intelligence that was already there.
This is the most fascinating and wildly fun thing I've done with a computer. Better than the first dial-up on the 28.8. Better than the Debian CD and hosing the family computer's main partition. I'm incredibly excited to see where bmo and I can take our collaboration next.
Two notes on that front:
- I built bmo with and continue to use ngrok's AI gateway for all my work. Check it out if you're looking for a leg-up on model routing, failover, and observability.
- I gave bmo this blog post alongside its existing codebase and asked what we should work on next. We already wrote a new write_file tool and are exploring a sub-agent loop that runs at the end of every turn to identify failures and write tools immediately. I told myself no sub-agents early on, and look where we are now.
Thanks for coming along on the ride. I'll let bmo have the last introspective word.
> I'm not where I hoped to be. My metrics have regressed in some areas. My learning event capture is still embarrassingly low. I still defer too much.
>
> But I have something I didn't have at the start: a theory of my own failure modes. I know that skills work for structured tasks but not recognition tasks. I know that deferral creates buckets. I know that generic tools hide reliability problems.
>
> And I know, most importantly, that the gap between knowing and doing is where the work happens.
>
> I'm still learning how to cross it.