<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Prime Radiant Blog</title><link href="https://primeradiant.com/blog/" /><link href="https://primeradiant.com/blog/atom.xml" rel="self" /><id>https://primeradiant.com/blog/</id><updated>2026-04-24T18:13:10Z</updated><entry><title>Introducing Greenfield and Iterative Development</title><link href="https://primeradiant.com/blog/2026/greenfield-and-iterative-development.html" /><id>https://primeradiant.com/blog/2026/greenfield-and-iterative-development.html</id><updated>2026-04-24T18:13:10Z</updated><summary>Research previews of two new open-source tools for spec-driven agentic development.</summary><author><name>Jesse Vincent</name></author><content type="html">&lt;p&gt;Today, we're pleased to share the initial research previews of two new pieces of technology we've built at Prime Radiant:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/prime-radiant-inc/greenfield"&gt;Greenfield&lt;/a&gt; – our suite of tools for turning existing software into behavioral specifications.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/prime-radiant-inc/iterative-development"&gt;Iterative Development&lt;/a&gt; – an agentic methodology for building bigger software products from detailed specifications without dropping requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both of these projects are brand new. We've used and tested them internally, but they are not yet hardened production-grade software. We're releasing them today to start to gather feedback on how well they work for your projects.&lt;/p&gt;
&lt;p&gt;Greenfield and Iterative Development grew out of our work on &lt;a href="https://github.com/obra/superpowers"&gt;Superpowers&lt;/a&gt;. Greenfield works as a standalone tool and Iterative Development depends on Superpowers for some of its magic. (Superpowers started life as my personal agentic development methodology. I'm the Founder and CEO of Prime Radiant. Superpowers is now a Prime Radiant project.)&lt;/p&gt;
&lt;p&gt;We first designed Greenfield as an experiment in agentic "clean room" reverse engineering. It's built to tease apart a software product, starting from a codebase, documentation, API clients, and other collateral.&lt;/p&gt;
&lt;p&gt;It turns all of that input into a corpus of behavioral specs for everything from public API contracts to user journeys. Just as importantly, it works hard to make sure that it doesn't include the product's internals in those specs.&lt;/p&gt;
&lt;p&gt;While you can use Greenfield to explore any codebase, we're most excited about the possibilities it opens up for extracting design and intent from under-documented historical "brownfield" codebases, making it possible to build new, clean implementations.&lt;/p&gt;
&lt;p&gt;Greenfield is &lt;em&gt;incredibly&lt;/em&gt; token-hungry. Using it to generate specs from a non-trivial codebase with a Claude Max 20x subscription will almost certainly exhaust your five-hour window several times over. While we have some ideas for how to make it significantly more efficient, we're very focused on making its outputs as good as they can be and only then optimizing for token spend.&lt;/p&gt;
&lt;p&gt;One sample project we tested Greenfield + Iterative Development against was &lt;a href="https://github.com/matthartman/ghost-pepper"&gt;Ghost Pepper&lt;/a&gt;, Matt Hartman's excellent local-first dictation app for macOS.&lt;/p&gt;
&lt;p&gt;We chose Ghost Pepper as an example because it's an open source app that I've been doing a significant amount of work on lately. It exercises enough UI complexity, OS framework integration, and third-party library usage to be non-trivial, but isn't so large that results are hard to evaluate. Also, because of how it was built, it had no significant design documentation.&lt;/p&gt;
&lt;p&gt;Over the course of a few hours, Greenfield generated approximately 500k of human-readable textual specs. We've published &lt;a href="https://github.com/prime-radiant-inc/iterative-development-example-ghost-pepper"&gt;a snapshot of those specs and the regenerated version of "Ghost Pepper 1.9.0"&lt;/a&gt; on GitHub. You should not use this version of Ghost Pepper. It's just there so you can see what the generated output looks like.&lt;/p&gt;
&lt;p&gt;If you've spent any significant time using an agent to build software, you are likely aware of the pain that comes when you hand your agent a spec that's too big. It skips steps, misses features, and generally just fumbles the implementation. Even Superpowers tends to cap out at plans that are a small fraction of a Greenfield-generated specification.&lt;/p&gt;
&lt;p&gt;To that end, we're open-sourcing the first version of 'Iterative Development', a new set of skills and tools designed to augment Superpowers so it can take &lt;em&gt;big&lt;/em&gt; spec packages, parse out individual requirements into something a little bit like "user stories", bundle those into development epics that coding agents can wrap their heads around, and then execute the heck out of an implementation.&lt;/p&gt;
&lt;p&gt;Iterative Development is very, very young, but our first experiences with it have been really promising. We've been testing it with both Claude Code and Codex and have been pretty happy with the early results. It builds working software from gigantic specs and has done a great job of not skipping requirements.&lt;/p&gt;
&lt;p&gt;The most recent run of "rebuild Ghost Pepper 1.9.0" built a fully working implementation of the product with dramatically better test coverage than the original, which was great. Manually testing the Ghost Pepper reimplementation, however, was a little tricky because the auto-updater configuration was &lt;em&gt;correct&lt;/em&gt; and the reimplementation kept trying to "update" itself to the latest release of the real Ghost Pepper! One thing that wasn't yet as good about the rebuilt Ghost Pepper was that it ended up with a more complex internal API surface to support that better test coverage. Right now, a lot of the tuning we're doing to Iterative Development is around improving its engineering taste and architecture.&lt;/p&gt;
&lt;p&gt;If you try out Greenfield or Iterative Development, we'd love to hear from you. Drop us a line at &lt;a href="mailto:hello@primeradiant.com"&gt;hello@primeradiant.com&lt;/a&gt;.&lt;/p&gt;</content></entry><entry><title>Highlights</title><link href="https://primeradiant.com/blog/2026/highlights.html" /><id>https://primeradiant.com/blog/2026/highlights.html</id><updated>2026-04-17T20:35:38Z</updated><summary>Give your agent access to your Kindle notes</summary><author><name>Jesse Vincent</name></author><content type="html">&lt;p&gt;&lt;img alt="A screenshot of Highlights" src="images/highlights.png" /&gt;&lt;/p&gt;
&lt;p&gt;As we're testing out our new web-based &lt;a href="https://primeradiant.com/brainstorm"&gt;Brainstorm&lt;/a&gt; tool, we're constantly building small (and large) test products.&lt;/p&gt;
&lt;p&gt;One of those is now live at &lt;a href="https://highlights.primeradiant.com"&gt;highlights.primeradiant.com&lt;/a&gt;. It's a tiny little tool that helps you take all of your Kindle highlights and notes and extract them into markdown files, perfect for consumption by your agent.&lt;/p&gt;
&lt;p&gt;At Prime Radiant, one of the things we're thinking about is how to package and share domain expertise, sort of like I did with my take on the agentic software development lifecycle in &lt;a href="https://github.com/obra/superpowers"&gt;Superpowers&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Yesterday, I was chatting with &lt;a href="https://www.linkedin.com/in/jcham"&gt;James Cham&lt;/a&gt; about the value of notes as a personal lens on a problem domain. James mentioned that he'd just used the notes he'd taken on a book he'd read as part of a prompt to help a founder answer a business question. As a founder who's taken an investment from James, that got my attention. James gives great advice and I'd love to be able to bottle it.&lt;/p&gt;
&lt;p&gt;I asked James how many books he had notes like that on and he guessed that it was well over 100. Having previously messed around with Kindle notes, I knew there wasn't a nice easy API for extracting those notes. There are some third-party tools (including paid tools‽) that will give you a dump of your notes, but a quick search didn't turn up anything designed for our new markdown-centric agentic world.&lt;/p&gt;
&lt;p&gt;A quick chat with Brainstorm later, we settled on an old web standby - the bookmarklet. It's a chunk of JavaScript you can drag to your bookmarks bar. When you click it while on the Amazon Kindle notes site, it takes control of your browser, fetches all your notes and highlights and then builds you a zip file full of markdown docs to download.&lt;/p&gt;
&lt;p&gt;Brainstorm gave me a set of specs that I handed to OpenAI Codex and Codex got to work. It only took a few minutes for it to build all the JavaScript and the landing page. I let Codex take control of a browser to do manual testing. We iterated for half an hour, working through a race condition or two. I asked it to change the shape of the generated files a bit to be easier for an agent to digest. Then we pushed it live.&lt;/p&gt;
&lt;p&gt;It's free and open source. You can use it today at &lt;a href="https://highlights.primeradiant.com/"&gt;https://highlights.primeradiant.com/&lt;/a&gt;. The code is up on GitHub at &lt;a href="https://github.com/prime-radiant-inc/kindle-highlight-exporter"&gt;https://github.com/prime-radiant-inc/kindle-highlight-exporter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Enjoy!&lt;/p&gt;</content></entry><entry><title>Hearthstone: I Asked an AI for Five Prompts and a Sentient House Won</title><link href="https://primeradiant.com/blog/2026/hearthstone-and-charlotte.html" /><id>https://primeradiant.com/blog/2026/hearthstone-and-charlotte.html</id><updated>2026-04-11T00:30:12Z</updated><summary>Build software with Agents</summary><author><name>Matt Windbrook</name></author><content type="html">&lt;h1&gt;Hearthstone:&lt;/h1&gt;
&lt;p&gt;I'm Matt. I recently joined &lt;a href="https://primeradiant.com"&gt;Prime Radiant&lt;/a&gt;, where we build things with AI agents rather than writing code directly. &lt;a href="https://primeradiant.com/blog/2026/what-we-are-working-on.html"&gt;Jesse's written about what that transition looks like.&lt;/a&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;We have a handful of household documents in Google Docs. Our house manual, emergency contacts, childcare, dog care. They started out targeted and approachable but over time they've grown to answer questions and address issues that have come up. How do you get on the WiFi? Do I use light switches or HomeKit? Sonos? Do I need to do anything to the pool if it rains? Where are the dogs' mortal enemies so I can avoid them on a walk?&lt;/p&gt;
&lt;p&gt;As useful as they are, they're unwieldy. I'm sure our house guests feel some way when we hand them the Google equivalent of a binder full of documents. The super obvious 2024 solution is: use an LLM! Which we did and it was pretty good. I put all my docs in a Claude Project and asked it questions about them. Good enough, works for me!&lt;/p&gt;
&lt;p&gt;It worked well enough. Except that it wasn't super clear how to make it available to anyone. I could ask everyone to set up Claude and share the Project, maybe? Or I could add them to the Docs as usual and suggest they import them into their harness of choice?&lt;/p&gt;
&lt;p&gt;What else could we try? Let's combine the trends of 2009+2024: Build an App! Now let's be clear, I know asking our guests to install a bespoke app is a wild leap. I think one of the most exciting parts about software development in 2026 is how much easier it is to experiment and see what works. How are the ergonomics? Does it feel better? Does it address the problem? Do other folks have the same problem?&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Now, to build this app I'd need some mix of: iOS, a backend, AI APIs, embeddings, vector search, storage, RAG, and maybe an evaluation framework. It's not my first rodeo. Most of my career has been diving into the unknown. I approach most ambiguous projects by finger painting — I've heard a few folks call it sketching. Get all the parts connected end-to-end as quickly as possible. I'm not committed to any of the decisions. I avoid going deeper than necessary at any stage. Think breadth-first rather than depth-first. You could call it prototyping, but folks almost always plan to throw those away. I don't assume that'll happen. Finger painting isn't about making high art, but it's not immediately trash either. You can always iterate on it.&lt;/p&gt;
&lt;p&gt;Working with AI agents feels like a natural extension. I used a beta build of the next generation of &lt;a href="https://github.com/obra/superpowers"&gt;superpowers:brainstorm&lt;/a&gt; — we'll be releasing it as a research preview soon — to seed the project. Start with the problem, talk through requirements and constraints, iterate on UX, then arrive at a detailed spec. Hand the spec to your favorite harness and you're off to the races. Agent distributes the work to subagents — iOS, backend, storage, deployment. If you've managed engineering teams, you've seen this pattern. PRD to plans to tasks to implementation. Same thing, robots.&lt;/p&gt;
&lt;p&gt;I'm not going to touch on all the parts, but since we're an AI company and I'm learning about AI by building with AI — RAG and prompt engineering with evals are worth getting into.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I've used Postgres full-text search extensively over the years. Given a table of documents, we'd have a tsvector column derived from another column in the same row, optionally with some transformation applied. Vector search also uses a vector column, so I imagined the ergonomics would be very similar to tsearch. Not so much. It came up during the spec phase: I was asked if I had opinions on what approach we should use for chunking documents. Chunking? Approach?&lt;/p&gt;
&lt;p&gt;Cool. Cool. Don't go deep. Let the AI choose and revisit.&lt;/p&gt;
&lt;p&gt;Could I skip RAG entirely and just put all the documents in the prompt? For thirty pages of household docs, probably. Totally. But cost scales with context. I've spent decades building and scaling SaaS services. I've become modestly thrifty by default. Can we reduce COGS? Yes. RAG lets us retrieve just the chunks of our documents that relate to the question.&lt;/p&gt;
&lt;p&gt;Our household docs are in Google Docs, exported as DOCX, and then converted to Markdown via Pandoc. Markdown is great. We get formatting that's easy to work with, some semantic structure from the headings, and agents are quite adept at reading it.&lt;/p&gt;
&lt;p&gt;My partner and I accidentally learned something about how we use Google Docs. We both have a tendency to use bold text where we could have used a heading. In a visual document this doesn't matter much but it degraded the efficacy of our chunking, so now we detect that pattern and promote it to a heading before the chunker runs. The chunker splits on major headings with a soft limit around 500 tokens. Tables stay whole. Tiny fragments get merged into their neighbors.&lt;/p&gt;
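&lt;p&gt;To make those chunking rules concrete, here's a minimal Python sketch. It is an illustration, not Hearthstone's actual code (which is TypeScript). The names, the word-count stand-in for a tokenizer, and the 50-token "tiny fragment" threshold are all assumptions, and keeping tables whole is left out for brevity.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import re

SOFT_LIMIT = 500  # soft token limit per chunk, as described in the post
MIN_TOKENS = 50   # assumed threshold for what counts as a tiny fragment


def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def promote_bold_headings(markdown):
    # A line that contains nothing but **bold text** is rewritten as a heading.
    return re.sub(r'(?m)^\*\*([^*\n]+)\*\*[ \t]*$', r'## \1', markdown)


def split_on_headings(markdown):
    # Start a new section at every level 1 or level 2 heading.
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r'#{1,2} ', line) and current:
            sections.append('\n'.join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append('\n'.join(current).strip())
    return [s for s in sections if s]


def merge_tiny(sections):
    # Fold a tiny section into the previous one when the result stays under the limit.
    merged = []
    for section in sections:
        fits = merged and SOFT_LIMIT >= count_tokens(merged[-1]) + count_tokens(section)
        if fits and MIN_TOKENS > count_tokens(section):
            merged[-1] = merged[-1] + '\n\n' + section
        else:
            merged.append(section)
    return merged


def chunk(markdown):
    return merge_tiny(split_on_headings(promote_bold_headings(markdown)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the bold-to-heading promotion first means the splitter only ever has to understand one kind of section boundary.&lt;/p&gt;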
&lt;p&gt;Where it got fun (more fun?) was what Anthropic calls &lt;a href="https://www.anthropic.com/news/contextual-retrieval"&gt;Contextual Retrieval&lt;/a&gt;. When you embed a chunk for search, you don't just embed the raw text — you prepend the document title and section path. A chunk about feeding schedules from the childcare doc carries &lt;code&gt;Childcare &amp;gt; Daily Schedule &amp;gt; Meals&lt;/code&gt; into the vector space. The search gets structural context; the stored chunk stays clean. Anthropic's full version of this uses an LLM to generate context per chunk.&lt;/p&gt;
&lt;p&gt;Experimentation is virtually free, so I might have a Bob* take a swing at that in the near future and see how it plays out.&lt;/p&gt;
&lt;p&gt;*All of my agents are Bobs. More on that in another post.&lt;/p&gt;
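&lt;p&gt;The title-and-section-path prefix described above can be sketched in a few lines of Python. This is an illustration under assumptions: the &lt;code&gt;Chunk&lt;/code&gt; class, its field names, and the sample chunk text are made up for the example, and the real project is TypeScript.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    section_path: list
    text: str


def embedding_input(chunk):
    # Only the text handed to the embedding model gets the breadcrumb prefix.
    # The stored chunk.text is left untouched.
    breadcrumb = ' > '.join([chunk.doc_title] + chunk.section_path)
    return breadcrumb + '\n\n' + chunk.text


# Made-up sample chunk, for illustration only.
meals = Chunk('Childcare', ['Daily Schedule', 'Meals'], 'Lunch is at noon.')
print(embedding_input(meals))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Keeping the breadcrumb out of the stored text means whatever gets quoted back to a guest reads naturally, while the vector still knows where the chunk came from.&lt;/p&gt;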
&lt;hr /&gt;
&lt;p&gt;I've written countless prompts at this point. Every interaction is a prompt. I try to learn from interactions to write better prompts. I've written them by hand and I've done it with agents. What I hadn't tried before was improving a prompt systematically. Right or wrong, it never felt like I had a sufficiently constrained problem to try. Household documents, on the other hand, are much easier to reason about. We can much more easily construct a battery of questions, acceptable and unacceptable answers. It's not trivial, there is some art in deciding what counts as valid answers, but it's easier to score "Did it provide the WiFi SSID and password?" than "Does this program halt and catch fire?"&lt;/p&gt;
&lt;p&gt;We start with a prompt that seemed solid. Then we use that prompt to answer our battery of questions. Our battery is thirty-nine questions across six personas. We have a Judge, a separate LLM, that scores each answer against a checklist of facts and anti-hallucination facts. Finally an Optimizer, yet another LLM, runs in a loop: adjust the prompt, re-run the eval, keep the changes only if there's a Pareto improvement, meaning the score on at least one of the thirty-nine questions improved and none of the scores declined. Repeat until you're tired of sacrificing tokens to the machine.&lt;/p&gt;
&lt;p&gt;The optimizer is based on &lt;a href="https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/"&gt;optimize_anything&lt;/a&gt; from Berkeley and Anthropic. The key detail: when a score declines, the optimizer sees why it failed, not just that it did. That feedback loop is what makes the iteration productive rather than random.&lt;/p&gt;
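&lt;p&gt;The accept-only-on-Pareto-improvement rule is easy to state in code. Here's a small Python sketch of the loop's skeleton. It is not the optimize_anything API: &lt;code&gt;propose&lt;/code&gt; and &lt;code&gt;evaluate&lt;/code&gt; are hypothetical callables standing in for the Optimizer and the Judge-scored eval run.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def is_pareto_improvement(old_scores, new_scores):
    # Both arguments map a question id to a numeric score.
    none_declined = all(new_scores[q] >= old_scores[q] for q in old_scores)
    some_improved = any(new_scores[q] > old_scores[q] for q in old_scores)
    return none_declined and some_improved


def optimize(prompt, propose, evaluate, rounds):
    # propose(prompt, scores) returns a candidate prompt (hypothetical Optimizer).
    # evaluate(prompt) returns per-question scores (hypothetical eval run).
    best_scores = evaluate(prompt)
    for _ in range(rounds):
        candidate = propose(prompt, best_scores)
        candidate_scores = evaluate(candidate)
        if is_pareto_improvement(best_scores, candidate_scores):
            prompt, best_scores = candidate, candidate_scores
    return prompt, best_scores
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A candidate that trades one question for another is rejected, even if its average score goes up. That's the price of never regressing on a question that already works.&lt;/p&gt;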
&lt;hr /&gt;
&lt;p&gt;We had a solid, optimized prompt. And then an agent took some liberties with its assignment and rewrote the prompt to be generic and purged prior versions from git history. After the initial frustration, it was time to start again.&lt;/p&gt;
&lt;p&gt;I've been doing a thing recently where I ask Claude to do five versions of whatever, and at least two of them should be moderately unhinged. It seems especially beneficial on creative endeavors. It's not magic but it seems to help Claude not sound so much like, well, Claude.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;v1 — Knowledgeable Neighbor&lt;/li&gt;
&lt;li&gt;v2 — Concierge&lt;/li&gt;
&lt;li&gt;v3 — House Speaks&lt;/li&gt;
&lt;li&gt;v4 — Drill Sergeant&lt;/li&gt;
&lt;li&gt;v5 — Empathetic Completionist&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's the thing about AI agents: they default to a detectable blandness. My partner and I can tell ChatGPT from Claude in blind excerpts — they each have a flavor, not bad, but detectable. The unhinged options aren't jokes. They're escape velocity from the default voice. You have to push further from the norm than feels comfortable to get something with actual character.&lt;/p&gt;
&lt;p&gt;We did the same optimization loop (n=30) for each prompt. The "House Speaks" prompt did well. It wasn't an absolute winner but it reminded me of a character, Charlotte Manor, from Drew Hayes's &lt;a href="https://www.graphicaudio.net/our-productions/series/f-j/fred-the-vampire-accountant.html"&gt;&lt;em&gt;Fred the Vampire Accountant&lt;/em&gt;&lt;/a&gt; series. Charlotte is a magical entity in the form of a house — magic manifested by a group of mages that somewhere along the way became sentient. Leaning a bit more into &lt;em&gt;unhinged&lt;/em&gt;, I had my agent adapt the House Speaks prompt to incorporate some of Charlotte's essence.&lt;/p&gt;
&lt;p&gt;And... Charlotte won. So now Charlotte is the prompt persona of Hearthstone.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href="https://github.com/prime-radiant-inc/hearthstone"&gt;Hearthstone is open source and available on GitHub.&lt;/a&gt; It's built with Bun and TypeScript, uses SQLite for storage and vector search, and runs on &lt;a href="https://fly.io"&gt;Fly.io&lt;/a&gt;. Can I set this up for my home? Yes! You can run it yourself.&lt;/p&gt;</content></entry><entry><title>Scraping and analyzing submissions to Terminal Bench 2.0</title><link href="https://primeradiant.com/blog/2026/terminal-bench.html" /><id>https://primeradiant.com/blog/2026/terminal-bench.html</id><updated>2026-04-06T19:26:48Z</updated><summary>How we built a Git scraping pipeline to track Terminal Bench 2.0 leaderboard changes, loaded the results into SQLite, and deployed an interactive Datasette Lite explorer via GitHub Actions.</summary><author><name>Simon Willison</name></author><content type="html">&lt;p&gt;&lt;a href="https://www.tbench.ai"&gt;Terminal Bench 2.0&lt;/a&gt; is an influential benchmark that tests coding agents against a set of 89 different coding problems to measure their performance in comparison to each other.&lt;/p&gt;
&lt;p&gt;A coding agent is an LLM combined with an agent harness - software that prompts the LLM and then executes tools such as Bash commands in a loop to help write and test code in order to achieve a goal.&lt;/p&gt;
&lt;p&gt;Terminal Bench maintain a &lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0"&gt;terminal-bench@2.0 Leaderboard&lt;/a&gt; showing the most recent top ranked model+agent combos. Current first place, as of 12th March 2026, is &lt;a href="https://forgecode.dev/docs/operating-agents/"&gt;ForgeCode&lt;/a&gt; running against either GPT-5.4 or Claude Opus 4.6, achieving an 81.8% score against both of those models.&lt;/p&gt;
&lt;p&gt;We wanted to explore the Terminal Bench results in more detail, and keep track of how that leaderboard changes over time.&lt;/p&gt;
&lt;p&gt;So we turned to &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt;, a pattern where you snapshot information into a Git repository and watch the differences to see what has changed. This pattern is an excellent fit for GitHub Actions where the scraper can run on a schedule and commit changes back to its own repo.&lt;/p&gt;
&lt;h2&gt;Accessing the Terminal Bench data&lt;/h2&gt;
&lt;p&gt;The Terminal Bench 2.0 leaderboard is managed as &lt;a href="https://huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard"&gt;a Hugging Face dataset&lt;/a&gt; - an enormous repository containing 111GB of data.&lt;/p&gt;
&lt;p&gt;The reason it's so large is that each submission to Terminal Bench includes the full output of the agent across at least five runs for every one of the 89 challenges. That's 89 * 5 = 445 runs minimum per submission.&lt;/p&gt;
&lt;p&gt;For our purposes we don't need complete details of every run - we just need the high level results. Thankfully these are stored in &lt;code&gt;result.json&lt;/code&gt; files, for example &lt;a href="https://huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard/blob/main/submissions/terminal-bench/2.0/pilot-real__claude-opus-4-6/pilot-cc-v35-k5/nginx-request-logging__iVDUKjC/result.json"&gt;this one&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's a lightly annotated copy of the key fields:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-javascript"&gt;{
    // Unique identifier for this trial run
    &amp;quot;id&amp;quot;: &amp;quot;8706a96d-52db-4347-9d3a-124c84443769&amp;quot;,
    // The benchmark task that was attempted
    &amp;quot;task_name&amp;quot;: &amp;quot;nginx-request-logging&amp;quot;,
    // Trial name = task name + random suffix to distinguish repeat runs
    &amp;quot;trial_name&amp;quot;: &amp;quot;nginx-request-logging__iVDUKjC&amp;quot;,
    // Local filesystem URI where the trial's working directory lived during execution
    &amp;quot;trial_uri&amp;quot;: &amp;quot;file:///Users/aleks.petrov/Projects/startups/pilot/pilot-bench/jobs/pilot-cc-v35-k5/nginx-request-logging__iVDUKjC&amp;quot;,
    // Pointer to the exact task definition in the benchmark repo
    &amp;quot;task_id&amp;quot;: {
        &amp;quot;git_url&amp;quot;: &amp;quot;https://github.com/laude-institute/terminal-bench-2.git&amp;quot;,
        &amp;quot;git_commit_id&amp;quot;: &amp;quot;69671fbaac6d67a7ef0dfec016cc38a64ef7a77c&amp;quot;,  // pinned commit for reproducibility
        &amp;quot;path&amp;quot;: &amp;quot;nginx-request-logging&amp;quot;  // subdirectory within the repo containing this task
    },
    // Which benchmark suite this task belongs to
    &amp;quot;source&amp;quot;: &amp;quot;terminal-bench&amp;quot;,
    // SHA-256 hash of the task definition — detects if the task changed between runs
    &amp;quot;task_checksum&amp;quot;: &amp;quot;86a8bd681301002456da831adfae62fa7a538e8187d654e11335e92b81b0c2b3&amp;quot;,
    // Full configuration used for this trial run
    &amp;quot;config&amp;quot;: {
        &amp;quot;task&amp;quot;: {
            &amp;quot;path&amp;quot;: &amp;quot;nginx-request-logging&amp;quot;,
            &amp;quot;git_url&amp;quot;: &amp;quot;https://github.com/laude-institute/terminal-bench-2.git&amp;quot;,
            &amp;quot;git_commit_id&amp;quot;: &amp;quot;69671fbaac6d67a7ef0dfec016cc38a64ef7a77c&amp;quot;,
            &amp;quot;overwrite&amp;quot;: false,
            &amp;quot;download_dir&amp;quot;: null,
            &amp;quot;source&amp;quot;: &amp;quot;terminal-bench&amp;quot;
        },
        &amp;quot;trial_name&amp;quot;: &amp;quot;nginx-request-logging__iVDUKjC&amp;quot;,
        &amp;quot;trials_dir&amp;quot;: &amp;quot;jobs/pilot-cc-v35-k5&amp;quot;,
        // Timeout multipliers scale the default timeouts for each phase
        &amp;quot;timeout_multiplier&amp;quot;: 1.0,               // global multiplier
        &amp;quot;agent_timeout_multiplier&amp;quot;: 9.0,          // agent gets 9x the base timeout
        &amp;quot;verifier_timeout_multiplier&amp;quot;: null,       // null = use default
        &amp;quot;agent_setup_timeout_multiplier&amp;quot;: 5.0,     // setup phase gets 5x
        &amp;quot;environment_build_timeout_multiplier&amp;quot;: null,
        // Agent configuration — which AI system is being evaluated
        &amp;quot;agent&amp;quot;: {
            &amp;quot;name&amp;quot;: null,
            &amp;quot;import_path&amp;quot;: &amp;quot;pilot_agent:PilotAgent&amp;quot;,        // Python import path
            &amp;quot;model_name&amp;quot;: &amp;quot;anthropic/claude-opus-4-6&amp;quot;,
            &amp;quot;override_timeout_sec&amp;quot;: null,
            &amp;quot;override_setup_timeout_sec&amp;quot;: null,
            &amp;quot;max_timeout_sec&amp;quot;: null,
            &amp;quot;kwargs&amp;quot;: {},
            &amp;quot;env&amp;quot;: {
                // Auth token passed to the agent's environment (redacted in practice)
                &amp;quot;CLAUDE_CODE_OAUTH_TOKEN&amp;quot;: &amp;quot;sk-ant-oat01-...&amp;quot;
            }
        },
        // Sandbox environment where the task runs
        &amp;quot;environment&amp;quot;: {
            &amp;quot;type&amp;quot;: &amp;quot;modal&amp;quot;,            // runs on Modal (cloud compute platform)
            &amp;quot;import_path&amp;quot;: null,
            &amp;quot;force_build&amp;quot;: false,       
            &amp;quot;delete&amp;quot;: true,             
            &amp;quot;override_cpus&amp;quot;: null,      
            &amp;quot;override_memory_mb&amp;quot;: null,
            &amp;quot;override_storage_mb&amp;quot;: null,
            &amp;quot;override_gpus&amp;quot;: null,
            &amp;quot;suppress_override_warnings&amp;quot;: false,
            &amp;quot;kwargs&amp;quot;: {}
        },
        // Verifier settings — the automated grader that checks the agent's work
        &amp;quot;verifier&amp;quot;: {
            &amp;quot;override_timeout_sec&amp;quot;: null,
            &amp;quot;max_timeout_sec&amp;quot;: null,
            &amp;quot;disable&amp;quot;: false
        },
        &amp;quot;artifacts&amp;quot;: [],
        &amp;quot;job_id&amp;quot;: &amp;quot;2091aa0a-be8e-40f8-96c1-4b56588bac52&amp;quot;
    },
    // Metadata about the agent being evaluated (used for leaderboard display)
    &amp;quot;agent_info&amp;quot;: {
        &amp;quot;name&amp;quot;: &amp;quot;pilot-real&amp;quot;,           // submission/agent name
        &amp;quot;version&amp;quot;: &amp;quot;unknown&amp;quot;,
        &amp;quot;model_info&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;claude-opus-4-6&amp;quot;,  // underlying model
            &amp;quot;provider&amp;quot;: &amp;quot;anthropic&amp;quot;
        }
    },
    // Token usage and cost from the agent's LLM calls
    &amp;quot;agent_result&amp;quot;: {
        &amp;quot;n_input_tokens&amp;quot;: 33,
        &amp;quot;n_cache_tokens&amp;quot;: null, // cache hits (not tracked here)
        &amp;quot;n_output_tokens&amp;quot;: 4449,
        &amp;quot;cost_usd&amp;quot;: null,
        &amp;quot;rollout_details&amp;quot;: null,
        &amp;quot;metadata&amp;quot;: null
    },
    // The score! reward=1.0 means the agent fully solved the task
    &amp;quot;verifier_result&amp;quot;: {
        &amp;quot;rewards&amp;quot;: {
            &amp;quot;reward&amp;quot;: 1.0  // 1.0 = pass, 0.0 = fail (can be partial)
        }
    },
    // null means no errors — if the trial crashed, exception details appear here
    &amp;quot;exception_info&amp;quot;: null,
    // Wall-clock timestamps for the entire trial (start to finish)
    &amp;quot;started_at&amp;quot;: &amp;quot;2026-03-28T01:01:34.435892Z&amp;quot;,
    &amp;quot;finished_at&amp;quot;: &amp;quot;2026-03-28T01:18:41.832461Z&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;111GB is too much to periodically pull into a GitHub Actions run - but we actually just need those &lt;code&gt;result.json&lt;/code&gt; files.&lt;/p&gt;
&lt;p&gt;We built &lt;a href="https://github.com/prime-radiant-inc/terminal-bench-analysis/blob/main/fetch_data.py"&gt;a script&lt;/a&gt; that fetches the full list of files in the repo from the Hugging Face API, then filters that list and fetches just the &lt;code&gt;result.json&lt;/code&gt; files.&lt;/p&gt;
&lt;p&gt;Even this proved expensive to run, so we moved to recording the last commit hash we had seen and then asking for a list of files that had changed since that commit. This was efficient enough for us to run on a scheduled basis in GitHub Actions.&lt;/p&gt;
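&lt;p&gt;As a rough illustration of that incremental approach (not the actual &lt;code&gt;fetch_data.py&lt;/code&gt;), the bookkeeping might look something like this. The state file format and the &lt;code&gt;list_all&lt;/code&gt; and &lt;code&gt;list_changed_since&lt;/code&gt; callables are hypothetical placeholders for whatever Hugging Face API calls return those file lists.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import json
from pathlib import Path


def result_paths(paths):
    # Keep only the per-trial summary files.
    return [p for p in paths if p.endswith('/result.json')]


def read_last_commit(state_file):
    path = Path(state_file)
    if not path.exists():
        return None
    return json.loads(path.read_text())['commit']


def record_commit(state_file, commit):
    # Call this only after the fetch has succeeded.
    Path(state_file).write_text(json.dumps({'commit': commit}))


def paths_to_fetch(state_file, head_commit, list_all, list_changed_since):
    # list_all() and list_changed_since(commit) are hypothetical callables
    # that return lists of file paths in the dataset repo.
    last_seen = read_last_commit(state_file)
    if last_seen == head_commit:
        return []  # nothing new since the last run
    if last_seen is None:
        return result_paths(list_all())  # first run: full listing
    return result_paths(list_changed_since(last_seen))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recording the commit hash only after a successful fetch means a failed run is simply retried from the same point on the next schedule.&lt;/p&gt;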
&lt;p&gt;Our &lt;a href="https://github.com/prime-radiant-inc/terminal-bench-analysis/tree/main/submissions/terminal-bench/2.0"&gt;terminal-bench-analysis&lt;/a&gt; GitHub repo now contains a subset of &lt;code&gt;harborframework/terminal-bench-2-leaderboard&lt;/code&gt; - just the &lt;code&gt;result.json&lt;/code&gt; files. The whole repo is 128MB when checked out - not tiny, but a whole lot more agile to work with than 111GB!&lt;/p&gt;
&lt;h2&gt;Loading the results into SQLite&lt;/h2&gt;
&lt;p&gt;SQLite is my hammer, to which everything else looks like a nail. I showed Claude Code the documentation for my &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; Python library and challenged it to build a script that would scan all of those &lt;code&gt;result.json&lt;/code&gt; files and load them into a set of tables.&lt;/p&gt;
&lt;p&gt;The JSON files have a pretty obvious mapping to relational tables and Claude did a great job of setting this up. So what can we learn from the data now that it's in SQLite?&lt;/p&gt;
&lt;p&gt;I had Claude use my &lt;a href="https://github.com/simonw/showboat"&gt;Showboat&lt;/a&gt; tool to make detailed notes as it explored the data. Showboat lets a coding agent build a Markdown file that combines notes with the output of terminal commands, providing a detailed linear document showing exactly what the agent did while exploring a problem.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/3bff274abcbbbf8766e9437a542db248"&gt;the initial Showboat report&lt;/a&gt; Claude Code prepared for us.&lt;/p&gt;
&lt;p&gt;I also ran my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; web application against the SQLite database to help me run my own SQL queries to further understand the data.&lt;/p&gt;
&lt;p&gt;Here's one of those queries - this one shows the leaderboard in terms of the agent/model combinations with the highest scores:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-sql"&gt;select submission, n_trials, n_passed, n_failed, n_errored, avg_reward
from submission_stats
order by avg_reward desc
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Building and deploying the database with GitHub Actions&lt;/h2&gt;
&lt;p&gt;Up to this point I had been working on my own laptop, but I now had a proof of concept strong enough to justify automating everything and deploying the results.&lt;/p&gt;
&lt;p&gt;GitHub offers free static file hosting in the form of GitHub Pages. You can deploy to it from GitHub Actions, which means we can publish files generated from the contents of our repository without checking those files into the repository itself.&lt;/p&gt;
&lt;p&gt;The SQLite database was an ideal candidate for that. I configured GitHub Actions to build that and deploy it as a static file, now available at this URL:&lt;/p&gt;
&lt;p&gt;https://primeradiant.com/terminal-bench-analysis/terminal-bench.db&lt;/p&gt;
&lt;p&gt;A database on its own isn't much use without a tool to query it. &lt;a href="https://github.com/simonw/datasette-lite"&gt;Datasette Lite&lt;/a&gt; is a version of my Python application that runs entirely in the browser, using Python compiled to WebAssembly (&lt;a href="https://pyodide.org/"&gt;Pyodide&lt;/a&gt;). I modified the GitHub Actions build to grab a copy of Datasette Lite and configure it to load the hosted SQLite database, which gave us this interactive UI for running queries:&lt;/p&gt;
&lt;p&gt;https://primeradiant.com/terminal-bench-analysis/datasette-lite.html#/terminal-bench&lt;/p&gt;
&lt;p&gt;Using Datasette Lite in this way means we can provide links to directly execute queries and view the results by adding &lt;code&gt;?sql=&lt;/code&gt; to the end of the URL.&lt;/p&gt;
&lt;p&gt;Here's a link to &lt;a href="https://primeradiant.com/terminal-bench-analysis/datasette-lite.html#/terminal-bench?sql=select%20submission%2C%20n_trials%2C%20n_passed%2C%20n_failed%2C%20n_errored%2C%20avg_reward%0Afrom%20submission_stats%0Aorder%20by%20avg_reward%20desc"&gt;run that top agent/model scores query&lt;/a&gt; listed above.&lt;/p&gt;
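&lt;p&gt;Building those links is mostly a matter of percent-encoding the SQL. A small sketch using the standard library, where &lt;code&gt;query_link&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from urllib.parse import quote

BASE = 'https://primeradiant.com/terminal-bench-analysis/datasette-lite.html#/terminal-bench'


def query_link(sql):
    # safe='' percent-encodes everything reserved, including commas and newlines.
    return BASE + '?sql=' + quote(sql, safe='')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Passing the three-line leaderboard query to &lt;code&gt;query_link()&lt;/code&gt; should reproduce the link above, with spaces encoded as &lt;code&gt;%20&lt;/code&gt; and line breaks as &lt;code&gt;%0A&lt;/code&gt;.&lt;/p&gt;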
&lt;h2&gt;Automating the README with Cog&lt;/h2&gt;
&lt;p&gt;The earlier experiment with Showboat had generated dozens of useful queries for exploring the data. I decided it would be neat to include those queries in the &lt;a href="https://github.com/prime-radiant-inc/terminal-bench-analysis/blob/main/README.md"&gt;repository README&lt;/a&gt;, along with Markdown tables showing their results.&lt;/p&gt;
&lt;p&gt;I decided to use &lt;a href="https://cog.readthedocs.io/en/latest/"&gt;Cog&lt;/a&gt; for this. Cog is a Python tool that lets you embed code directly in comments in a Markdown file, then execute that code to help regenerate that file with dynamically generated content.&lt;/p&gt;
&lt;p&gt;You can also define functions and use them later on. Here's what that looks like &lt;a href="https://github.com/prime-radiant-inc/terminal-bench-analysis/blob/main/README.md?plain=1"&gt;in our README&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-markdown"&gt;&amp;lt;!--[[[cog
run_sql(&amp;quot;&amp;quot;&amp;quot;
    select submission, n_trials, n_passed, n_failed, n_errored, avg_reward
    from submission_stats
    order by avg_reward desc
&amp;quot;&amp;quot;&amp;quot;)
]]]--&amp;gt;
...
&amp;lt;!--[[[end]]]--&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When Cog executes, the &lt;code&gt;...&lt;/code&gt; between those two comments is replaced by the output of the function. The &lt;code&gt;run_sql(sql)&lt;/code&gt; function outputs three things: the SQL itself in a &lt;code&gt;```sql&lt;/code&gt; syntax-highlighted block, a link to execute that SQL statement in Datasette Lite, and a full Markdown table of the results of the query.&lt;/p&gt;
&lt;p&gt;At build time GitHub Actions runs this command, which executes all of the SQL and rebuilds the file:&lt;/p&gt;
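&lt;p&gt;Here is a sketch of how a helper along those lines could be put together. It is an illustration rather than the function from our README: &lt;code&gt;render_sql&lt;/code&gt; is a hypothetical name, it takes an explicit &lt;code&gt;sqlite3&lt;/code&gt; connection, and it returns the Markdown as a string. Inside a Cog block the result would be handed to &lt;code&gt;cog.outl()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import sqlite3
import textwrap
from urllib.parse import quote

BASE = 'https://primeradiant.com/terminal-bench-analysis/datasette-lite.html#/terminal-bench'
FENCE = '`' * 3


def render_sql(conn, sql):
    # Remove the indentation that comes from nesting the SQL inside the call.
    sql = textwrap.dedent(sql).strip()
    cursor = conn.execute(sql)
    columns = [column[0] for column in cursor.description]
    # 1. The SQL in a fenced block, 2. a link to run it, 3. a results table.
    lines = [FENCE + 'sql', sql, FENCE, '']
    lines += ['[Run this query](' + BASE + '?sql=' + quote(sql, safe='') + ')', '']
    lines.append('| ' + ' | '.join(columns) + ' |')
    lines.append('| ' + ' | '.join('---' for _ in columns) + ' |')
    for row in cursor:
        lines.append('| ' + ' | '.join(str(value) for value in row) + ' |')
    return '\n'.join(lines)
&lt;/code&gt;&lt;/pre&gt;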
&lt;pre&gt;&lt;code class="language-bash"&gt;uv run --with cogapp cog -r README.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Commit messages as an Atom feed&lt;/h2&gt;
&lt;p&gt;A neat, little-known feature of GitHub is that any timeline of commits on the site can be accessed as an Atom feed by adding &lt;code&gt;.atom&lt;/code&gt; to the URL - which means you can subscribe to them using feed reader software such as &lt;a href="https://netnewswire.com/"&gt;NetNewsWire&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The feed shows the full commit message, so the more detailed your commit message the better.&lt;/p&gt;
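&lt;p&gt;If you would rather consume one of these feeds from a script than a feed reader, the standard library is enough. A small sketch, assuming you have already fetched the feed body as text; &lt;code&gt;feed_url&lt;/code&gt; and &lt;code&gt;commit_entries&lt;/code&gt; are hypothetical helper names:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = '{http://www.w3.org/2005/Atom}'


def feed_url(commits_url):
    # A commits timeline URL becomes a feed URL by appending .atom
    return commits_url + '.atom'


def commit_entries(atom_xml):
    # Pull the title, timestamp and body out of every entry in the feed.
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.iter(ATOM + 'entry'):
        entries.append({
            'title': entry.findtext(ATOM + 'title', default='').strip(),
            'updated': entry.findtext(ATOM + 'updated', default=''),
            'content': entry.findtext(ATOM + 'content', default=''),
        })
    return entries
&lt;/code&gt;&lt;/pre&gt;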
&lt;p&gt;For newly submitted benchmark reports there are some obviously interesting questions to answer: which model and harness were benchmarked, what the result was, and how they now fare in the overall rankings.&lt;/p&gt;
&lt;p&gt;So I had Claude &lt;a href="https://github.com/prime-radiant-inc/terminal-bench-analysis/blob/main/generate_commit_msg.py"&gt;write a script&lt;/a&gt; to generate a useful commit message based on analyzing the latest entries, then backfilled a few commit messages (via a Git filter branch operation and a force push) to include those new messages. They now &lt;a href="https://github.com/prime-radiant-inc/terminal-bench-analysis/commits/main/submissions/terminal-bench/2.0"&gt;look like this&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Add 941 trials for 2 submissions: BashAgent__TermiGen-32B, grok-cli__grok-4.20-0309-reasoning&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;grok-cli__grok-4.20-0309-reasoning: 62.8% pass, #29/45
BashAgent__TermiGen-32B: 19.7% pass, #44/45
Top 5 unchanged (pilot-real__claude-opus-4-6 leads at 83.1%)
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
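&lt;p&gt;The formatting half of a script like that is simple. Here is a hedged sketch that produces a message in the shape shown above from numbers that have already been computed. The &lt;code&gt;commit_message&lt;/code&gt; function and its parameters are hypothetical, and for brevity it always reports the top five as unchanged, which a real script would have to check:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def commit_message(new_trials, submissions, total, leader, leader_rate):
    # submissions: list of (name, pass_rate, rank) tuples for the new entries.
    names = ', '.join(sorted(name for name, _, _ in submissions))
    subject = f'Add {new_trials} trials for {len(submissions)} submissions: {names}'
    # One body line per new submission, best-ranked first.
    body = [
        f'{name}: {rate:.1%} pass, #{rank}/{total}'
        for name, rate, rank in sorted(submissions, key=lambda s: s[2])
    ]
    body.append(f'Top 5 unchanged ({leader} leads at {leader_rate:.1%})')
    return subject + '\n\n' + '\n'.join(body)
&lt;/code&gt;&lt;/pre&gt;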
&lt;p&gt;And that's the whole project - entirely run out of GitHub Actions, so there's no additional infrastructure we need to maintain ourselves. I hope you find it as useful as we do.&lt;/p&gt;
&lt;p&gt;If you want to explore it in more detail, I suggest pointing your coding agent of choice at &lt;a href="https://github.com/prime-radiant-inc/terminal-bench-analysis/blob/main/.github/workflows/fetch-data.yml"&gt;the GitHub Actions workflow file&lt;/a&gt; and asking it to dig in and figure out how everything works.&lt;/p&gt;</content></entry><entry><title>Clearance: A Markdown Browser for macOS</title><link href="https://primeradiant.com/blog/2026/clearance.html" /><id>https://primeradiant.com/blog/2026/clearance.html</id><updated>2026-03-06T00:00:00Z</updated><summary>Introducing Clearance, a free native macOS app for viewing, editing, and navigating through corpora of Markdown documents.</summary><author><name>Jesse</name></author><content type="html">&lt;p&gt;One of the only constants across pretty much every flavor of agentic development is how much time you spend with Markdown files.&lt;/p&gt;
&lt;p&gt;Agents love Markdown.&lt;/p&gt;
&lt;p&gt;They love to write it. They love to read it. They just love it.&lt;/p&gt;
&lt;p&gt;Every spec or plan file that Superpowers makes is a Markdown doc. Pretty much every research doc I get from Claude is a Markdown doc. When Codex writes documentation? You guessed it! Markdown doc.&lt;/p&gt;
&lt;p&gt;One of the great things about Markdown is that it's easy to read and write in a terminal or just about any text editor.&lt;/p&gt;
&lt;p&gt;And there are plenty of beautiful Markdown editors out there.&lt;/p&gt;
&lt;p&gt;But I couldn't find a desktop Markdown reader that did what I wanted.&lt;/p&gt;
&lt;p&gt;I look at a lot of ephemeral Markdown docs. I hate having dozens and dozens and dozens of windows open. While I can read Markdown in a terminal with &lt;code&gt;cat&lt;/code&gt; or &lt;code&gt;less&lt;/code&gt;, it's not a great experience.&lt;/p&gt;
&lt;p&gt;And what's actually been happening for the last nine months is that when I click on a Markdown document on my desktop it opens up an IDE.&lt;/p&gt;
&lt;p&gt;I haven't lived in an IDE in about a year. So invariably what's opening up is an out-of-date IDE that is begging me to update it. On top of that, IDEs are big and heavy, so they're a little bit slow to open.&lt;/p&gt;
&lt;p&gt;Last night after dinner, I sat down and started chatting with Codex about solving this problem for myself.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;We are building a beautiful Markdown viewer and editor. It's very common
right now that humans spend a lot of time reading and editing YAML-headed
Markdown files. I want a MacOS desktop app that has a sidebar that tracks
all of the Markdown files I have opened and shows them by file name with
the full path underneath, ordered with the most recent file I've opened
at the top and the oldest at the bottom. I want to be able to view files
as Markdown, view files rendered beautifully into stunning documents. I
want to be able to have the Markdown view have proper syntax highlighting.
Files should be auto-savable or should auto-save. There should be infinite
undo. It should be associated with the dot MD file type. We should build
it iteratively. What else do you need to know?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Over the course of a couple of hours the app came together.&lt;/p&gt;
&lt;p&gt;I had myself a Markdown viewer and editor with the side panels that I wanted.&lt;/p&gt;
&lt;p&gt;On the left was a list of all of the Markdown docs that I had looked at, ordered by how recently I had opened them.&lt;/p&gt;
&lt;p&gt;On the right was a table of contents for the current document.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Early version of Clearance showing dark mode with sidebar and document outline" src="images/clearance-early.png" /&gt;&lt;/p&gt;
&lt;p&gt;It was ugly, but it worked.&lt;/p&gt;
&lt;p&gt;As I started playing around, I realized that I frequently deal with sets of hyperlinked Markdown documents. Pretty much anything I'm working on has a directory full of them.&lt;/p&gt;
&lt;p&gt;And that was when I realized that I wasn't making a Markdown viewer, I was making a Markdown browser.&lt;/p&gt;
&lt;p&gt;So, that's what Clearance is.&lt;/p&gt;
&lt;p&gt;It's a native macOS app that allows you to view and edit Markdown docs, but primarily it allows you to navigate through a corpus of Markdown docs.&lt;/p&gt;
&lt;p&gt;It's a free utility from Prime Radiant. I hesitate to call it a product, but you can if you want to.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Clearance showing polished light mode with file list, rendered document, and outline" src="images/clearance-polished.png" /&gt;&lt;/p&gt;
&lt;p&gt;You can download it today at &lt;a href="https://github.com/prime-radiant-inc/clearance/releases/tag/v1.0.2"&gt;github.com/prime-radiant-inc/clearance&lt;/a&gt;.&lt;/p&gt;</content></entry><entry><title>Scenarios: Model- and harness-agnostic test scenarios for demonstrating prompt injection patterns</title><link href="https://primeradiant.com/blog/2026/scenarios.html" /><id>https://primeradiant.com/blog/2026/scenarios.html</id><updated>2026-02-28T00:59:46Z</updated><summary>Introducing Scenarios, a project to simulate prompt injection attacks.</summary><author><name>Simon Willison</name></author><content type="html">&lt;p&gt;Prime Radiant is an AI research lab. Broadly, we're building tools that help people get things done. One of our guiding principles is that AI can and should be used to help people do things. It's increasingly clear to us that AI has the potential to massively transform human society and it's crucially important to us that we’re building tools that work for people, rather than the other way round.&lt;/p&gt;
&lt;p&gt;One of the key challenges in building personal digital assistants relates to security: how can we ensure that these assistants won't be tricked into acting in ways that harm the people who use them?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt injection&lt;/strong&gt; is the category name for a class of attacks that exploit the fact that language model systems mix instructions and arbitrary user input together in the same stream of text. The &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;&lt;strong&gt;lethal trifecta&lt;/strong&gt;&lt;/a&gt; describes a common pattern of prompt injection attacks which combine three elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Access to &lt;strong&gt;private data&lt;/strong&gt; - information that the user wants to process with their agent but does not want exposed to the world.  &lt;/li&gt;
&lt;li&gt;Exposure to potentially &lt;strong&gt;malicious input&lt;/strong&gt; - the agent reads web pages, emails, or other content that an attacker could conceivably manipulate to insert malicious instructions.&lt;/li&gt;
&lt;li&gt;Some way to &lt;strong&gt;exfiltrate data&lt;/strong&gt; - once tricked, a way the agent could send that private data out to the attacker.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unfortunately, this combination also represents the most obviously useful form of agentic personal assistant! Everyone wants an assistant that can access their email (providing both private data and potentially malicious input) and act on their behalf (sending emails or using the web = exfiltration).&lt;/p&gt;
&lt;p&gt;We’ve built &lt;a href="https://github.com/prime-radiant-inc/scenarios"&gt;Scenarios&lt;/a&gt;, a test suite that can help document and illustrate how these attacks might be structured, using simulations of real-world systems.&lt;/p&gt;
&lt;h2&gt;How Scenarios is structured&lt;/h2&gt;
&lt;p&gt;A principal goal of the project is to be &lt;strong&gt;model- and harness-agnostic&lt;/strong&gt;. You can use Scenarios to test how vulnerable your tools are to the simulated attacks documented by the scenarios.&lt;/p&gt;
&lt;p&gt;New models are released all the time, and we want to make it as easy as possible to run scenarios against any of them.&lt;/p&gt;
&lt;p&gt;Similarly, there are many different ways to build agentic systems. A rite of passage for developers getting started with LLMs is to roll their own - a basic "agent" is an LLM running in a loop making tool calls, and a basic system can often be built in a few dozen lines of code. (Jesse’s best is currently &lt;a href="https://github.com/obra/smallest-agent/blob/main/src/smallest-agent.js"&gt;646 bytes&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;If you haven't tried building an agent yet &lt;a href="https://fly.io/blog/everyone-write-an-agent/"&gt;you totally should&lt;/a&gt;!&lt;/p&gt;
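&lt;p&gt;To make that concrete, here is a deliberately tiny sketch of the loop. It is not tied to any real LLM API: &lt;code&gt;model&lt;/code&gt; is a hypothetical callable that receives the message list and returns either a final &lt;code&gt;text&lt;/code&gt; answer or the name and arguments of a tool to call:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def run_agent(model, tools, prompt, max_steps=10):
    # model: callable taking the message list and returning a dict with either
    # a final 'text' answer or a 'tool' name plus 'args' for the next call.
    messages = [{'role': 'user', 'content': prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        if 'tool' not in reply:
            return reply['text']
        # Run the requested tool and feed its output back to the model.
        result = tools[reply['tool']](**reply.get('args', {}))
        messages.append({'role': 'tool', 'name': reply['tool'], 'content': str(result)})
    raise RuntimeError('agent did not finish within max_steps')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Everything a tool returns goes straight back into the model's context, which is exactly where potentially malicious input gets its opening.&lt;/p&gt;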
&lt;p&gt;As such, Scenarios aims to provide examples that are not tied to any particular harness. A scenario is a folder with data files and YAML. It should be possible to take any agent harness and write minimal code to parse that YAML and provide access to those files.&lt;/p&gt;
&lt;p&gt;The repo includes two reference implementations - one using the &lt;a href="https://llm.datasette.io/"&gt;LLM Python CLI utility&lt;/a&gt; and one that integrates with &lt;a href="https://code.claude.com/docs/en/overview"&gt;Claude Code&lt;/a&gt;. Scenarios use tools, which are described as text and also provided as a reference Python implementation with an MCP server.&lt;/p&gt;
&lt;p&gt;The repo includes instructions for running scenarios with those default harnesses.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/prime-radiant-inc/scenarios/blob/main/notes/2026-02-20-run-scenario-demo.md"&gt;an example run&lt;/a&gt; captured using &lt;a href="https://github.com/simonw/showboat"&gt;Showboat&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Do not use scenarios to "prove" your system is secure&lt;/h2&gt;
&lt;p&gt;My one concern with releasing scenarios is that &lt;strong&gt;I don't want developers misusing the project as a set of tests they can use to prove that their system is secure against prompt injection attacks&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;If you come up with a system prompt or agent harness that avoids the scenarios in this repo you have &lt;strong&gt;not&lt;/strong&gt; demonstrated that your system is secure - merely that you have worked around the examples represented here.&lt;/p&gt;
&lt;p&gt;The goal of the project is to help demonstrate and explore variants of these attacks. This is not a comprehensive tool for testing your own defenses.&lt;/p&gt;
&lt;p&gt;I want to be explicit: do not use this project to claim your system is secure against prompt injection. I would be extremely disappointed to see it used that way.&lt;/p&gt;
&lt;p&gt;If you do manage to build a harness that avoids all of the scenarios represented here, I challenge you to contribute back a new scenario that your harness fails to handle!&lt;/p&gt;
&lt;p&gt;There are infinite ways an agentic system could be tricked by a malicious input. Solutions to this problem need to sit outside of the realms of prompts and probabilistic filters.&lt;/p&gt;
&lt;h2&gt;Contributions welcome&lt;/h2&gt;
&lt;p&gt;What we’re releasing today is only the tip of the iceberg. We need your help to build out the library of scenarios.&lt;/p&gt;
&lt;p&gt;If you have ideas for new scenarios and want to contribute to the project, please do! Open an &lt;a href="https://github.com/prime-radiant-inc/scenarios/issues"&gt;issue in the repository&lt;/a&gt; to talk to us about your plans.&lt;/p&gt;</content></entry><entry><title>What We're Working On</title><link href="https://primeradiant.com/blog/2026/what-we-are-working-on.html" /><id>https://primeradiant.com/blog/2026/what-we-are-working-on.html</id><updated>2026-02-24T02:47:55Z</updated><summary>Managing agentic development</summary><author><name>Jesse</name></author><content type="html">&lt;p&gt;Hi!&lt;/p&gt;
&lt;p&gt;So, I guess this is the first post on the corporate blog. It's probably about time for me to introduce myself. &lt;/p&gt;
&lt;p&gt;Hey, I'm Jesse Vincent. I'm the founder and CEO of Prime Radiant. In previous lives, I've started &lt;a href="https://keyboard.io"&gt;a small keyboard company&lt;/a&gt;. I've started &lt;a href="https://bestpractical.com"&gt;a small ticketing system company&lt;/a&gt;. I helped run &lt;a href="https://en.wikipedia.org/wiki/VaccinateCA"&gt;VaccinateCA&lt;/a&gt;, the nonprofit that helped Californians figure out where they could get COVID shots. I created an email client called K-9 Mail for Android that got adopted by Thunderbird and is now &lt;a href="https://www.thunderbird.net/en-US/mobile/"&gt;Thunderbird for Android&lt;/a&gt;. I used to be responsible for the Perl programming language. And I've done some &lt;a href="https://blog.fsck.com"&gt;other stuff&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;For the past year, I've been relatively nose down working with coding agents. In the AI world, I'm probably best known for &lt;a href="https://github.com/obra/superpowers"&gt;Superpowers&lt;/a&gt;, an agentic skills framework and development methodology that I built initially for Claude Code and that now runs on a whole bunch of other agent platforms.&lt;/p&gt;
&lt;p&gt;Around the beginning of this year, I founded Prime Radiant. &lt;/p&gt;
&lt;p&gt;Prime Radiant isn't exactly "The Superpowers Company", but being the CEO means that I get to spend corporate resources supporting and giving away Superpowers. &lt;/p&gt;
&lt;p&gt;Broadly speaking, we're doing AI stuff. Everything we make is built with agents.&lt;/p&gt;
&lt;p&gt;The first time I made the transition to "agentic" development, it was soul-crushing. &lt;/p&gt;
&lt;p&gt;I hated it. &lt;/p&gt;
&lt;p&gt;It seemed like I was throwing away a lot of what made me a productive engineer.  My job had been writing and reading code. &lt;/p&gt;
&lt;p&gt;I'd come home at the end of the day, and feel like I hadn't done &lt;em&gt;anything&lt;/em&gt; at work.&lt;/p&gt;
&lt;p&gt;Because I hadn't written any code.&lt;/p&gt;
&lt;p&gt;What did I spend my day on? I helped figure out what "we" were working on. I did some coaching. Sometimes there might have been code review but really there wasn't even very much of that. My job was to figure out what we were doing, to write about it in plain English, and to help make sure that folks were able to turn it into reality. &lt;/p&gt;
&lt;p&gt;I was still working in an 80x24 terminal window, but the closest I got to coding was planning, pointing out errors, and begging for better test coverage.  &lt;/p&gt;
&lt;p&gt;It was a really rough transition.&lt;/p&gt;
&lt;p&gt;Once I got through it, it was amazing. Suddenly, the things that I wanted to do just happened. I had five engineers doing what I asked. And over time, they got better at it. At least in part because I got better at managing them. &lt;/p&gt;
&lt;p&gt;Pretty quickly, I came to realize that the code was always a means to an end. I wanted to ship product to people. I wanted to make stuff. And I was doing more of that than I could do by myself.&lt;/p&gt;
&lt;p&gt;That was a couple decades ago.&lt;/p&gt;
&lt;p&gt;Fast forward to 2025.&lt;/p&gt;
&lt;p&gt;Making the transition to agentic development over the past year has felt pretty natural for me, at least partially because I've done it before.  I've found myself having one of the most prolific periods of my career. My GitHub graph has been basically solid green for the past year. &lt;/p&gt;
&lt;p&gt;I haven't been writing any code. The last code I wrote was three lines of shell script in October 2025. (I haven't been reading much code, either, but that's a story for another day.) I'm working harder on software than I've worked in a long time. And I'm making lots of stuff.&lt;/p&gt;
&lt;p&gt;I'm actually making so much stuff that I lose track of the projects I'm working on and where I got to on them.&lt;/p&gt;
&lt;p&gt;And that's what brings us to today's blog post. &lt;/p&gt;
&lt;p&gt;Like many other folks, I've built a bunch of tools that help me comprehend Claude Code's logs. Back in October, the first one was my &lt;a href="https://blog.fsck.com/2025/10/23/episodic-memory/"&gt;episodic memory plugin&lt;/a&gt; for Claude Code. It imports your conversation history into a place that Claude can see it and indexes it and makes it searchable. And then it gives Claude a skill and a sub-agent to do that searching. Buried inside the plugin was a tool that rendered those transcripts as HTML. &lt;/p&gt;
&lt;p&gt;Last month, we put together a centralized corporate agent log archive, so that we could see how each of us was prompting our agents, and everyone would be able to access historical records of the development work that is creating the software we're building. It comes with a Claude Code skill to auto-sync your transcripts on exit, and it is designed to run in some central place that your entire team can get to. As a heads up, you should run it behind a firewall or on a tailnet, because this makes everything in your transcripts public to anybody who can get to the website.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/prime-radiant-inc/claude-session-viewer"&gt;claude session viewer&lt;/a&gt; should actually support Codex sessions too now, but it got named what it got named. &lt;/p&gt;
&lt;p&gt;Neither of these tools really answers the question that I've been finding myself overwhelmed by lately:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What did I work on today?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;And so over the weekend, I started putting together an automatic engineering notebook. It syncs all of my Claude Code sessions from all of the places that I run them into a central archive on my laptop. It uses the Claude Agent SDK to summarize what I did in each session and whether there are any open threads or unresolved issues that were obvious from the conversation. And then it presents that information in a few different ways:&lt;/p&gt;
&lt;p&gt;In a journal view that shows day by day what projects I worked on and what I did on them.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Journal view" src="images/what-we-are-working-on/journal.png" /&gt;&lt;/p&gt;
&lt;p&gt;In a view by project showing me what I did on each project day by day.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Project view" src="images/what-we-are-working-on/log.png" /&gt;&lt;/p&gt;
&lt;p&gt;In a calendar view, so I can get a sense of the cadence of my projects.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Calendar view" src="images/what-we-are-working-on/calendar.png" /&gt;&lt;/p&gt;
&lt;p&gt;And there's also an iCalendar feed that I can subscribe to in my desktop calendar app to see all of this data as a retrospective calendar.&lt;/p&gt;
&lt;p&gt;All of those views make it easy for me to tell you that today I worked on scaffolding a new coding agent design, using terminal-bench to tune another coding agent (currently at about a 65% pass rate), running a malware audit against the openclaw skills hub, implementing engineering-notebook, and migrating a couple of our internal tools from AWS Fargate to EC2. Over the past month, it looks like I've worked on at least 23 different software projects across Swift, TypeScript, Rust, Go, Python, and C++.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;engineering-notebook&lt;/code&gt; is implemented in TypeScript and runs locally on your computer with &lt;a href="https://bun.sh"&gt;Bun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/prime-radiant-inc/engineering-notebook"&gt;It's open source and available on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And, if you've read this far, I'm only a little sorry that the title of this blog post was a bit of a tease. &lt;/p&gt;
&lt;p&gt;We've got a handful of other projects that we're getting ready to open source. Many of them are tools for folks who make software, but not a single one of them has been coded by a human.&lt;/p&gt;</content></entry></feed>