
Hearthstone: I Asked an AI for Five Prompts and a Sentient House Won


I'm Matt. I recently joined Prime Radiant, where we build things with AI agents rather than writing code directly. Jesse's written about what that transition looks like.


We have a handful of household documents in Google Docs. Our house manual, emergency contacts, childcare, dog care. They started out targeted and approachable but over time they've grown to answer questions and address issues that have come up. How do you get on the WiFi? Do I use light switches or HomeKit? Sonos? Do I need to do anything to the pool if it rains? Where are the dogs' mortal enemies so I can avoid them on a walk?

As useful as they are, they're unwieldy. I'm sure our house guests feel some way when we hand them the Google equivalent of a binder full of documents. The super obvious 2024 solution is: use an LLM! Which we did and it was pretty good. I put all my docs in a Claude Project and asked it questions about them. Good enough, works for me!

Or so I thought. It wasn't clear how to make it available to anyone else. I could ask everyone to set up Claude and share the Project, maybe? Or I could add them to the Docs as usual and suggest they import them into their harness of choice?

What else could we try? Let's combine the trends of 2009+2024: Build an App! Now, let's be clear: I know asking our guests to install a bespoke app is a wild leap. But one of the most exciting parts of software development in 2026 is how much easier it is to experiment and see what works. How are the ergonomics? Does it feel better? Does it address the problem? Do other folks have the same problem?


Now, to build this app I'd need some mix of: iOS, a backend, AI APIs, embeddings, vector search, storage, RAG, and maybe an evaluation framework. It's not my first rodeo. Most of my career has been diving into the unknown. I approach most ambiguous projects by finger painting — I've heard a few folks call it sketching. Get all the parts connected end-to-end as quickly as possible. I'm not committed to any of the decisions. I avoid going deeper than necessary at any stage. Think breadth-first rather than depth-first. You could call it prototyping, but folks almost always plan to throw those away. I don't assume that'll happen. Finger painting isn't about making high art, but it's not immediately trash either. You can always iterate on it.

Working with AI agents feels like a natural extension. I used a beta build of the next generation of superpowers:brainstorm — we'll be releasing it as a research preview soon — to seed the project. Start with the problem, talk through requirements and constraints, iterate on UX, then arrive at a detailed spec. Hand the spec to your favorite harness and you're off to the races. Agent distributes the work to subagents — iOS, backend, storage, deployment. If you've managed engineering teams, you've seen this pattern. PRD to plans to tasks to implementation. Same thing, robots.

I'm not going to touch on all the parts, but since we're an AI company and I'm learning about AI by building with AI — RAG and prompt engineering with evals are worth getting into.


I've used Postgres full-text search extensively over the years. Given a table of documents, you'd have a tsvector column derived from another column in the same row, optionally with some transformation applied. Vector search also uses a column, so I imagined the ergonomics would be very similar to tsearch. Not so much. It came up during the spec phase: I was asked whether I had opinions on what approach we should use for chunking documents. Chunking? Approach?

Cool. Cool. Don't go deep. Let the AI choose and revisit.

Could I skip RAG entirely and just put all the documents in the prompt? For thirty pages of household docs, probably. Totally. But cost scales with context. I've spent decades building and scaling SaaS services. I've become modestly thrifty by default. Can we reduce COGS? Yes. RAG lets us retrieve just the chunks of our documents that relate to the question.

Our household docs are in Google Docs, exported as DOCX, and then converted to Markdown via Pandoc. Markdown is great. We get formatting that's easy to work with, some semantic structure from the headings, and agents are quite adept at reading it.
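The conversion step is a single Pandoc invocation; the filename here is illustrative:

```shell
# Convert an exported Google Doc to GitHub-flavored Markdown.
pandoc house-manual.docx -f docx -t gfm -o house-manual.md
```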

My partner and I accidentally learned something about how we use Google Docs. We both have a tendency to use bold text where we could have used a heading. In a visual document this doesn't matter much but it degraded the efficacy of our chunking, so now we detect that pattern and promote it to a heading before the chunker runs. The chunker splits on major headings with a soft limit around 500 tokens. Tables stay whole. Tiny fragments get merged into their neighbors.
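A minimal sketch of that promotion step, assuming the bold pseudo-heading sits alone on its line. The function name and heading level are illustrative, not Hearthstone's actual code:

```typescript
// Promote lines that are entirely bold (e.g. "**Pool Care**") to
// headings so the chunker can split on them. Lines that merely
// contain bold text are left alone.
function promoteBoldHeadings(markdown: string): string {
  return markdown
    .split("\n")
    .map((line) => {
      const m = line.trim().match(/^\*\*([^*]+)\*\*$/);
      return m ? `### ${m[1]}` : line;
    })
    .join("\n");
}
```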

Where it got fun (more fun?) was what Anthropic calls Contextual Retrieval. When you embed a chunk for search, you don't just embed the raw text — you prepend the document title and section path. A chunk about feeding schedules from the childcare doc carries Childcare > Daily Schedule > Meals into the vector space. The search gets structural context; the stored chunk stays clean. Anthropic's full version of this uses an LLM to generate context per chunk.
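In code, the lightweight version is just string assembly before embedding. A sketch with illustrative names (Hearthstone's actual types will differ):

```typescript
interface Chunk {
  docTitle: string;      // e.g. "Childcare"
  sectionPath: string[]; // e.g. ["Daily Schedule", "Meals"]
  text: string;          // the clean chunk body that gets stored
}

// The embedding input carries the structural context;
// the stored chunk does not.
function embeddingInput(chunk: Chunk): string {
  const path = [chunk.docTitle, ...chunk.sectionPath].join(" > ");
  return `${path}\n\n${chunk.text}`;
}
```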

Experimentation is virtually free, so I might have a Bob* take a swing at that in the near future and see how it plays out.

*All of my agents are Bobs. More on that in another post.


I've written countless prompts at this point. Every interaction is a prompt, and I try to learn from interactions to write better prompts. I've written them by hand and I've done it with agents. What I hadn't tried before was improving a prompt systematically. Right or wrong, it never felt like I had a sufficiently constrained problem to try. Household documents, on the other hand, are much easier to reason about. We can construct a battery of questions with acceptable and unacceptable answers. It's not trivial; there's some art in deciding what counts as a valid answer. But it's easier to score "Did it provide the WiFi SSID and password?" than "Does this program halt and catch fire?"

We start with a prompt that seems solid. Then we use that prompt to answer our battery of questions: thirty-nine of them, across six personas. We have a Judge, a separate LLM, that scores each answer against a checklist of facts and anti-hallucination facts. Finally an Optimizer, yet another LLM, runs in a loop: adjust the prompt, re-run the eval, keep the changes only if there's a Pareto improvement, meaning the score on at least one of the thirty-nine questions improved and none declined. Repeat until you're tired of sacrificing tokens to the machine.
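The acceptance rule is simple to state precisely. A sketch of the check, assuming per-question numeric scores from the Judge:

```typescript
// Accept a candidate prompt only if at least one question's score
// improved and no question's score declined (a Pareto improvement).
function isParetoImprovement(before: number[], after: number[]): boolean {
  if (before.length !== after.length) {
    throw new Error("score arrays must cover the same questions");
  }
  let improved = false;
  for (let i = 0; i < before.length; i++) {
    if (after[i] < before[i]) return false; // any regression rejects
    if (after[i] > before[i]) improved = true;
  }
  return improved;
}
```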

The optimizer is based on optimize_anything from Berkeley and Anthropic. The key detail: when a score declines, the optimizer sees why it failed, not just that it did. That feedback loop is what makes the iteration productive rather than random.


We had a solid, optimized prompt. And then an agent took some liberties with its assignment and rewrote the prompt to be generic and purged prior versions from git history. After the initial frustration, it was time to start again.

I've been doing a thing recently where I ask Claude for five versions of whatever, with at least two of them moderately unhinged. It seems especially beneficial for creative endeavors. It's not magic, but it helps Claude not sound so much like, well, Claude.

  • v1 — Knowledgeable Neighbor
  • v2 — Concierge
  • v3 — House Speaks
  • v4 — Drill Sergeant
  • v5 — Empathetic Completionist

Here's the thing about AI agents: they default to a detectable blandness. My partner and I can tell ChatGPT from Claude in blind excerpts — they each have a flavor, not bad, but detectable. The unhinged options aren't jokes. They're escape velocity from the default voice. You have to push further from the norm than feels comfortable to get something with actual character.

We did the same optimization loop (n=30) for each prompt. The "House Speaks" prompt did well. It wasn't an absolute winner but it reminded me of a character, Charlotte Manor, from Drew Hayes's Fred the Vampire Accountant series. Charlotte is a magical entity in the form of a house — magic manifested by a group of mages that somewhere along the way became sentient. Leaning a bit more into unhinged, I had my agent adapt the House Speaks prompt to incorporate some of Charlotte's essence.

And... Charlotte won. So now Charlotte is the prompt persona of Hearthstone.


Hearthstone is open source and available on GitHub. It's built with Bun and TypeScript, uses SQLite for storage and vector search, and runs on Fly.io. Can I set this up for my home? Yes! You can run it yourself.

Matt Windbrook
MTS