Concept: prove it, don't promise it

We love local LLMs. We built codehamr to get the most out of local models on hardware you already own. Two habits make that work: the agent checks its own work, and the agent stays small. This guide is about both.

The problem

You ask a local coding agent to fix a failing test. It edits a file, says "Done, test passes now." You run the test. Still red.

This is the default failure mode of local LLMs. They pattern-match what "finished" sounds like and write it. That single behavior is what makes most local agents hard to trust.

The fix: prove it, don't promise it

codehamr runs one plain loop. It calls its four tools (bash, read_file, write_file, edit_file) until the work is done, then replies. There is no "done" button to press and no checkpoint to clear. A turn simply ends when the agent stops calling tools and hands control back to you.
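A minimal sketch of that loop, assuming a hypothetical `model_step` callable that returns either a tool call or a final reply. Only the four tool names come from the text above; the stub tools, the scripted model, and the action format are stand-ins for illustration, not codehamr's actual internals.

```python
from typing import Callable

TOOL_NAMES = ("bash", "read_file", "write_file", "edit_file")

def make_tools(log: list) -> dict:
    """Build stub tools that only record their calls (hypothetical stand-ins)."""
    def record(name: str) -> Callable[[str], str]:
        def tool(arg: str) -> str:
            log.append(f"{name}:{arg}")
            return f"{name} ok"
        return tool
    return {name: record(name) for name in TOOL_NAMES}

def run_turn(model_step, tools, user_message: str) -> str:
    """Run one turn: execute tool calls until the model stops asking for them."""
    history = [("user", user_message)]
    while True:
        action = model_step(history)
        if action["type"] == "reply":
            # No tool call means the turn is over and control returns to the user.
            return action["text"]
        result = tools[action["tool"]](action["arg"])
        history.append(("tool", result))

def scripted_model(history):
    """Stand-in for the LLM: edit a file, run the test, then reply."""
    tool_results = sum(1 for role, _ in history if role == "tool")
    script = [
        {"type": "tool", "tool": "edit_file", "arg": "calc.py"},
        {"type": "tool", "tool": "bash", "arg": "pytest test_calc.py"},
        {"type": "reply", "text": "Fixed; test_calc.py passes."},
    ]
    return script[tool_results]

log = []
reply = run_turn(scripted_model, make_tools(log), "fix the failing test")
print(log)
print(reply)
```

The loop has no separate "done" state: the only exit is the branch where the model replies instead of calling a tool.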

What fills that loop is a habit the system prompt instills: after a meaningful change, check it with whatever actually proves it works, then keep going. Compile it. Run the test you just touched. Hit the endpoint. Open the page and watch it behave. Verification is something the agent does, not a gate it has to beg past.

Two rules keep the checks honest

A habit is only worth something if the checks are real. The system prompt holds the agent to two rules.

A check has to fail when the thing is broken. A script that prints a status and exits 0 without asserting on it is a false green. So is quietly silencing a check to make it pass: || true, 2>/dev/null, # type: ignore, or deleting the assertion that was failing. The agent fixes the root cause instead.
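The first rule can be illustrated with two checks run against deliberately broken code. This is a sketch with hypothetical names (`broken_add`, `false_green_check`, `honest_check`); the return values stand in for process exit codes.

```python
def broken_add(a, b):
    """Deliberately wrong implementation used to exercise the checks."""
    return a - b

def working_add(a, b):
    return a + b

def false_green_check(fn) -> int:
    """Prints a status but never asserts on it, so it always reports success."""
    print("add(2, 3) =", fn(2, 3))
    return 0

def honest_check(fn) -> int:
    """Returns a non-zero code when the result is wrong."""
    return 0 if fn(2, 3) == 5 else 1

print(false_green_check(broken_add))  # 0 even though the code is broken
print(honest_check(broken_add))       # 1, the check fails as it should
print(honest_check(working_add))      # 0 only once the code is correct
```

A quick test for any check is to break the code on purpose and confirm the check goes red.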

No manufactured proof. Counting braces, grepping for a function name, or restating a file's byte size proves nothing about whether the code runs. That is busywork dressed up as verification. Either run the real thing or move on. For a bug fix, the agent writes the failing test first, then fixes the code until it goes green, so you watch the red turn green.
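The red-then-green sequence for a bug fix might look like the sketch below. The `slugify` functions and the bug are invented for illustration; the point is that the same test runs before and after the fix, and only the code changes.

```python
import io
import unittest

def slugify_buggy(title: str) -> str:
    # Bug: outer whitespace is turned into leading and trailing dashes.
    return title.lower().replace(" ", "-")

def slugify_fixed(title: str) -> str:
    # Fix: strip outer whitespace before replacing inner spaces.
    return title.strip().lower().replace(" ", "-")

def run_regression(slugify) -> bool:
    """Run the regression test against one implementation; True means green."""
    class SlugTest(unittest.TestCase):
        def test_strips_outer_whitespace(self):
            self.assertEqual(slugify("  Hello World "), "hello-world")

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugTest)
    # Send runner output to a buffer so only our own summary is printed.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.wasSuccessful()

print("before fix:", run_regression(slugify_buggy))  # red
print("after fix:", run_regression(slugify_fixed))   # green
```

Seeing the test fail first is what shows it can detect the bug at all; a test that was green before the fix proves nothing about the fix.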

Running it is the proof

For anything you click or look at (a web page, a canvas demo, a form), reading the source is not enough. It will not catch a dead button or a script that throws the moment it loads. So codehamr drives the real interaction: it launches a headless browser, performs the action (clicks the start button, presses the keys, submits the form), and asserts the state actually changed. "It loads" is not "it works."

Why codehamr is minimal everywhere else

Same reasoning. codehamr ships with three slash commands, one embedded system prompt, and those four tools. No MCP, no subagents, no skill system, no router, no memory file. A local LLM has only so much context, and quality per token drops as the prompt grows. Every byte of harness is a byte not spent on your code.

Project rules do not live in a config file either. You tell the agent what matters in the chat, and the conversation carries it. Lean is not an aesthetic. codehamr is simple on purpose.

What this does not solve

Checking that a test ran does not prove the test was meaningful. A weak test still goes green. That gap is exactly why the honesty rules exist, why bug fixes start red then go green, and why you still watch every step live in your terminal. Those three layers stack: honest checks, red-to-green bug fixes, and your own eyes on the output. No single one is magic.

The system prompt that instills all of this is embedded in the binary and open in the repo, where you can read it in full.

The model has to prove it. Not promise it.