← back to guides

Concept: built-in Get Your Shit Done

We love local LLMs. We built codehamr to get the most out of the ~30B class on hardware you already own. This guide is about the one mechanism that makes that work.

The problem

You ask a local coding agent to fix a failing test. It edits a file, says "Done, test passes now." You run the test. Still red.

This is the default failure mode of ~30B LLMs. They pattern match what "finished" sounds like and write it. That single behavior is what makes most local agents hard to trust.

The fix

The agent has to show the receipt.

Before claiming done, the agent must paste a real chunk of terminal output from a passing check. codehamr keeps a log of every check it ran and compares byte for byte. No matching receipt, no done.

This rule is GYSD, short for Get Your Shit Done. Mechanical. The agent cannot talk its way past it.

Example

Honest agent

The agent runs pytest test_login.py and gets this output.

===== 1 passed in 0.34s =====

codehamr saves that line in its log. The agent claims done and quotes the line as receipt.

===== 1 passed in 0.34s =====

codehamr finds the string in the log. Accepted.

Lying agent

The agent skips the test and claims done with this evidence.

tests pass now

codehamr searches the log. Nothing matches. Rejected. The agent reads the rejection and has to actually run a test before trying again.

Why local LLMs need this

Every LLM is prone to faking completion, top tier rarely, local ~30B much more often. GYSD makes the loop more robust on any model. No real terminal output, no done.

The win is biggest on local, where that robustness is what makes the ~30B class actually reliable.

Why codehamr is minimal everywhere else

Same reasoning. codehamr ships with no MCP, no subagents, no skill system, no router, no memory file. A ~30B LLM has the context for it. qwen3.6:27b runs fine at 256k. But on local, quality per token drops faster as length grows. Every byte of harness is a byte not spent on your code.

Lean is not aesthetic. It is simple on purpose.

What GYSD does not solve

GYSD proves the agent ran a passing check. It does not prove the check was meaningful. A bad test still gets a green checkmark.

That is why the system prompt forbids trivial checks like echo done and pushes the red then green pattern on bug fixes, and why you still watch every step live in your terminal. Three layers stack, no single one is magic.

The model has to prove it. Not promise it.