
Engineering Overview
promptproof is the small, boring tool that stops a prompt change from quietly breaking three things you weren't looking at. You define eval cases (input + expected + graders), run the suite, and save a baseline; on every change it diffs the new run against the baseline and gives a straight answer to 'did this make things worse?' — as a list of pass→fail regressions, not a feeling. It ships graders for exact match, substring includes, regex (with negate), JSON-shape, and an offline token-overlap scorer, and you can add your own in one function. The harness itself has zero runtime dependencies and runs offline in CI — you bring your model behind a single function, so testing the tool needs no API key.
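To make the grader list concrete, here is a minimal Python sketch of what such graders could look like as plain functions. The `(output, expected) -> bool` signature and every name below (`exact`, `includes`, `regex`, `json_shape`, `max_words`) are assumptions for illustration, not promptproof's actual API.

```python
import json
import re
from typing import Callable

# Hypothetical grader signature: (model output, expected value) -> pass/fail.
Grader = Callable[[str, object], bool]


def exact(output: str, expected: object) -> bool:
    """Pass when the output equals the expected string, ignoring outer whitespace."""
    return output.strip() == str(expected).strip()


def includes(output: str, expected: object) -> bool:
    """Pass when the expected substring appears anywhere in the output."""
    return str(expected) in output


def regex(pattern: str, negate: bool = False) -> Grader:
    """Build a grader that passes on a match, or on no match when negate=True."""
    compiled = re.compile(pattern)

    def grade(output: str, expected: object = None) -> bool:
        matched = compiled.search(output) is not None
        return not matched if negate else matched

    return grade


def json_shape(output: str, expected: object) -> bool:
    """Pass when the output is a JSON object holding each expected key with the expected type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not isinstance(expected, dict):
        return False
    return all(
        key in data and isinstance(data[key], typ) for key, typ in expected.items()
    )


def max_words(output: str, expected: object) -> bool:
    """A custom grader is one more function with the same signature."""
    return len(output.split()) <= int(expected)
```

Because each grader is a pure function of its arguments, it can be unit-tested with hard-coded strings and no model call, for example `regex(r"\bsorry\b", negate=True)("Here is the answer.")`.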
The Problem
You change a prompt to fix one thing and three others quietly break — and you find out from a customer. LLM output is non-deterministic enough that ad-hoc manual checks miss regressions, and most teams have no baseline to diff against, so 'is this better or worse?' stays a vibe.
The Solution
A tiny eval harness built around regression diffing: define cases, grade outputs with included or custom graders, save a baseline, and on every change get the list of cases that went pass→fail. Zero dependencies and offline-runnable so it drops into CI, with your model behind one function.
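The regression diff itself can be very small. The sketch below assumes each run is stored as a mapping from case id to pass/fail; that storage shape and the names `regressions` and `exit_code` are hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return ids of cases that passed in the baseline and fail in the current run.

    Cases that are new, removed, or were already failing are not regressions
    in this sketch; a real harness might report them separately.
    """
    return sorted(
        case_id
        for case_id, passed_before in baseline.items()
        if passed_before and current.get(case_id) is False
    )


def exit_code(regressed: List[str]) -> int:
    """Map the diff to a CI exit status: non-zero when anything regressed."""
    return 1 if regressed else 0


baseline = {"greeting": True, "refund-policy": True, "json-output": False}
current = {"greeting": True, "refund-policy": False, "json-output": True}

regressed = regressions(baseline, current)  # ["refund-policy"]
```

Note that `json-output` going fail→pass does not offset the regression: the build fails on the specific case that broke, not on an aggregate score.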
Key Engineering Challenges
The deliberate design choice was making the harness testable without a model: graders and the diff engine are pure, and the model is a single injectable function, so promptproof's own test suite runs fully offline with no API key. The token-overlap grader exists for the same reason — a dependency-free, deterministic way to score free-text similarity in CI when you don't want to call a model just to test the tooling.
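A rough sketch of both ideas: a deterministic token-overlap score and a model injected as a plain function. The source does not say which overlap formula promptproof uses; token-level F1 is assumed here, and `token_overlap`, `run_case`, `fake_model`, and the 0.5 threshold are all hypothetical.

```python
import re
from collections import Counter
from typing import Callable


def token_overlap(output: str, reference: str) -> float:
    """Token-level F1 between output and reference, in [0, 1] (assumed formula)."""
    out = Counter(re.findall(r"\w+", output.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    shared = sum((out & ref).values())  # multiset intersection of tokens
    if shared == 0:
        return 0.0
    precision = shared / sum(out.values())
    recall = shared / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def run_case(
    model: Callable[[str], str],
    prompt: str,
    reference: str,
    threshold: float = 0.5,
) -> bool:
    """Call the injected model and grade its output against the reference."""
    return token_overlap(model(prompt), reference) >= threshold


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so tests need no network or API key."""
    return "Refunds are available within 30 days."
```

In tests, `fake_model` takes the place of the real client; in production the same `run_case` would receive a function that wraps an actual model call.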
Core Capabilities
System Architecture
Suite → Graders → Regression diff
"Cases run through graders to a pass/fail per case; a saved baseline turns the next run into a diff. The point isn't a score — it's catching the specific cases that regressed, which is what fails the build."