promptproof

TypeScript · LLM Eval · Regression Testing · CI · Zero-dependency · Vitest

Engineering Overview

promptproof is the small, boring tool that stops a prompt change from quietly breaking three things you weren't looking at. You define eval cases (input + expected + graders), run the suite, and save a baseline. On every later change, it diffs the new run against that baseline and answers 'did this make things worse?' with a list of pass→fail regressions, not a feeling. It ships graders for exact match, substring includes, regex (with negate), JSON-shape, and an offline token-overlap scorer, and a custom grader is one function. The harness has zero runtime dependencies and runs offline in CI: your model sits behind a single function, so testing the harness itself needs no API key.

The Problem

You change a prompt to fix one thing and three others quietly break — and you find out from a customer. LLM output is non-deterministic enough that ad-hoc manual checks miss regressions, and most teams have no baseline to diff against, so 'is this better or worse?' stays a vibe.

The Solution

A tiny eval harness built around regression diffing: define cases, grade outputs with included or custom graders, save a baseline, and on every change get the list of cases that went pass→fail. It has zero runtime dependencies and runs offline, so it drops into CI, with your model behind one function.

Key Engineering Challenges

The deliberate design choice was making the harness testable without a model: graders and the diff engine are pure, and the model is a single injectable function, so promptproof's own test suite runs fully offline with no API key. The token-overlap grader exists for the same reason — a dependency-free, deterministic way to score free-text similarity in CI when you don't want to call a model just to test the tooling.
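To illustrate why a pure scorer keeps the tooling offline, here is a minimal sketch of one way a token-overlap score could be computed. The tokenization rule, the Jaccard formula, the 0.6 threshold, and the names `tokenize`, `tokenOverlap`, and `tokenOverlapGrader` are all assumptions for this example; the overview does not say which overlap measure promptproof actually uses.

```typescript
// Hypothetical sketch: not promptproof's actual implementation.
// Tokenize by lowercasing and splitting on anything that is not an ASCII letter or digit.
function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((token) => token.length > 0),
  );
}

// Jaccard overlap of the two token sets: |A ∩ B| / |A ∪ B|, always in [0, 1].
function tokenOverlap(output: string, expected: string): number {
  const a = tokenize(output);
  const b = tokenize(expected);
  if (a.size === 0 && b.size === 0) {
    return 1; // two empty token sets are treated as identical
  }
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) {
      shared += 1;
    }
  });
  return shared / (a.size + b.size - shared);
}

// A grader is then a threshold on the score (0.6 is an arbitrary example value).
function tokenOverlapGrader(output: string, expected: string, threshold = 0.6): boolean {
  return tokenOverlap(output, expected) >= threshold;
}
```

Because the score depends only on the two strings, the same inputs always give the same result, which is the property that lets a scorer like this run in CI without a model call.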

Core Capabilities

Regression diffing — save a baseline, fail the build on any pass→fail

Graders included: exact, includes, regex (negate), JSON-shape, token-overlap

Bring-your-own grader and model — each in one function

Zero runtime dependencies — one small library + a CLI

Runs offline in CI — no API key needed to test the harness itself

TypeScript, ESM, fully typed

System Architecture

Suite

Cases: input + expected + graders

runSuite() executes each case against your model function
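The suite step above can be sketched roughly as follows. This is an illustration under stated assumptions, not promptproof's real API: the type names, the `runSuiteSketch` function, the rule that a case passes only when all of its graders pass, and the stub model are all hypothetical.

```typescript
// Hypothetical shapes: illustrative names, not promptproof's actual API.
type Grader = (output: string, expected: string) => boolean;
type Model = (input: string) => Promise<string>;

interface EvalCase {
  id: string;
  input: string;
  expected: string;
  graders: Grader[];
}

interface CaseResult {
  id: string;
  output: string;
  pass: boolean;
}

// Run every case through the injected model.
// Assumption: a case passes only if all of its graders pass.
async function runSuiteSketch(cases: EvalCase[], model: Model): Promise<CaseResult[]> {
  const results: CaseResult[] = [];
  for (const evalCase of cases) {
    const output = await model(evalCase.input);
    const pass = evalCase.graders.every((grade) => grade(output, evalCase.expected));
    results.push({ id: evalCase.id, output, pass });
  }
  return results;
}

// A stub model keeps the run offline and deterministic: no API key involved.
const stubModel: Model = async (input) => (input === "2+2" ? "4" : "unknown");

// Exact-match grader, ignoring surrounding whitespace.
const exactMatch: Grader = (output, expected) => output.trim() === expected.trim();

const demoCases: EvalCase[] = [
  { id: "math", input: "2+2", expected: "4", graders: [exactMatch] },
  { id: "capital", input: "capital of France", expected: "Paris", graders: [exactMatch] },
];
```

Swapping `stubModel` for a function that calls a real model changes nothing else in the run, which is the point of keeping the model behind a single function.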

Graders

exact, includes, regex (with negate), jsonShape

offline token-overlap scorer

bring-your-own grader in one function
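A rough sketch of what graders like these might look like as plain functions. The `Grader` signature, the factory names, and the exact semantics (especially what counts as a matching JSON shape) are assumptions made for illustration, not promptproof's actual API.

```typescript
// Hypothetical sketch: illustrative names and semantics only.
type Grader = (output: string) => boolean;

// Regex grader. With negate set, the case passes only when the pattern does NOT match.
function regexGrader(pattern: RegExp, negate = false): Grader {
  return (output) => {
    pattern.lastIndex = 0; // reset state so patterns with the g or y flag stay deterministic
    return pattern.test(output) !== negate;
  };
}

type FieldType = "string" | "number" | "boolean";

// JSON-shape grader. Assumption: the output must parse as a JSON object
// whose listed keys have the given primitive types; extra keys are allowed.
function jsonShapeGrader(shape: Record<string, FieldType>): Grader {
  return (output) => {
    let parsed: unknown;
    try {
      parsed = JSON.parse(output);
    } catch {
      return false; // not valid JSON at all
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return false; // valid JSON, but not an object
    }
    const record = parsed as Record<string, unknown>;
    return Object.keys(shape).every((key) => typeof record[key] === shape[key]);
  };
}

// A custom grader is just another function with the same signature.
const underTwoHundredChars: Grader = (output) => output.length < 200;
```

The negate option is useful for "must not say" checks, for example `regexGrader(/sorry/i, true)` to fail any output that apologizes.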

Regression diff

Save a baseline run

Diff new run vs baseline → fail CI on any pass→fail
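The diff step can be sketched in a few lines. The result shape (`RunResults`), the function names, and the handling of cases that exist in only one of the two runs are assumptions for this example; the overview only states that pass→fail transitions fail the build.

```typescript
// Hypothetical sketch: the result shape and function names are assumptions.
type RunResults = Record<string, boolean>; // case id -> did it pass?

// Return the ids that passed in the baseline and fail in the current run.
// Assumption: cases missing from either run are not counted as regressions.
function findRegressions(baseline: RunResults, current: RunResults): string[] {
  return Object.keys(baseline)
    .filter((id) => baseline[id] === true && current[id] === false)
    .sort();
}

// A CI wrapper would exit non-zero whenever the regression list is non-empty.
function exitCodeFor(regressions: string[]): number {
  return regressions.length > 0 ? 1 : 0;
}

const baselineRun: RunResults = { greeting: true, refund: true, summary: false };
const currentRun: RunResults = { greeting: true, refund: false, summary: false, newCase: false };

const regressions = findRegressions(baselineRun, currentRun);
// regressions is ["refund"]: "summary" was already failing, and "newCase" has no baseline.
```

Note that the diff reports specific case ids rather than an aggregate score, so a run whose overall pass rate is unchanged can still fail the build if one case regressed while another was fixed.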

"Cases run through graders to a pass/fail per case; a saved baseline turns the next run into a diff. The point isn't a score — it's catching the specific cases that regressed, which is what fails the build."

Tech Stack

Language: TypeScript (ESM, fully typed)

Dependencies: none at runtime (library + CLI)

Graders: exact / includes / regex / jsonShape / token-overlap / custom

Tests: Vitest (21 tests, offline)

Inquiry

Interested in discussing this engineering approach?

Message Shailesh →