
Idempotency Keys: Making Money-Moving Endpoints Safe to Retry

Written by Shailesh Chaudhari
Full-stack engineer with a backend focus
TL;DR: Any endpoint that moves money or creates a resource will eventually be called twice for one intent — a network retry, a double-click, a redelivered webhook. The fix is an idempotency key: a client-supplied id that lets the server run the operation at most once and replay the stored result on every repeat. The subtle parts are concurrency (two retries arriving at the same instant), releasing the key when the operation fails, and rejecting a key reused for a different request. I packaged the whole pattern into idempotency-kit, a zero-dependency TypeScript library — this post walks through the design.

The bug that hides until production traffic

Hello everyone! I'm Shailesh Chaudhari, a backend engineer. Here's a charge endpoint that looks completely fine:

app.post("/charge", async (req, res) => {
  const charge = await psp.charge(req.body); // call the payment provider
  await db.orders.insert({ chargeId: charge.id });
  res.json(charge);
});

It works in every test. Then it ships, and a week later a customer is charged twice for one order. Nobody wrote a bug — the network did. The client sent the request, the charge went through, and the response got lost on the way back. The client's HTTP library, seeing no response, did the sensible thing: it retried. The server, having no memory of the first call, charged again.

This is not an edge case you can avoid. Mobile clients retry on flaky connections. Load balancers retry on timeouts. Users double-click "Pay". Payment providers redeliver webhooks until you return a 2xx — sometimes more than once even after you do. At-least-once delivery is the default of the internet. If your write endpoint isn't safe to call twice, it's a question of when, not if.

The idea: an idempotency key

An idempotency key is a unique id the client generates per intent (a UUID is fine) and sends with the request, usually as an Idempotency-Key header. The contract on the server is:

  • First time I see this key: run the operation, store the result against the key, return it.
  • Every later time I see the same key: do not run again — return the stored result.
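On the client side, the contract only works if every retry of one intent carries the same key. The sketch below illustrates that under stated assumptions: `Send` and `sendWithRetries` are hypothetical names, and the transport is any function you supply, not a particular HTTP library.

```typescript
// Hypothetical transport: sends one request and resolves with an HTTP status code.
type Send = (headers: { [name: string]: string }, body: unknown) => Promise<number>;

// Retry one intent up to maxAttempts times, reusing the SAME key on every attempt.
async function sendWithRetries(
  send: Send,
  idempotencyKey: string, // generate once per intent, e.g. a UUID
  body: unknown,
  maxAttempts = 3,
): Promise<number> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send({ "Idempotency-Key": idempotencyKey }, body);
    } catch (err) {
      // Network failure: the outcome is unknown, so retry with the same key.
      lastError = err;
    }
  }
  throw lastError;
}
```

Generating a fresh key inside the loop would defeat the purpose: the server would see each retry as a new intent.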

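One intent, one execution, no matter how many times the request arrives. This is how Stripe's Idempotency-Key header works, and it is the same idea behind the reservation engine I wrote about in Holdfast, where it is enforced with a UNIQUE constraint on the order key. Strictly speaking, the guarantee is at-most-once execution per key; combined with client retries, the caller sees an effectively-once result.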

The naive implementation is a check-then-act, and it has the same flaw as the double charge:

// DON'T do this — it has a race
const seen = await store.get(key);
if (seen) return seen.result;
const result = await run();
await store.save(key, result);

Two retries arriving at the same instant both get nothing, both pass the if, both run(). We're back to a double charge — now with extra steps. The check and the write have to be a single atomic operation. That one realization shapes the entire design.
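To see the race rather than take it on faith, here is a small simulation. It assumes an in-memory map standing in for the store and a `tick` helper standing in for network round trips; all names are illustrative.

```typescript
// In-memory stand-in for a store whose reads and writes are asynchronous.
const results = new Map<string, string>();
const tick = () => new Promise<void>((resolve) => setTimeout(resolve, 0));

let executions = 0;

async function naiveCharge(key: string): Promise<string> {
  await tick(); // simulated store.get round trip
  const seen = results.get(key);
  if (seen !== undefined) return seen;
  executions++; // the side effect we wanted to run once
  await tick(); // the operation itself takes time
  const result = `charge-${executions}`;
  results.set(key, result);
  return result;
}

// Two retries for the same intent, arriving together.
async function demo(): Promise<number> {
  await Promise.all([naiveCharge("order-42"), naiveCharge("order-42")]);
  return executions; // 2: both callers saw an empty store, so both ran
}
```

Both callers read the store before either has written to it, so the check passes twice.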

The protocol: claim → run → complete

The pattern that actually holds under concurrency splits the work into three store operations, where claim is the only one that must be atomic:

  • claim(key): atomically insert an in_progress record if the key is free. Returns "you own it" to exactly one caller; everyone else gets the existing record. In Redis this is a SET key value NX; in SQL it's an INSERT that trips a unique constraint.
  • complete(key, result): the owner runs the operation, then stores the serialized result and flips the record to completed.
  • release(key): if the operation throws, drop the in_progress claim so a later retry can try again.
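The three operations above can be sketched with an in-memory map. This is a minimal illustration, not the library's implementation: `MemoryStore` and `StoredRecord` are assumed names, and timestamps are plain numbers.

```typescript
type StoredRecord =
  | { status: "in_progress"; claimedAt: number }
  | { status: "completed"; claimedAt: number; result: string; completedAt: number };

class MemoryStore {
  private records = new Map<string, StoredRecord>();

  // Returns null to the single caller that wins; everyone else gets the existing record.
  // Atomic here because there is no await between the check and the insert, and
  // JavaScript runs that stretch of code without interleaving other callers.
  async claim(key: string, now: number): Promise<StoredRecord | null> {
    const existing = this.records.get(key);
    if (existing !== undefined) return existing;
    this.records.set(key, { status: "in_progress", claimedAt: now });
    return null;
  }

  // Called by the owner after the operation succeeded.
  async complete(key: string, result: string, now: number): Promise<void> {
    const existing = this.records.get(key);
    const claimedAt = existing ? existing.claimedAt : now;
    this.records.set(key, { status: "completed", claimedAt, result, completedAt: now });
  }

  // Called by the owner after the operation failed.
  async release(key: string): Promise<void> {
    const existing = this.records.get(key);
    // Only drop an unfinished claim; never delete a stored result.
    if (existing?.status === "in_progress") this.records.delete(key);
  }
}
```

A single-process map only protects one process. With several server instances, the same insert-if-absent has to happen in shared storage.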

Here is the core of idempotency-kit, lightly trimmed:

const existing = await store.claim(key, now());

if (existing === null) {
  // We won the claim — we own execution.
  try {
    const value = await fn();
    await store.complete(key, serialize(value), now());
    return { value, replayed: false };
  } catch (err) {
    await store.release(key); // transient failure shouldn't poison the key
    throw err;
  }
}

if (existing.status === "completed") {
  return { value: deserialize(existing.result), replayed: true };
}
// else: someone else holds an in_progress claim (see below)

Because claim is atomic, two simultaneous retries can't both win. One gets null and runs; the other gets the in_progress record and knows not to. The race is gone — not because the application code is clever, but because the one operation that must be atomic was pushed down to where atomicity is cheap and guaranteed.

What about the retry that arrives mid-flight?

The interesting case is the loser of the claim: a second request whose key is held by an in_progress claim that hasn't finished yet. There's no stored result to replay. You have two honest options, and the right one depends on the caller:

  • Fail fast. Return a 409 Conflict ("a request with this key is already in progress"). The client can retry later. This is the safe default and what Stripe does — it never blocks a connection waiting.
  • Wait and replay. Poll briefly for the winner to finish, then return its result. Nicer for the caller, but it holds a connection open, so it needs a tight bound.

The library makes this a one-line choice — default is fail-fast; opt into waiting with waitRetries:

await withIdempotency(key, fn, { store, waitRetries: 10, waitIntervalMs: 100 });
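For intuition, a bounded wait could look roughly like the sketch below. It is an assumption about the shape of such a loop, not the library's code: `waitForResult`, `Lookup`, and `Entry` are hypothetical names, and the parameter names simply mirror the options above.

```typescript
type Entry = { status: "in_progress" } | { status: "completed"; result: string };
type Lookup = (key: string) => Promise<Entry | null>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll for the claim winner to finish. Returns the stored result, or null when the
// bound is hit or the claim was released (the caller can then fail fast with a 409).
async function waitForResult(
  get: Lookup,
  key: string,
  waitRetries: number,
  waitIntervalMs: number,
): Promise<string | null> {
  for (let attempt = 0; attempt < waitRetries; attempt++) {
    await sleep(waitIntervalMs);
    const entry = await get(key);
    if (entry === null) return null; // the winner failed and released the key
    if (entry.status === "completed") return entry.result;
  }
  return null; // still in progress after waitRetries * waitIntervalMs
}
```

With `waitRetries: 10` and `waitIntervalMs: 100`, a caller is held for at most about one second before falling back to the fail-fast answer.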

The detail everyone forgets: releasing on failure

Picture this: the operation fails — the payment provider times out, the database is briefly down. If you completed the key anyway, or simply left the in_progress claim sitting there, every future retry with that key would replay a failure or be told "already in progress" forever. The user's legitimate retry can never succeed. The key is poisoned.

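That's why the catch block above calls release. A transient failure must free the key so the next attempt gets a clean shot. This is the difference between "idempotent" and "idempotent and actually usable". It's also the line most homegrown implementations miss, because it only bites when something downstream is already broken, which is exactly when you most need retries to work. One caveat: releasing is only safe when the failure means the operation did not take effect. A provider timeout is ambiguous (the charge may have gone through), so pass the same key to the provider as well, or reconcile before retrying.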
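The effect of that one line is easy to demonstrate. The sketch below is a stripped-down, single-process stand-in for the claim, run, complete flow; `runOnce` and `slots` are illustrative names, not the library's API.

```typescript
type Slot = { status: "in_progress" } | { status: "completed"; result: string };
const slots = new Map<string, Slot>();

async function runOnce(key: string, fn: () => Promise<string>): Promise<string> {
  const existing = slots.get(key);
  if (existing?.status === "completed") return existing.result; // replay
  if (existing?.status === "in_progress") throw new Error("already in progress");
  slots.set(key, { status: "in_progress" }); // claim: no await since the check
  try {
    const result = await fn();
    slots.set(key, { status: "completed", result });
    return result;
  } catch (err) {
    slots.delete(key); // release: without this, the key stays in_progress forever
    throw err;
  }
}
```

A first attempt that throws leaves no record behind, so a second attempt with the same key runs normally and a third replays its result. Delete the `slots.delete(key)` line and the second attempt gets "already in progress" instead.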

The guard Stripe ships and most clones don't: fingerprint mismatch

Here's a nastier failure. A client reuses an idempotency key for a different request — maybe it generates keys too coarsely, maybe it's a plain bug. Request one: charge $10. Request two, same key: charge $99. A naive cache happily replays the $10 result for the $99 request. No error, no double charge — just a silently wrong answer, which is worse, because nobody notices until reconciliation.

Stripe guards against this: reuse a key with different parameters and you get a 400, not a replay. To do the same, store a fingerprint of the request payload alongside the key, and compare on every hit:

await withIdempotency(key, () => psp.charge(req.body), {
  store,
  fingerprint: fingerprint(req.body), // canonical hash of the payload
});
// → throws IdempotencyFingerprintMismatchError if the key was first
//   used with a different body. Translate that to a 400.

The fingerprint has to be canonical: { amount, currency } and { currency, amount } are the same request, so key order can't change the hash. In the library I sort object keys recursively before hashing, recurse into nested objects and arrays, and deliberately avoid node:crypto so the same function runs on the edge, Deno, or a browser. The check guards against accidental reuse, not against an adversary, so a fast non-cryptographic hash is the right tool.
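A canonical fingerprint along those lines might look like this. It is a sketch under assumptions, not the library's code: the hash is a 32-bit FNV-1a variant over UTF-16 code units (my choice for illustration), and `canonicalize` is a hypothetical helper name.

```typescript
// Serialize with object keys sorted at every level, so key order cannot affect the output.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    // Array order is significant, so elements keep their positions.
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as { [k: string]: unknown };
    const parts = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  const json = JSON.stringify(value); // undefined for undefined, functions, symbols
  return json === undefined ? "null" : json;
}

// 32-bit FNV-1a style hash: fast and portable, not cryptographic.
function fingerprint(payload: unknown): string {
  const text = canonicalize(payload);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}
```

With this, `fingerprint({ amount: 1000, currency: "usd" })` and `fingerprint({ currency: "usd", amount: 1000 })` produce the same string, because both canonicalize to the same text before hashing.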

One subtlety I only caught because a test caught it: the fingerprint must survive the in_progress → completed transition. My first version stored it on claim and then overwrote the record on complete, dropping it — so the mismatch check silently stopped working after the first call finished. The test that reused a key with a different payload after completion went red, and that's the whole point of writing the test that asserts the outcome rather than the one that asserts "it returned something".
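The fix for that class of bug is to carry the claimed record forward on completion instead of writing a fresh one. A minimal sketch, with `Row`, `claim`, `complete`, and `assertSameRequest` as illustrative names rather than the library's:

```typescript
type Row = { status: "in_progress" | "completed"; fingerprint: string; result?: string };
const rows = new Map<string, Row>();

function claim(key: string, fingerprint: string): Row | null {
  const existing = rows.get(key);
  if (existing !== undefined) return existing;
  rows.set(key, { status: "in_progress", fingerprint });
  return null;
}

function complete(key: string, result: string): void {
  const existing = rows.get(key);
  if (existing === undefined) throw new Error(`no claim for ${key}`);
  // Copy the claimed record forward so the fingerprint survives completion.
  rows.set(key, { ...existing, status: "completed", result });
}

function assertSameRequest(existing: Row, fingerprint: string): void {
  if (existing.fingerprint !== fingerprint) {
    throw new Error("idempotency key reused with a different payload");
  }
}
```

The test worth writing is the one described above: claim, complete, then reuse the key with a different fingerprint and assert that it is rejected.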

Keys don't live forever (TTL)

Stored results can't accumulate without bound. Stripe's documentation says keys can be removed automatically once they are at least 24 hours old, and 24 hours is a sensible default TTL: long enough to cover any realistic retry storm, short enough that the store doesn't grow forever. After the TTL the key is free again. The trade-off is real: a retry that arrives after expiry will execute as a fresh request. So the window should comfortably exceed your longest retry path (including a payment provider's webhook redelivery schedule, which can span hours).
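One way to express expiry is to have the claim treat an expired record as absent. The sketch below assumes numeric millisecond timestamps supplied by the caller (which also keeps it testable without sleeping); `ExpiringStore` and `Item` are hypothetical names.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

type Item = { status: "in_progress" | "completed"; createdAt: number; result?: string };

class ExpiringStore {
  private items = new Map<string, Item>();
  private ttlMs: number;

  constructor(ttlMs: number = DAY_MS) {
    this.ttlMs = ttlMs;
  }

  claim(key: string, now: number): Item | null {
    const existing = this.items.get(key);
    if (existing !== undefined && now - existing.createdAt < this.ttlMs) {
      return existing; // still inside the window: replay or report in progress
    }
    // Absent or expired: the key is free, so this caller becomes the owner.
    this.items.set(key, { status: "in_progress", createdAt: now });
    return null;
  }

  complete(key: string, result: string): void {
    const existing = this.items.get(key);
    if (existing === undefined) return;
    this.items.set(key, { ...existing, status: "completed", result });
  }
}
```

In a store with native expiry, such as a Redis key written with a TTL, the storage layer does this for you. The same window also frees an in_progress claim whose owner crashed before completing or releasing.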

Where the atomicity actually lives

The thing I want to leave you with is an architecture point, not a payments point. Notice that none of the logic above is hard except the atomic claim. So that's the only thing the storage layer is responsible for. Everything else — the fail-fast-vs-wait policy, serialization, the fingerprint comparison, releasing on error — is plain, testable application code that doesn't care whether it's running over an in-memory map, Redis, or Postgres.

That's why the library defines a tiny store interface and ships an in-memory implementation for tests, and why moving to a distributed setup means writing one small adapter — a SET NX for the claim, nothing else changes:

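// Parameter types are omitted for brevity; Record here is the library's stored-entry type, not TypeScript's built-in Record<K, V>.
interface IdempotencyStore {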
  claim(key, now, fingerprint?): Promise<Record | null>; // atomic insert-if-absent
  complete(key, result, now): Promise<void>;
  release(key): Promise<void>;
  get(key): Promise<Record | null>;
}
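To make "one small adapter" concrete, here is a sketch written against a deliberately tiny key-value interface. Everything in it is an assumption for illustration: `KeyValue`, `KvStore`, and `Stored` are hypothetical, timestamps are numbers, and `setIfAbsent` stands for whatever atomic insert-if-absent your backend offers (in Redis, a SET with NX and an expiry).

```typescript
// Hypothetical minimal client; only setIfAbsent has to be atomic.
interface KeyValue {
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
  del(key: string): Promise<void>;
}

type Stored = {
  status: "in_progress" | "completed";
  claimedAt: number;
  fingerprint?: string;
  result?: string;
};

class KvStore {
  private kv: KeyValue;
  private ttlMs: number;

  constructor(kv: KeyValue, ttlMs: number) {
    this.kv = kv;
    this.ttlMs = ttlMs;
  }

  async claim(key: string, now: number, fingerprint?: string): Promise<Stored | null> {
    const mine: Stored = { status: "in_progress", claimedAt: now };
    if (fingerprint !== undefined) mine.fingerprint = fingerprint;
    const won = await this.kv.setIfAbsent(key, JSON.stringify(mine), this.ttlMs);
    if (won) return null;
    const raw = await this.kv.get(key);
    // The record can vanish between the two calls (released or expired);
    // report it as in progress so the caller fails fast and retries later.
    if (raw === null) return { status: "in_progress", claimedAt: now };
    return JSON.parse(raw) as Stored;
  }

  async complete(key: string, result: string, now: number): Promise<void> {
    const raw = await this.kv.get(key);
    const previous: Stored =
      raw === null ? { status: "in_progress", claimedAt: now } : (JSON.parse(raw) as Stored);
    // Spread the claimed record so the fingerprint is kept.
    const done: Stored = { ...previous, status: "completed", result };
    await this.kv.set(key, JSON.stringify(done), this.ttlMs);
  }

  async release(key: string): Promise<void> {
    await this.kv.del(key);
  }
}
```

Only `claim` relies on an atomic primitive. `complete` and `release` are called by the single owner of the claim, so plain reads and writes are enough for them in this sketch.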

"Push the one operation that must be atomic down to the layer that can guarantee it, and keep everything above it boring" is the same principle behind Holdfast's reservation path. Get that boundary right and the concurrency stops being scary.

Try it

The library is open source, fully typed, zero-dependency, and offline-testable (the clock is injectable, so tests are deterministic: no sleep, no real network).

If you're building anything that charges a card, sends a message, or creates a resource on a POST, make it idempotent before it ships — not after the first double-charge ticket. Claim atomically, replay on repeat, release on failure, reject mismatched reuse, and expire on a TTL. Thanks for reading!
