Duplicate battles on tournament Saturday
Symptom
First public tournament. Within the first couple of hours we had a handful of battles where two rooms got created for the same pairing — both players saw a 'Battle starting' modal, joined two parallel games, and wondered which one was real.
First hypothesis (and where it went wrong)
My first guess was a WebSocket reconnect race on shaky mobile networks. I spent an hour on that theory before realizing reconnects were behaving correctly. The real culprit showed up in the handler-duration histogram: our start-battle path had a long tail of around 2.3–2.6s when the Postgres replica was under load, and the Redlock TTL was 2 seconds. The lock was expiring mid-handler, so a second request for the same pairing could grab its own lock, and each handler went on to create a room.
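To make the timing concrete, here is a minimal sketch of the race using a fake clock and an in-memory TTL lock. Nothing here is our production code or the Redlock client: `FakeClock`, `TtlLock`, `simulate`, the `pairing:42` key and the 2.1s arrival time are all made up for illustration, and the sketch assumes the handler relies on the lock alone to prevent duplicates.

```python
class FakeClock:
    """Manually advanced clock so the example is deterministic."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class TtlLock:
    """In-memory stand-in for a TTL-based lock (hypothetical, not Redlock)."""
    def __init__(self, clock):
        self.clock = clock
        self.expires_at = {}  # key -> time at which the lock lapses

    def acquire(self, key, ttl_seconds):
        expiry = self.expires_at.get(key)
        if expiry is not None and expiry > self.clock.now:
            return False  # someone else still holds an unexpired lock
        self.expires_at[key] = self.clock.now + ttl_seconds
        return True


def simulate(handler_seconds, ttl_seconds, second_request_at):
    """Return the rooms created when request B arrives while A is still running."""
    clock = FakeClock()
    lock = TtlLock(clock)
    rooms = []

    if not lock.acquire("pairing:42", ttl_seconds):  # request A
        raise RuntimeError("first acquire should always succeed")

    clock.advance(second_request_at)                 # A is still in its handler
    if lock.acquire("pairing:42", ttl_seconds):      # request B
        rooms.append("room-from-B")

    clock.advance(handler_seconds - second_request_at)
    rooms.append("room-from-A")                      # A finishes, unaware its lock lapsed
    return rooms
```

With a 2.4s handler and a second request at 2.1s, `simulate(2.4, 2.0, 2.1)` yields two rooms, while `simulate(2.4, 8.0, 2.1)` yields one.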
Fix
Bumped the TTL to 8 seconds and added Redlock auto-renewal — a 500ms heartbeat that calls `lock.extend()` while the handler runs. Also added a `handler_duration_seconds` Prometheus histogram so we'd notice next time a handler started creeping toward the lock TTL. The code change was small, but the habit of 'always measure before you pick a TTL' was the real takeaway.
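The heartbeat idea can be sketched in a few lines. This is an illustration of the pattern, not our actual change: `keep_alive` and `CountingLock` are hypothetical names, and the only thing assumed about the lock object is that it has an `extend(ttl_seconds)` method. A real version would also need to decide what to do when an extension fails, for example aborting the handler.

```python
import threading
from contextlib import contextmanager


@contextmanager
def keep_alive(lock, ttl_seconds, interval_seconds):
    """Call lock.extend(ttl_seconds) every interval_seconds while the body runs."""
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop ends promptly when the handler finishes.
        while not stop.wait(interval_seconds):
            lock.extend(ttl_seconds)

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        yield
    finally:
        stop.set()
        worker.join()


class CountingLock:
    """Hypothetical stand-in that only records how often it was extended."""
    def __init__(self):
        self.extensions = 0

    def extend(self, ttl_seconds):
        self.extensions += 1
```

Usage would look like `with keep_alive(lock, ttl_seconds=8.0, interval_seconds=0.5): run_handler()`, where `run_handler` stands in for the start-battle path. The interval should stay well below the TTL so a single missed beat does not let the lock lapse.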
Confirmed by
The duplicate-battle counter stayed at 0 through the next tournament. The new histogram also surfaced two slow paths (a missing join index and a chatty N+1 in the ranking query) — neither had been loud enough to matter on their own, but both were already halfway to the old TTL.
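The "halfway to the TTL" observation can be turned into a simple check. The sketch below is one possible way to do it, not something we shipped: `percentile`, `paths_near_ttl`, the 0.5 threshold and the choice of p99 are all assumptions, and it works on raw duration samples rather than on Prometheus buckets.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def paths_near_ttl(durations_by_path, ttl_seconds, threshold=0.5, pct=99):
    """Return the paths whose pct-th percentile duration reaches threshold * TTL."""
    limit = ttl_seconds * threshold
    return sorted(
        path
        for path, samples in durations_by_path.items()
        if percentile(samples, pct) >= limit
    )
```

Given per-path duration samples in seconds, `paths_near_ttl(samples, ttl_seconds=2.0)` lists every path whose p99 is at least 1.0s, which is the kind of early warning that would have flagged the two slow paths before they mattered.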




