Retries look easy at first:
if request failed: retry
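Concretely, that naive policy looks something like this sketch (the function names are illustrative, not from any real codebase):

```python
def call_with_naive_retry(request, max_attempts=3):
    # Naive policy: on failure, retry immediately with no delay.
    # Under load, every client does this at the same moment.
    last_error = None
    for _ in range(max_attempts):
        try:
            return request()
        except Exception as e:
            last_error = e  # retry right away: no backoff, no jitter
    raise last_error
```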
In production, that tiny line can make your system either resilient or unstable.
We learned this the hard way in our notification system.
We had traffic spikes where a lot of notifications were triggered in a short window. Our downstream provider started timing out and returning transient errors.
At that point, we were still on linear backoff with no jitter. It looked simple, but under spike conditions it kept retries too tightly synchronized. We got repeated retry waves: a big spike, then another spike, then another smaller one. The system eventually recovered, but only after several rounds of avoidable pressure.
This is where backoff and jitter matter.
Imagine a burst of notification jobs hitting one downstream API. The API returns 503 for a short period.
If every client retries immediately, then retries again immediately, you get synchronized spikes.
The dependency has no breathing room to recover.
Even worse, if you have multiple layers (frontend -> service A -> service B -> database), retries can multiply across layers. A single user action can fan out into many repeated calls.
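A rough worked example of that multiplication, assuming each layer independently retries the layer below (a simplification: it ignores deduplication and early success):

```python
def worst_case_calls(retries_per_layer, layers):
    # Each layer may issue (1 + retries) attempts against the next layer,
    # so in the worst case attempts multiply across layers.
    return (1 + retries_per_layer) ** layers

# 2 retries at each of 3 layers: one user action can fan out into
# 27 calls at the deepest dependency.
print(worst_case_calls(2, 3))  # -> 27
```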
Backoff means waiting before retrying. Exponential backoff increases the wait each time.
Example with base delay of 100ms:
100ms, 200ms, 400ms, 800ms, ...

Usually you also cap the delay:
delay = min(maxDelay, base * 2^attempt)
This reduces pressure on the dependency and avoids burning your own CPU/network on tight retry loops.
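The capped formula above can be sketched in Python (parameter names are illustrative):

```python
def backoff_delay(attempt, base=0.1, max_delay=2.0):
    # delay = min(maxDelay, base * 2^attempt), with attempt starting at 0
    return min(max_delay, base * (2 ** attempt))

# base 100ms: 0.1, 0.2, 0.4, 0.8, 1.6, then capped at 2.0
print([backoff_delay(a) for a in range(6)])
```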
Without jitter, all clients using the same backoff formula still retry at nearly the same timestamps.
Jitter randomizes each delay so retries spread out over time. That smooths traffic and gives recovering systems a chance.
Common strategies:
- Full jitter: sleep = random(0, backoffDelay)
- Equal jitter: sleep = backoffDelay / 2 + random(0, backoffDelay / 2)
- +-50% jitter

Full jitter is often a strong default, and AWS guidance plus simulations show why it works well in many systems.

Our journey looked like this:

- Exponential backoff without jitter (first improvement).
- +-10% jitter (second improvement).
- +-50% jitter (final choice).

Each step improved things, but only the last one spread retries enough for our notification spikes while still keeping a minimum retry delay.
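These strategies can be written as small Python helpers (a sketch of the formulas above, not a library API):

```python
import random

def full_jitter(delay):
    # sleep = random(0, backoffDelay)
    return random.uniform(0, delay)

def equal_jitter(delay):
    # sleep = backoffDelay/2 + random(0, backoffDelay/2)
    return delay / 2 + random.uniform(0, delay / 2)

def proportional_jitter(delay, ratio=0.5):
    # +-ratio around the backoff delay; ratio=0.1 for +-10%, 0.5 for +-50%
    return delay + random.uniform(-ratio * delay, ratio * delay)
```

Note the key difference: proportional jitter keeps a floor under the delay (at least half of it with ratio=0.5), while full jitter can return values arbitrarily close to zero.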
Example with base backoff 30ms:
- No jitter: every retry fires at exactly 30ms
- +-10% jitter: retries still cluster in a narrow 27-33ms window
- +-50% jitter: retries spread much wider in a 15-45ms window
- Full jitter: retries land anywhere in 0-30ms

When many retries are involved, +-10% still creates dense mini-spikes. They are smaller than no jitter, but still synchronized enough to keep overwhelming the same downstream bottleneck.
In theory, full jitter spreads load even more aggressively. But for our notification flow, we intentionally wanted to keep a minimum delay between attempts and avoid retries happening too close to zero delay.
So we moved to +-50% jitter. That gave us much better spreading than +-10%, while preserving a floor on delay.
Another way to think about it:
- +-10% jitter mostly shifts a spike.
- +-50% jitter actually broadens and flattens it for our latency budget.

Not every error is retryable.
Usually retryable:
- 429 (rate limited)
- 502, 503, 504

Usually not retryable:

- 400, 401, 403, 404

Also: only retry operations that are idempotent or protected by an idempotency key. Stripe has an excellent practical write-up on why this matters in real APIs.
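One possible way to encode those rules, as a hypothetical helper (the status sets mirror the lists above):

```python
RETRYABLE_STATUS = {429, 502, 503, 504}

def should_retry(status_code, idempotent):
    # Retry only transient, server-side failures, and only when the
    # operation is idempotent (or protected by an idempotency key).
    return idempotent and status_code in RETRYABLE_STATUS
```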
Retries help with short-lived failures. They do not solve systemic overload by themselves.
Pair retries with other resilience mechanisms, and make both observable.
If your dashboards do not show retry volume and success-after-retry rates, you are mostly flying blind.
Retries are a powerful tool, but only when controlled.
The formula is simple: exponential backoff, a capped delay, enough jitter to spread retries out, and retries only for errors that are safe to retry.
Done well, retries make your system graceful under turbulence. Done poorly, they become the turbulence.
In our case, the biggest change was moving from linear/no-jitter behavior to exponential backoff with +-50% jitter, after evaluating exponential backoff without jitter and with +-10% jitter on the way.
Our timings are also intentionally short:
- Base delay: 30ms
- Max delay: 1.5s

That combination fit our notification workload better than a full-jitter strategy with near-zero delays.
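Putting the pieces together, here is a sketch of a retry loop with exponential backoff, a 30ms base, a 1.5s cap, and +-50% jitter. The names and structure are ours for illustration, not the actual production code:

```python
import random
import time

def retry_with_backoff(request, max_attempts=5,
                       base=0.030, max_delay=1.5, jitter_ratio=0.5):
    # Exponential backoff with a cap, then +-50% proportional jitter.
    last_error = None
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as e:
            last_error = e
            if attempt == max_attempts - 1:
                break  # out of attempts; surface the last error
            delay = min(max_delay, base * (2 ** attempt))
            delay += random.uniform(-jitter_ratio * delay, jitter_ratio * delay)
            time.sleep(delay)
    raise last_error
```

In a real system this would also consult a should-retry check (status code, idempotency) before sleeping, rather than retrying every exception.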