Retries look easy at first:
if request failed: retry
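Concretely, that naive policy looks something like this sketch (the function names are illustrative, not from any real codebase):

```python
def call_with_naive_retry(request, max_attempts=3):
    # Naive policy: on failure, retry immediately with no delay.
    # Under load, every client does this at the same moment.
    last_error = None
    for _ in range(max_attempts):
        try:
            return request()
        except Exception as e:
            last_error = e  # retry right away: no backoff, no jitter
    raise last_error
```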
In production, that tiny line can make your system either resilient or unstable.
We learned this the hard way in our notification system.
We had traffic spikes where a lot of notifications were triggered in a short window. Our downstream provider started timing out and returning transient errors.
At that point, we were still on linear backoff with no jitter. It looked simple, but under spike conditions it kept retries too tightly synchronized. We got repeated retry waves: a big spike, then another spike, then another smaller one. The system eventually recovered, but only after several rounds of avoidable pressure.
This is where backoff and jitter matter.
Imagine a burst of notification jobs hitting one downstream API. The API returns 503 for a short period.
If every client retries immediately, then retries again immediately, you get synchronized spikes.
The dependency has no breathing room to recover.
Even worse, if you have multiple layers (frontend -> service A -> service B -> database), retries can multiply across layers. A single user action can fan out into many repeated calls.
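A rough worked example of that multiplication, assuming each layer independently retries the layer below (a simplification: it ignores deduplication and early success):

```python
def worst_case_calls(retries_per_layer, layers):
    # Each layer may issue (1 + retries) attempts against the next layer,
    # so in the worst case attempts multiply across layers.
    return (1 + retries_per_layer) ** layers

# 2 retries at each of 3 layers: one user action can fan out into
# 27 calls at the deepest dependency.
print(worst_case_calls(2, 3))  # -> 27
```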
Backoff means waiting before retrying. Exponential backoff increases the wait each time.
Example with base delay of 100ms:
100ms, 200ms, 400ms, 800ms, ...

Usually you also cap the delay:
delay = min(maxDelay, base * 2^attempt)
This reduces pressure on the dependency and avoids burning your own CPU/network on tight retry loops.
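The capped formula above can be sketched in Python (parameter names are illustrative):

```python
def backoff_delay(attempt, base=0.1, max_delay=2.0):
    # delay = min(maxDelay, base * 2^attempt), with attempt starting at 0
    return min(max_delay, base * (2 ** attempt))

# base 100ms: 0.1, 0.2, 0.4, 0.8, 1.6, then capped at 2.0
print([backoff_delay(a) for a in range(6)])
```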
Without jitter, all clients using the same backoff formula still retry at nearly the same timestamps.
Jitter randomizes each delay so retries spread out over time. That smooths traffic and gives recovering systems a chance.
Common strategies:
- Full jitter: sleep = random(0, backoffDelay)
- Equal jitter: sleep = backoffDelay / 2 + random(0, backoffDelay / 2)
- +-50% jitter

Full jitter is often a strong default, and AWS guidance plus simulations show why it works well in many systems.

Our journey looked like this:

- Exponential backoff without jitter (first improvement).
- +-10% jitter (second improvement).
- +-50% jitter (final choice).

Each step improved things, but only the last one spread retries enough for our notification spikes while still keeping a minimum retry delay.
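These strategies can be written as small Python helpers (a sketch of the formulas above, not a library API):

```python
import random

def full_jitter(delay):
    # sleep = random(0, backoffDelay)
    return random.uniform(0, delay)

def equal_jitter(delay):
    # sleep = backoffDelay/2 + random(0, backoffDelay/2)
    return delay / 2 + random.uniform(0, delay / 2)

def proportional_jitter(delay, ratio=0.5):
    # +-ratio around the backoff delay; ratio=0.1 for +-10%, 0.5 for +-50%
    return delay + random.uniform(-ratio * delay, ratio * delay)
```

Note the key difference: proportional jitter keeps a floor under the delay (at least half of it with ratio=0.5), while full jitter can return values arbitrarily close to zero.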
Example with base backoff 30ms:
- No jitter: every retry fires at exactly 30ms
- +-10% jitter: retries still cluster in a narrow 27-33ms window
- +-50% jitter: retries spread much wider in a 15-45ms window
- Full jitter: retries land anywhere in 0-30ms

When many retries are involved, +-10% still creates dense mini-spikes. They are smaller than no jitter, but still synchronized enough to keep overwhelming the same downstream bottleneck.
In theory, full jitter spreads load even more aggressively. But for our notification flow, we intentionally wanted to keep a minimum delay between attempts and avoid retries happening too close to zero delay.
So we moved to +-50% jitter. That gave us much better spreading than +-10%, while preserving a floor on delay.
Another way to think about it:
- +-10% jitter mostly shifts a spike.
- +-50% jitter actually broadens and flattens it for our latency budget.

Not every error is retryable.
Usually retryable:
- 429 (rate limited)
- 502, 503, 504

Usually not retryable:

- 400, 401, 403, 404

Also: only retry operations that are idempotent or protected by an idempotency key. Stripe has an excellent practical write-up on why this matters in real APIs.
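One possible way to encode those rules, as a hypothetical helper (the status sets mirror the lists above):

```python
RETRYABLE_STATUS = {429, 502, 503, 504}

def should_retry(status_code, idempotent):
    # Retry only transient, server-side failures, and only when the
    # operation is idempotent (or protected by an idempotency key).
    return idempotent and status_code in RETRYABLE_STATUS
```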
Retries help with short-lived failures. They do not solve systemic overload by themselves.
Pair retries with other resilience mechanisms, and make both observable.
If your dashboards do not show retry volume and success-after-retry rates, you are mostly flying blind.
Retries are a powerful tool, but only when controlled.
The formula is simple: exponential backoff, a capped delay, enough jitter to spread retries out, and retries only for errors that are safe to retry.
Done well, retries make your system graceful under turbulence. Done poorly, they become the turbulence.
In our case, the biggest change was moving from linear/no-jitter behavior to exponential backoff with +-50% jitter, after evaluating exponential backoff without jitter and with +-10% jitter on the way.
Our timings are also intentionally short:
- Base delay: 30ms
- Max delay: 1.5s

That combination fit our notification workload better than a full-jitter strategy with near-zero delays.
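Putting the pieces together, here is a sketch of a retry loop with exponential backoff, a 30ms base, a 1.5s cap, and +-50% jitter. The names and structure are ours for illustration, not the actual production code:

```python
import random
import time

def retry_with_backoff(request, max_attempts=5,
                       base=0.030, max_delay=1.5, jitter_ratio=0.5):
    # Exponential backoff with a cap, then +-50% proportional jitter.
    last_error = None
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as e:
            last_error = e
            if attempt == max_attempts - 1:
                break  # out of attempts; surface the last error
            delay = min(max_delay, base * (2 ** attempt))
            delay += random.uniform(-jitter_ratio * delay, jitter_ratio * delay)
            time.sleep(delay)
    raise last_error
```

In a real system this would also consult a should-retry check (status code, idempotency) before sleeping, rather than retrying every exception.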