Retries with Exponential Backoff and Jitter
Overview
In distributed systems, transient failures are inevitable: network hiccups, temporary service overload, rate limiting, database connection timeouts. The naive approach of immediately retrying failed requests can worsen the problem, creating thundering herds that amplify outages. Exponential backoff with jitter is a proven pattern that makes retries both resilient and respectful of downstream services.
The Problem: Thundering Herds
When a service experiences a brief outage and recovers, thousands of clients may retry simultaneously, creating a synchronized wave of traffic that overwhelms the recovering service. This retry storm can prevent the service from recovering, creating a self-perpetuating outage.
Example scenario:
- A database becomes temporarily unavailable
- 10,000 application instances fail simultaneously
- All instances retry after 1 second
- Database receives 10,000 requests simultaneously
- Database overloads and crashes again
- Cycle repeats
The Solution: Exponential Backoff with Jitter
Exponential backoff spaces retries out geometrically (1s, 2s, 4s, 8s, ...), giving services time to recover; jitter adds randomness so clients don't retry in lockstep.
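For concreteness, here is a minimal, self-contained Go sketch (the values are illustrative, not taken from the implementations below) that prints the raw, un-jittered schedule for a 100ms base delay, a 2x multiplier, and a 30s cap:

package main

import (
	"fmt"
	"time"
)

func main() {
	base := 100 * time.Millisecond
	maxDelay := 30 * time.Second
	for attempt := 1; attempt <= 10; attempt++ {
		// Raw exponential delay: base * 2^(attempt-1), capped at maxDelay.
		delay := base << (attempt - 1)
		if delay > maxDelay {
			delay = maxDelay
		}
		fmt.Printf("attempt %d: wait up to %v\n", attempt, delay)
	}
}

This prints 100ms, 200ms, 400ms, and so on up to the 30s cap. Full jitter would then draw each actual sleep uniformly from [0, delay], as described next.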
Key Components
- Base delay: Initial retry delay (e.g., 100ms)
- Multiplier: How much to increase each retry (typically 2x)
- Max delay: Upper bound to prevent infinite waits (e.g., 30s)
- Max attempts: Give up after N retries
- Jitter strategy: How to add randomness
Jitter Strategies
Full Jitter (Recommended):
delay = random(0, min(max_delay, base_delay * 2^attempt))
Provides the widest spread of retry times; best for high-concurrency scenarios. (Here attempt counts from 0; the implementations below count from 1 and use 2^(attempt-1), which yields the same schedule.)
Equal Jitter:
temp = min(max_delay, base_delay * 2^attempt)
delay = temp/2 + random(0, temp/2)
Guarantees a minimum wait while adding randomness.
Decorrelated Jitter:
delay = min(max_delay, random(base_delay, prev_delay * 3))
Creates more variation between retries, further reducing correlation.
Implementation Examples
Go Implementation
package retry
import (
	"context"
	"errors"
	"fmt"
	"math"
	"math/rand"
	"time"
)
type Config struct {
BaseDelay time.Duration
MaxDelay time.Duration
MaxRetries int
}
// RetryableFunc is a function that can be retried
type RetryableFunc func(ctx context.Context) error
// ShouldRetry determines if an error is retryable
type ShouldRetry func(error) bool
// WithExponentialBackoff retries a function with exponential backoff and jitter
func WithExponentialBackoff(
ctx context.Context,
cfg Config,
shouldRetry ShouldRetry,
fn RetryableFunc,
) error {
var lastErr error
for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
// First attempt doesn't need delay
if attempt > 0 {
delay := calculateDelay(cfg, attempt)
select {
case <-time.After(delay):
case <-ctx.Done():
return ctx.Err()
}
}
lastErr = fn(ctx)
// Success
if lastErr == nil {
return nil
}
// Non-retryable error
if !shouldRetry(lastErr) {
return lastErr
}
// Context cancelled/deadline exceeded
if ctx.Err() != nil {
return ctx.Err()
}
}
	return fmt.Errorf("max retries (%d) exceeded: %w", cfg.MaxRetries, lastErr)
}
// calculateDelay implements full jitter exponential backoff
func calculateDelay(cfg Config, attempt int) time.Duration {
// Calculate exponential delay
expDelay := float64(cfg.BaseDelay) * math.Pow(2, float64(attempt-1))
// Cap at max delay
maxJitter := math.Min(expDelay, float64(cfg.MaxDelay))
// Apply full jitter: random between 0 and maxJitter
jitter := rand.Float64() * maxJitter
return time.Duration(jitter)
}
// Sentinel errors referenced below; in real code these would come from
// your transport or client package.
var (
	ErrServiceUnavailable = errors.New("service unavailable")
	ErrRateLimited        = errors.New("rate limited")
	ErrConnectionFailed   = errors.New("connection failed")
)

// IsRetryable reports whether an error is worth retrying.
func IsRetryable(err error) bool {
if err == nil {
return false
}
// Retryable conditions
switch {
case errors.Is(err, context.DeadlineExceeded):
return true
case errors.Is(err, ErrServiceUnavailable):
return true
case errors.Is(err, ErrRateLimited):
return true
case errors.Is(err, ErrConnectionFailed):
return true
default:
return false
}
}
Usage example:
func FetchData(ctx context.Context, url string) ([]byte, error) {
var result []byte
cfg := retry.Config{
BaseDelay: 100 * time.Millisecond,
MaxDelay: 10 * time.Second,
MaxRetries: 5,
}
err := retry.WithExponentialBackoff(
ctx,
cfg,
retry.IsRetryable,
func(ctx context.Context) error {
var err error
result, err = httpClient.Get(ctx, url)
return err
},
)
return result, err
}
Python Implementation
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar
from dataclasses import dataclass
T = TypeVar('T')
@dataclass
class RetryConfig:
base_delay: float = 0.1 # 100ms
max_delay: float = 10.0 # 10 seconds
max_retries: int = 5
jitter: str = "full" # "full", "equal", or "decorrelated"
class MaxRetriesExceeded(Exception):
pass
async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],
config: RetryConfig,
should_retry: Optional[Callable[[Exception], bool]] = None,
) -> T:
"""
Retry a function with exponential backoff and jitter.
Args:
fn: Async function to retry
config: Retry configuration
should_retry: Function to determine if error is retryable
Returns:
Result of successful function call
Raises:
MaxRetriesExceeded: When max retries reached
Exception: When non-retryable error occurs
"""
last_error = None
prev_delay = config.base_delay
for attempt in range(config.max_retries + 1):
try:
if attempt > 0:
delay = _calculate_delay(config, attempt, prev_delay)
await asyncio.sleep(delay)
prev_delay = delay
return await fn()
except Exception as e:
last_error = e
# Check if error is retryable
if should_retry and not should_retry(e):
raise
# Don't retry on last attempt
if attempt == config.max_retries:
raise MaxRetriesExceeded(
f"Max retries ({config.max_retries}) exceeded"
) from e
# Should never reach here
raise last_error
def _calculate_delay(
config: RetryConfig,
attempt: int,
prev_delay: float
) -> float:
"""Calculate delay with jitter based on strategy."""
# Exponential component
exp_delay = config.base_delay * (2 ** (attempt - 1))
if config.jitter == "full":
# Full jitter: random(0, min(max_delay, exp_delay))
max_jitter = min(config.max_delay, exp_delay)
return random.uniform(0, max_jitter)
elif config.jitter == "equal":
# Equal jitter: temp/2 + random(0, temp/2)
temp = min(config.max_delay, exp_delay)
return temp / 2 + random.uniform(0, temp / 2)
elif config.jitter == "decorrelated":
# Decorrelated: random(base_delay, prev_delay * 3)
delay = random.uniform(config.base_delay, prev_delay * 3)
return min(config.max_delay, delay)
else:
raise ValueError(f"Unknown jitter strategy: {config.jitter}")
# Retryable error checker
def is_retryable(error: Exception) -> bool:
"""Determine if an error should be retried."""
retryable_types = (
asyncio.TimeoutError,
ConnectionError,
# Add your custom exceptions here
)
return isinstance(error, retryable_types)
# Usage example (aiohttp is a third-party dependency)
import aiohttp

async def fetch_data(url: str) -> dict:
"""Fetch data with automatic retries."""
async def _fetch():
async with aiohttp.ClientSession() as session:
async with session.get(url, timeout=5) as response:
response.raise_for_status()
return await response.json()
config = RetryConfig(
base_delay=0.1,
max_delay=10.0,
max_retries=5,
jitter="full"
)
return await retry_with_backoff(_fetch, config, is_retryable)
React/TypeScript Implementation (API Client)
// retry.ts
interface RetryConfig {
baseDelay: number;
maxDelay: number;
maxRetries: number;
jitter: 'full' | 'equal' | 'decorrelated';
}
type ShouldRetryFn = (error: Error) => boolean;
export class MaxRetriesExceededError extends Error {
constructor(message: string, public lastError: Error) {
super(message);
this.name = 'MaxRetriesExceededError';
}
}
export async function retryWithBackoff<T>(
fn: () => Promise<T>,
config: RetryConfig,
shouldRetry?: ShouldRetryFn
): Promise<T> {
let lastError: Error | null = null;
let prevDelay = config.baseDelay;
for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
try {
if (attempt > 0) {
const delay = calculateDelay(config, attempt, prevDelay);
await sleep(delay);
prevDelay = delay;
}
return await fn();
} catch (error) {
lastError = error as Error;
// Check if retryable
if (shouldRetry && !shouldRetry(lastError)) {
throw error;
}
// Last attempt
if (attempt === config.maxRetries) {
throw new MaxRetriesExceededError(
`Max retries (${config.maxRetries}) exceeded`,
lastError
);
}
}
}
throw lastError;
}
function calculateDelay(
config: RetryConfig,
attempt: number,
prevDelay: number
): number {
const expDelay = config.baseDelay * Math.pow(2, attempt - 1);
switch (config.jitter) {
case 'full': {
const maxJitter = Math.min(config.maxDelay, expDelay);
return Math.random() * maxJitter;
}
case 'equal': {
const temp = Math.min(config.maxDelay, expDelay);
return temp / 2 + Math.random() * (temp / 2);
}
case 'decorrelated': {
const delay = Math.random() * (prevDelay * 3 - config.baseDelay) + config.baseDelay;
return Math.min(config.maxDelay, delay);
}
    default: {
      // Unreachable given the union type; keeps noImplicitReturns happy.
      throw new Error(`Unknown jitter strategy: ${config.jitter}`);
    }
  }
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
// Check if HTTP error is retryable
export function isRetryableHttpError(error: Error): boolean {
if ('status' in error) {
const status = (error as any).status;
// Retry on 5xx server errors and 429 rate limits
return status >= 500 || status === 429;
}
// Retry on network errors
return error.message.includes('fetch') || error.message.includes('network');
}
// React hook for retryable API calls
import { useCallback, useMemo, useState } from 'react';
export function useRetryableApi<T>(config?: Partial<RetryConfig>) {
const [loading, setLoading] = useState(false);
const [error, setError] = useState<Error | null>(null);
  // Memoized so `execute` is not recreated on every render; callers should
  // also pass a stable `config` object for this to help.
  const defaultConfig = useMemo<RetryConfig>(
    () => ({
      baseDelay: 100,
      maxDelay: 10000,
      maxRetries: 5,
      jitter: 'full',
      ...config,
    }),
    [config]
  );
const execute = useCallback(
async (fn: () => Promise<T>): Promise<T | null> => {
setLoading(true);
setError(null);
try {
const result = await retryWithBackoff(
fn,
defaultConfig,
isRetryableHttpError
);
return result;
} catch (err) {
setError(err as Error);
return null;
} finally {
setLoading(false);
}
},
[defaultConfig]
);
return { execute, loading, error };
}
Usage in React component:
function DataFetcher() {
const { execute, loading, error } = useRetryableApi({
maxRetries: 3,
baseDelay: 200,
});
const [data, setData] = useState(null);
const fetchData = async () => {
const result = await execute(() =>
fetch('/api/data').then(res => {
if (!res.ok) throw new Error(`HTTP ${res.status}`);
return res.json();
})
);
if (result) {
setData(result);
}
};
return (
<div>
<button onClick={fetchData} disabled={loading}>
{loading ? 'Loading...' : 'Fetch Data'}
</button>
{error && <div>Error: {error.message}</div>}
{data && <pre>{JSON.stringify(data, null, 2)}</pre>}
</div>
);
}
When to Use
Use exponential backoff with jitter when:
- Calling external APIs with rate limits
- Connecting to databases or message queues
- Making network requests in distributed systems
- Recovering from transient failures
Don’t use when:
- Errors are permanent (404 Not Found, 401 Unauthorized)
- Real-time requirements don’t allow delays
- Idempotency can't be guaranteed (could create duplicates); see the idempotency-key sketch below for one mitigation
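When a write must be retried safely, one common mitigation is an idempotency key: the client generates one key per logical operation and resends it on every attempt, so a deduplicating server applies the write at most once. A hedged Go sketch using the retry package above (the Idempotency-Key header and the server-side deduplication are assumptions about your API, not part of the pattern itself):

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"net/http"
	"strings"
	"time"
)

// newIdempotencyKey returns a random key reused across all retries of a
// single logical operation.
func newIdempotencyKey() string {
	b := make([]byte, 16)
	rand.Read(b) // crypto/rand.Read does not fail on supported platforms
	return hex.EncodeToString(b)
}

func createOrder(ctx context.Context, url, payload string) error {
	cfg := retry.Config{
		BaseDelay:  100 * time.Millisecond,
		MaxDelay:   10 * time.Second,
		MaxRetries: 5,
	}
	key := newIdempotencyKey() // generated once, outside the retry loop
	return retry.WithExponentialBackoff(ctx, cfg, retry.IsRetryable,
		func(ctx context.Context) error {
			// A fresh request (and body) per attempt, but the same key.
			req, err := http.NewRequestWithContext(
				ctx, http.MethodPost, url, strings.NewReader(payload))
			if err != nil {
				return err
			}
			req.Header.Set("Idempotency-Key", key)
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return retry.ErrConnectionFailed
			}
			defer resp.Body.Close()
			if resp.StatusCode >= 500 {
				return retry.ErrServiceUnavailable
			}
			return nil
		})
}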
Trade-offs
Advantages:
- Prevents thundering herds and retry storms
- Gives services time to recover
- Adapts to varying load conditions
- Simple to implement and reason about
Disadvantages:
- Increases latency for failed requests
- Complexity in determining retry budgets
- May hide underlying issues if over-used
- Requires careful tuning for optimal parameters
Best Practices
- Set reasonable max delays: a 30-60 second cap keeps worst-case waits bounded
- Limit total retries: 3-5 attempts is usually sufficient
- Use circuit breakers: Combine with circuit breakers to stop retrying failing services
- Log retry attempts: Monitor retry rates to detect systemic issues
- Make operations idempotent: Ensure retries don’t create duplicate side effects
- Respect retry budgets: Don’t let retries consume all request timeout budget
- Honor server backoff signals: if the server sends a Retry-After header, prefer it over the computed delay (see the sketch below)
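For HTTP clients, 429 and 503 responses often carry a Retry-After header, either as delay-seconds or as an HTTP date. A minimal Go helper for reading it (the function name is illustrative; how you feed the result back into your retry loop depends on your client):

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfterDelay extracts the server-requested delay from a Retry-After
// header. It returns ok=false when the header is absent or unparsable.
func retryAfterDelay(resp *http.Response) (delay time.Duration, ok bool) {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return 0, false
	}
	// Form 1: Retry-After: <delay-seconds>
	if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	// Form 2: Retry-After: <HTTP-date>
	if t, err := http.ParseTime(v); err == nil {
		if d := time.Until(t); d > 0 {
			return d, true
		}
	}
	return 0, false
}

A reasonable policy is to sleep for the server-provided delay when present, still capped by your MaxDelay, and fall back to jittered backoff otherwise.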
Conclusion
Exponential backoff with jitter is a simple yet powerful pattern for building resilient distributed systems. By spreading retries over time and adding randomness, you protect downstream services while maximizing success rates. Implement this pattern consistently across your service mesh, and you’ll see fewer cascading failures and faster recovery from transient issues.