Retries with Exponential Backoff and Jitter

Overview

In distributed systems, transient failures are inevitable: network hiccups, temporary service overload, rate limiting, database connection timeouts. The naive approach of immediately retrying failed requests can worsen the problem, creating thundering herds that amplify outages. Exponential backoff with jitter is a proven pattern that makes retries both resilient and respectful of downstream services.

The Problem: Thundering Herds

When a service experiences a brief outage and recovers, thousands of clients may retry simultaneously, creating a synchronized wave of traffic that overwhelms the recovering service. This retry storm can prevent the service from recovering, creating a self-perpetuating outage.

Example scenario:

A database becomes temporarily unavailable
10,000 application instances fail simultaneously
All instances retry after 1 second
Database receives 10,000 requests simultaneously
Database overloads and crashes again
Cycle repeats

The Solution: Exponential Backoff with Jitter

Exponential backoff increases retry delays exponentially (1s, 2s, 4s, 8s…) to give services time to recover. Jitter adds randomness to prevent synchronized retries.
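To make the effect of jitter concrete, here is a small simulation sketch using only the Python standard library. It assumes 10,000 clients that all fail at t = 0 and retry once; the helper name peak_load, the 1-second window, and the 100 ms bucket size are illustrative choices rather than part of any library.

```python
import random
from collections import Counter

def peak_load(delays, bucket=0.1):
    """Return the largest number of retries that land in one time bucket."""
    buckets = Counter(int(d / bucket) for d in delays)
    return max(buckets.values())

rng = random.Random(42)  # seeded so the run is repeatable
clients = 10_000

# Fixed delay: every client retries exactly 1 second after the failure.
fixed = [1.0] * clients

# Full jitter: every client picks a uniform random delay between 0 and 1 second.
jittered = [rng.uniform(0, 1.0) for _ in range(clients)]

print("peak retries per 100 ms, fixed delay:", peak_load(fixed))
print("peak retries per 100 ms, full jitter:", peak_load(jittered))
```

With the fixed delay, all 10,000 retries fall into a single bucket. With jitter, the same retries are spread across roughly ten buckets, so the peak load the recovering service sees is about an order of magnitude lower.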

Key Components

Base delay: Initial retry delay (e.g., 100ms)
Multiplier: How much to increase each retry (typically 2x)
Max delay: Upper bound to prevent infinite waits (e.g., 30s)
Max attempts: Give up after N retries
Jitter strategy: How to add randomness
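As a worked example of how these parameters interact, the sketch below computes the capped delay before each retry, without jitter. The function name backoff_schedule and the specific values (100 ms base, 2x multiplier, 30 s cap, 10 retries) are assumptions chosen to match the examples in the list above.

```python
def backoff_schedule(base_delay, multiplier, max_delay, max_attempts):
    """Return the capped, un-jittered delay (in seconds) before each retry."""
    return [
        min(max_delay, base_delay * multiplier ** attempt)
        for attempt in range(max_attempts)
    ]

schedule = backoff_schedule(
    base_delay=0.1, multiplier=2, max_delay=30.0, max_attempts=10
)
print(schedule)       # upper bound on the wait before each retry
print(sum(schedule))  # worst-case total time spent waiting across all retries
```

The ninth retry would wait 25.6 s, and the tenth hits the 30 s cap instead of 51.2 s. The sum is a useful sanity check: if it exceeds the caller's own timeout, the later retries will never run.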

Jitter Strategies

Full Jitter (Recommended):

delay = random(0, min(max_delay, base_delay * 2^attempt))

Spreads retries across the entire interval from zero to the capped delay, which minimizes clustering. Best for high-concurrency scenarios.

Equal Jitter:

temp = min(max_delay, base_delay * 2^attempt)
delay = temp/2 + random(0, temp/2)

Guarantees a minimum wait while adding randomness.

Decorrelated Jitter:

delay = min(max_delay, random(base_delay, prev_delay * 3))

Derives each delay from the previous delay rather than from the attempt number, so clients that started in lockstep drift apart with every retry.
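The three formulas can be compared side by side with a short sketch. The function names and the chosen parameters (100 ms base, 30 s cap, attempt 3) are illustrative assumptions. The point is the range each strategy produces, not the exact samples.

```python
import random

def full_jitter(cap, rng):
    # Anywhere between zero and the capped exponential delay.
    return rng.uniform(0, cap)

def equal_jitter(cap, rng):
    # Half the capped delay is guaranteed; the other half is random.
    return cap / 2 + rng.uniform(0, cap / 2)

def decorrelated_jitter(base_delay, prev_delay, max_delay, rng):
    # Depends on the previous delay instead of the attempt number.
    return min(max_delay, rng.uniform(base_delay, prev_delay * 3))

rng = random.Random(7)  # seeded so the run is repeatable
base_delay, max_delay, attempt = 0.1, 30.0, 3
cap = min(max_delay, base_delay * 2 ** attempt)  # 0.8 seconds

full = [full_jitter(cap, rng) for _ in range(10_000)]
equal = [equal_jitter(cap, rng) for _ in range(10_000)]
print(f"full jitter:  min={min(full):.3f} max={max(full):.3f}")
print(f"equal jitter: min={min(equal):.3f} max={max(equal):.3f}")

# Decorrelated jitter is stateful: each delay feeds into the next one.
delay = base_delay
for _ in range(5):
    delay = decorrelated_jitter(base_delay, delay, max_delay, rng)
    print(f"decorrelated: {delay:.3f}")
```

Full jitter samples reach down to nearly zero, while equal jitter never drops below half the cap (0.4 s here). That floor is the trade-off: a guaranteed pause at the cost of a narrower spread.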

Implementation Examples

Go Implementation

package retry

import (
    "context"
    "errors"
    "math"
    "math/rand"
    "time"
)

type Config struct {
    BaseDelay  time.Duration
    MaxDelay   time.Duration
    MaxRetries int
}

// RetryableFunc is a function that can be retried
type RetryableFunc func(ctx context.Context) error

// ShouldRetry determines if an error is retryable
type ShouldRetry func(error) bool

// WithExponentialBackoff retries a function with exponential backoff and jitter
func WithExponentialBackoff(
    ctx context.Context,
    cfg Config,
    shouldRetry ShouldRetry,
    fn RetryableFunc,
) error {
    var lastErr error
    
    for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
        // First attempt doesn't need delay
        if attempt > 0 {
            delay := calculateDelay(cfg, attempt)
            
            select {
            case <-time.After(delay):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        
        lastErr = fn(ctx)
        
        // Success
        if lastErr == nil {
            return nil
        }
        
        // Non-retryable error
        if !shouldRetry(lastErr) {
            return lastErr
        }
        
        // Context cancelled/deadline exceeded
        if ctx.Err() != nil {
            return ctx.Err()
        }
    }
    
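    // errors.Join (Go 1.20+) keeps lastErr in the chain, so callers can still
    // use errors.Is / errors.As on the result; it also tolerates a nil lastErr.
    return errors.Join(errors.New("max retries exceeded"), lastErr)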
}

// calculateDelay implements full jitter exponential backoff
func calculateDelay(cfg Config, attempt int) time.Duration {
    // Calculate exponential delay
    expDelay := float64(cfg.BaseDelay) * math.Pow(2, float64(attempt-1))
    
    // Cap at max delay
    maxJitter := math.Min(expDelay, float64(cfg.MaxDelay))
    
    // Apply full jitter: random between 0 and maxJitter
    jitter := rand.Float64() * maxJitter
    
    return time.Duration(jitter)
}

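// Sentinel errors checked by IsRetryable. Callers are expected to return or
// wrap these from their own operations.
var (
    ErrServiceUnavailable = errors.New("service unavailable")
    ErrRateLimited        = errors.New("rate limited")
    ErrConnectionFailed   = errors.New("connection failed")
)

// IsRetryable reports whether err matches a common retryable error pattern.
func IsRetryable(err error) bool {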
    if err == nil {
        return false
    }
    
    // Retryable conditions
    switch {
    case errors.Is(err, context.DeadlineExceeded):
        return true
    case errors.Is(err, ErrServiceUnavailable):
        return true
    case errors.Is(err, ErrRateLimited):
        return true
    case errors.Is(err, ErrConnectionFailed):
        return true
    default:
        return false
    }
}

Usage example:

func FetchData(ctx context.Context, url string) ([]byte, error) {
    var result []byte
    
    cfg := retry.Config{
        BaseDelay:  100 * time.Millisecond,
        MaxDelay:   10 * time.Second,
        MaxRetries: 5,
    }
    
    err := retry.WithExponentialBackoff(
        ctx,
        cfg,
        retry.IsRetryable,
        func(ctx context.Context) error {
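            var err error
            // httpClient stands in for an application-specific client whose
            // Get method returns ([]byte, error); it is not net/http's client.
            result, err = httpClient.Get(ctx, url)
            return err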
        },
    )
    
    return result, err
}

Python Implementation

import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar
from dataclasses import dataclass

T = TypeVar('T')

@dataclass
class RetryConfig:
    base_delay: float = 0.1  # 100ms
    max_delay: float = 10.0  # 10 seconds
    max_retries: int = 5
    jitter: str = "full"  # "full", "equal", or "decorrelated"

class MaxRetriesExceeded(Exception):
    pass

async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],
    config: RetryConfig,
    should_retry: Optional[Callable[[Exception], bool]] = None,
) -> T:
    """
    Retry a function with exponential backoff and jitter.
    
    Args:
        fn: Async function to retry
        config: Retry configuration
        should_retry: Function to determine if error is retryable
    
    Returns:
        Result of successful function call
    
    Raises:
        MaxRetriesExceeded: When max retries reached
        Exception: When non-retryable error occurs
    """
    last_error = None
    prev_delay = config.base_delay
    
    for attempt in range(config.max_retries + 1):
        try:
            if attempt > 0:
                delay = _calculate_delay(config, attempt, prev_delay)
                await asyncio.sleep(delay)
                prev_delay = delay
            
            return await fn()
        
        except Exception as e:
            last_error = e
            
            # Check if error is retryable
            if should_retry and not should_retry(e):
                raise
            
            # Don't retry on last attempt
            if attempt == config.max_retries:
                raise MaxRetriesExceeded(
                    f"Max retries ({config.max_retries}) exceeded"
                ) from e
    
    # Should never reach here
    raise last_error

def _calculate_delay(
    config: RetryConfig, 
    attempt: int, 
    prev_delay: float
) -> float:
    """Calculate delay with jitter based on strategy."""
    
    # Exponential component
    exp_delay = config.base_delay * (2 ** (attempt - 1))
    
    if config.jitter == "full":
        # Full jitter: random(0, min(max_delay, exp_delay))
        max_jitter = min(config.max_delay, exp_delay)
        return random.uniform(0, max_jitter)
    
    elif config.jitter == "equal":
        # Equal jitter: temp/2 + random(0, temp/2)
        temp = min(config.max_delay, exp_delay)
        return temp / 2 + random.uniform(0, temp / 2)
    
    elif config.jitter == "decorrelated":
        # Decorrelated: random(base_delay, prev_delay * 3)
        delay = random.uniform(config.base_delay, prev_delay * 3)
        return min(config.max_delay, delay)
    
    else:
        raise ValueError(f"Unknown jitter strategy: {config.jitter}")

# Retryable error checker
def is_retryable(error: Exception) -> bool:
    """Determine if an error should be retried."""
    retryable_types = (
        asyncio.TimeoutError,
        ConnectionError,
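        # Add your custom and library-specific exceptions here. For example,
        # aiohttp reports connection failures as aiohttp.ClientConnectionError,
        # which is not a subclass of the built-in ConnectionError.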
    )
    return isinstance(error, retryable_types)

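# Usage example (requires the third-party aiohttp package)
import aiohttp

async def fetch_data(url: str) -> dict:
    """Fetch data with automatic retries."""
    
    async def _fetch():
        async with aiohttp.ClientSession() as session:
            # aiohttp expects a ClientTimeout object rather than a bare number
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
                response.raise_for_status()
                return await response.json()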
    
    config = RetryConfig(
        base_delay=0.1,
        max_delay=10.0,
        max_retries=5,
        jitter="full"
    )
    
    return await retry_with_backoff(_fetch, config, is_retryable)

ReactJS Implementation (API Client)

// retry.ts
interface RetryConfig {
  baseDelay: number;
  maxDelay: number;
  maxRetries: number;
  jitter: 'full' | 'equal' | 'decorrelated';
}

type ShouldRetryFn = (error: Error) => boolean;

export class MaxRetriesExceededError extends Error {
  constructor(message: string, public lastError: Error) {
    super(message);
    this.name = 'MaxRetriesExceededError';
  }
}

export async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  config: RetryConfig,
  shouldRetry?: ShouldRetryFn
): Promise<T> {
  let lastError: Error | null = null;
  let prevDelay = config.baseDelay;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      if (attempt > 0) {
        const delay = calculateDelay(config, attempt, prevDelay);
        await sleep(delay);
        prevDelay = delay;
      }

      return await fn();
    } catch (error) {
      lastError = error as Error;

      // Check if retryable
      if (shouldRetry && !shouldRetry(lastError)) {
        throw error;
      }

      // Last attempt
      if (attempt === config.maxRetries) {
        throw new MaxRetriesExceededError(
          `Max retries (${config.maxRetries}) exceeded`,
          lastError
        );
      }
    }
  }

  throw lastError;
}

function calculateDelay(
  config: RetryConfig,
  attempt: number,
  prevDelay: number
): number {
  const expDelay = config.baseDelay * Math.pow(2, attempt - 1);

  switch (config.jitter) {
    case 'full': {
      const maxJitter = Math.min(config.maxDelay, expDelay);
      return Math.random() * maxJitter;
    }
    case 'equal': {
      const temp = Math.min(config.maxDelay, expDelay);
      return temp / 2 + Math.random() * (temp / 2);
    }
    case 'decorrelated': {
      const delay = Math.random() * (prevDelay * 3 - config.baseDelay) + config.baseDelay;
      return Math.min(config.maxDelay, delay);
    }
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Check if HTTP error is retryable
export function isRetryableHttpError(error: Error): boolean {
  if ('status' in error) {
    const status = (error as any).status;
    // Retry on 5xx server errors and 429 rate limits
    return status >= 500 || status === 429;
  }
  // Retry on network errors
  return error.message.includes('fetch') || error.message.includes('network');
}

// React hook for retryable API calls
import { useState, useCallback, useMemo } from 'react';

export function useRetryableApi<T>(config?: Partial<RetryConfig>) {
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState<Error | null>(null);

  // Read the options as primitives so memoization does not depend on the
  // identity of the caller's config object (often a new literal per render).
  const {
    baseDelay = 100,
    maxDelay = 10000,
    maxRetries = 5,
    jitter = 'full',
  } = config ?? {};

  const mergedConfig = useMemo<RetryConfig>(
    () => ({ baseDelay, maxDelay, maxRetries, jitter }),
    [baseDelay, maxDelay, maxRetries, jitter]
  );

  const execute = useCallback(
    async (fn: () => Promise<T>): Promise<T | null> => {
      setLoading(true);
      setError(null);

      try {
        const result = await retryWithBackoff(
          fn,
          mergedConfig,
          isRetryableHttpError
        );
        return result;
      } catch (err) {
        setError(err as Error);
        return null;
      } finally {
        setLoading(false);
      }
    },
    [mergedConfig]
  );

  return { execute, loading, error };
}

Usage in React component:

function DataFetcher() {
  const { execute, loading, error } = useRetryableApi({
    maxRetries: 3,
    baseDelay: 200,
  });

  const [data, setData] = useState(null);

  const fetchData = async () => {
    const result = await execute(() =>
      fetch('/api/data').then(res => {
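        if (!res.ok) {
          // Attach the status code so isRetryableHttpError can classify the failure;
          // a bare Error with only a message would never be retried.
          throw Object.assign(new Error(`HTTP ${res.status}`), { status: res.status });
        }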
        return res.json();
      })
    );
    
    if (result) {
      setData(result);
    }
  };

  return (
    <div>
      <button onClick={fetchData} disabled={loading}>
        {loading ? 'Loading...' : 'Fetch Data'}
      </button>
      {error && <div>Error: {error.message}</div>}
      {data && <pre>{JSON.stringify(data, null, 2)}</pre>}
    </div>
  );
}

When to Use

Use exponential backoff with jitter when:

Calling external APIs with rate limits
Connecting to databases or message queues
Making network requests in distributed systems
Recovering from transient failures

Don’t use when:

Errors are permanent (404 Not Found, 401 Unauthorized)
Real-time requirements don’t allow delays
Idempotency can’t be guaranteed (could create duplicates)

Trade-offs

Advantages:

Prevents thundering herds and retry storms
Gives services time to recover
Adapts to varying load conditions
Simple to implement and reason about

Disadvantages:

Increases latency for failed requests
Complexity in determining retry budgets
May hide underlying issues if over-used
Requires careful tuning for optimal parameters

Best Practices

Set reasonable max delays: a cap of 30-60 seconds keeps waits bounded
Limit total retries: 3-5 attempts is usually sufficient
Use circuit breakers: Combine with circuit breakers to stop retrying failing services
Log retry attempts: Monitor retry rates to detect systemic issues
Make operations idempotent: Ensure retries don’t create duplicate side effects
Respect retry budgets: Don’t let retries consume the entire request timeout budget
Honor server backoff signals: If the server provides a Retry-After header, use it
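The last two practices can be combined in a small helper. The sketch below is one possible approach, not a standard API: the names parse_retry_after and next_delay are hypothetical, and it assumes the caller tracks how many seconds remain in its own timeout budget. A Retry-After header carries either a number of seconds or an HTTP date, so both forms are handled.

```python
import time
from email.utils import parsedate_to_datetime
from typing import Optional

def parse_retry_after(value: str, now: Optional[float] = None) -> Optional[float]:
    """Convert a Retry-After header value to seconds, or None if unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay given directly in seconds
    try:
        target = parsedate_to_datetime(value)  # delay given as an HTTP date
    except (TypeError, ValueError):
        return None
    now = time.time() if now is None else now
    return max(0.0, target.timestamp() - now)

def next_delay(
    backoff: float,
    retry_after: Optional[float],
    remaining_budget: float,
) -> Optional[float]:
    """Pick the wait before the next retry, or None to give up."""
    # Never retry sooner than the server asked for.
    delay = backoff if retry_after is None else max(backoff, retry_after)
    # Waiting longer than the remaining budget cannot succeed, so stop early.
    if delay >= remaining_budget:
        return None
    return delay

print(parse_retry_after("120"))
print(next_delay(backoff=1.0, retry_after=5.0, remaining_budget=10.0))
print(next_delay(backoff=1.0, retry_after=20.0, remaining_budget=10.0))
```

Taking the larger of the computed backoff and the server's hint means the client never retries earlier than requested, and returning None when the wait would exceed the budget lets the caller fail fast instead of sleeping past its own deadline.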

Conclusion

Exponential backoff with jitter is a simple yet powerful pattern for building resilient distributed systems. By spreading retries over time and adding randomness, you protect downstream services while maximizing success rates. Implement this pattern consistently across your service mesh, and you’ll see fewer cascading failures and faster recovery from transient issues.

2025-11-04
