Retries with Exponential Backoff and Jitter

Overview

In distributed systems, transient failures are inevitable: network hiccups, temporary service overload, rate limiting, and database connection timeouts. The naive approach of retrying failed requests immediately can worsen the problem, creating thundering herds that amplify outages. Exponential backoff with jitter is a proven pattern that makes retries both resilient and respectful of downstream services.

The Problem: Thundering Herds

When a service experiences a brief outage and recovers, thousands of clients may retry simultaneously, creating a synchronized wave of traffic that overwhelms the recovering service. This retry storm can prevent the service from recovering, creating a self-perpetuating outage.

Example scenario:

  1. A database becomes temporarily unavailable
  2. 10,000 application instances fail simultaneously
  3. All instances retry after 1 second
  4. Database receives 10,000 requests simultaneously
  5. Database overloads and crashes again
  6. Cycle repeats
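The scenario above can be made concrete with a small simulation. This is only a sketch: the 100 ms bucket size, the seed, and the helper name peak_load are assumptions chosen for illustration, not measurements from a real system. It counts how many retries land in the same time slice when every client waits a fixed second versus a random wait of up to one second.

```python
import random
from collections import Counter

def peak_load(delays, bucket_seconds=0.1):
    """Return the largest number of retries that land in one time bucket."""
    buckets = Counter(int(d / bucket_seconds) for d in delays)
    return max(buckets.values())

rng = random.Random(42)  # seeded so the run is reproducible
clients = 10_000

# Every client waits exactly 1 second: all retries land in the same bucket.
fixed = [1.0] * clients

# Each client instead picks a uniformly random wait of up to 1 second.
jittered = [rng.uniform(0, 1.0) for _ in range(clients)]

fixed_peak = peak_load(fixed)
jittered_peak = peak_load(jittered)
print("fixed delay peak:", fixed_peak)
print("jittered peak:", jittered_peak)
```

With the fixed delay the peak equals the full client count, while the randomized waits spread the same retries across roughly ten buckets.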

The Solution: Exponential Backoff with Jitter

Exponential backoff multiplies the delay after each failed attempt (1s, 2s, 4s, 8s…), giving the service progressively more time to recover. Jitter adds randomness to each delay so that clients do not retry in lockstep.
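As a minimal sketch of the backoff half of the pattern, the following computes the un-jittered delay before each retry. The function name backoff_schedule and the 30-second cap are illustrative assumptions.

```python
def backoff_schedule(base_delay, max_delay, retries, multiplier=2.0):
    """Un-jittered delay before each retry, capped at max_delay."""
    return [min(max_delay, base_delay * multiplier ** n) for n in range(retries)]

schedule = backoff_schedule(base_delay=1.0, max_delay=30.0, retries=7)
print(schedule)
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

The delays double until they hit the cap and then stay there. Jitter is applied on top of these values.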

Key Components

  1. Base delay: Initial retry delay (e.g., 100ms)
  2. Multiplier: How much to increase each retry (typically 2x)
  3. Max delay: Upper bound on any single delay, so waits cannot grow without limit (e.g., 30s)
  4. Max attempts: Give up after N retries
  5. Jitter strategy: How to add randomness
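One way to sanity-check a combination of these components is to compute the longest a caller could spend sleeping. The sketch below assumes that max attempts counts retries (one sleep per retry); worst_case_wait is a hypothetical helper name.

```python
def worst_case_wait(base_delay, multiplier, max_delay, max_attempts):
    """Upper bound on total sleep time: the sum of the capped, un-jittered delays."""
    return sum(
        min(max_delay, base_delay * multiplier ** n)
        for n in range(max_attempts)
    )

total = worst_case_wait(base_delay=0.1, multiplier=2, max_delay=30.0, max_attempts=5)
print(f"{total:.1f}s")  # 0.1 + 0.2 + 0.4 + 0.8 + 1.6
```

Under full or equal jitter each actual sleep is at most the un-jittered value, so this sum is a ceiling to compare against the caller's own timeout.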

Jitter Strategies

Full Jitter (Recommended):

delay = random(0, min(max_delay, base_delay * 2^attempt))

Spreads retries across the entire interval from zero to the capped delay, which makes it the best fit for high-concurrency scenarios.

Equal Jitter:

temp = min(max_delay, base_delay * 2^attempt)
delay = temp/2 + random(0, temp/2)

Guarantees a minimum wait while adding randomness.

Decorrelated Jitter:

delay = min(max_delay, random(base_delay, prev_delay * 3))

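Draws each delay from a range based on the previous delay rather than the attempt number, which creates more variation between retries and further reduces correlation between clients.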
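To compare the three strategies side by side, it can help to look at the range each one draws the next delay from. The sketch below follows the formulas above; jitter_bounds is a hypothetical helper, and attempt counts retries starting at 0.

```python
def jitter_bounds(strategy, attempt, base_delay, max_delay, prev_delay):
    """Return (low, high) of the range the next delay is drawn from."""
    cap = min(max_delay, base_delay * 2 ** attempt)
    if strategy == "full":
        return (0.0, cap)
    if strategy == "equal":
        return (cap / 2, cap)
    if strategy == "decorrelated":
        # Depends on the previous delay, not on the attempt number.
        return (min(max_delay, base_delay), min(max_delay, prev_delay * 3))
    raise ValueError(f"unknown strategy: {strategy}")

for name in ("full", "equal", "decorrelated"):
    print(name, jitter_bounds(name, attempt=3, base_delay=1.0,
                              max_delay=30.0, prev_delay=4.0))
# full (0.0, 8.0)
# equal (4.0, 8.0)
# decorrelated (1.0, 12.0)
```

Full jitter can return a delay close to zero, equal jitter never drops below half the capped value, and decorrelated jitter can exceed the exponential value for the same attempt.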

Implementation Examples

Go Implementation

package retry

import (
    "context"
    "errors"
    "fmt"
    "math"
    "math/rand"
    "time"
)

type Config struct {
    BaseDelay  time.Duration
    MaxDelay   time.Duration
    MaxRetries int
}

// RetryableFunc is a function that can be retried
type RetryableFunc func(ctx context.Context) error

// ShouldRetry determines if an error is retryable
type ShouldRetry func(error) bool

// WithExponentialBackoff retries a function with exponential backoff and jitter.
// fn runs at most cfg.MaxRetries+1 times: one initial attempt plus the retries.
func WithExponentialBackoff(
    ctx context.Context,
    cfg Config,
    shouldRetry ShouldRetry,
    fn RetryableFunc,
) error {
    var lastErr error

    for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
        // First attempt doesn't need delay
        if attempt > 0 {
            // Use an explicit timer so it can be stopped if the context ends first
            timer := time.NewTimer(calculateDelay(cfg, attempt))

            select {
            case <-timer.C:
            case <-ctx.Done():
                timer.Stop()
                return ctx.Err()
            }
        }

        lastErr = fn(ctx)

        // Success
        if lastErr == nil {
            return nil
        }

        // Non-retryable error
        if !shouldRetry(lastErr) {
            return lastErr
        }

        // Context cancelled/deadline exceeded
        if ctx.Err() != nil {
            return ctx.Err()
        }
    }

    // Wrap with %w so callers can still match the cause via errors.Is/errors.As
    return fmt.Errorf("max retries exceeded: %w", lastErr)
}

// calculateDelay implements full jitter exponential backoff
func calculateDelay(cfg Config, attempt int) time.Duration {
    // Calculate exponential delay
    expDelay := float64(cfg.BaseDelay) * math.Pow(2, float64(attempt-1))
    
    // Cap at max delay
    maxJitter := math.Min(expDelay, float64(cfg.MaxDelay))
    
    // Apply full jitter: random between 0 and maxJitter
    jitter := rand.Float64() * maxJitter
    
    return time.Duration(jitter)
}

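// Sentinel errors checked by IsRetryable. These are illustrative placeholders:
// map them to the errors your own clients actually return.
var (
    ErrServiceUnavailable = errors.New("service unavailable")
    ErrRateLimited        = errors.New("rate limited")
    ErrConnectionFailed   = errors.New("connection failed")
)

// IsRetryable reports whether err matches a common retryable error pattern
func IsRetryable(err error) bool {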
    if err == nil {
        return false
    }
    
    // Retryable conditions
    switch {
    case errors.Is(err, context.DeadlineExceeded):
        return true
    case errors.Is(err, ErrServiceUnavailable):
        return true
    case errors.Is(err, ErrRateLimited):
        return true
    case errors.Is(err, ErrConnectionFailed):
        return true
    default:
        return false
    }
}

Usage example:

func FetchData(ctx context.Context, url string) ([]byte, error) {
    var result []byte
    
    cfg := retry.Config{
        BaseDelay:  100 * time.Millisecond,
        MaxDelay:   10 * time.Second,
        MaxRetries: 5,
    }
    
    err := retry.WithExponentialBackoff(
        ctx,
        cfg,
        retry.IsRetryable,
        func(ctx context.Context) error {
            var err error
            // httpClient stands in for your own client; Get is assumed
            // to return ([]byte, error)
            result, err = httpClient.Get(ctx, url)
            return err
        },
    )
    
    return result, err
}

Python Implementation

import asyncio
import random
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar('T')

@dataclass
class RetryConfig:
    base_delay: float = 0.1  # 100ms
    max_delay: float = 10.0  # 10 seconds
    max_retries: int = 5
    jitter: str = "full"  # "full", "equal", or "decorrelated"

class MaxRetriesExceeded(Exception):
    pass

async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],
    config: RetryConfig,
    should_retry: Optional[Callable[[Exception], bool]] = None,
) -> T:
    """
    Retry a function with exponential backoff and jitter.
    
    Args:
        fn: Async function to retry
        config: Retry configuration
        should_retry: Function to determine if error is retryable
    
    Returns:
        Result of successful function call
    
    Raises:
        MaxRetriesExceeded: When max retries reached
        Exception: When non-retryable error occurs
    """
    last_error = None
    prev_delay = config.base_delay
    
    for attempt in range(config.max_retries + 1):
        try:
            if attempt > 0:
                delay = _calculate_delay(config, attempt, prev_delay)
                await asyncio.sleep(delay)
                prev_delay = delay
            
            return await fn()
        
        except Exception as e:
            last_error = e
            
            # Check if error is retryable
            if should_retry and not should_retry(e):
                raise
            
            # Don't retry on last attempt
            if attempt == config.max_retries:
                raise MaxRetriesExceeded(
                    f"Max retries ({config.max_retries}) exceeded"
                ) from e
    
    # Only reachable if the loop never ran, i.e. max_retries is negative
    raise ValueError("max_retries must be non-negative")

def _calculate_delay(
    config: RetryConfig, 
    attempt: int, 
    prev_delay: float
) -> float:
    """Calculate delay with jitter based on strategy."""
    
    # Exponential component
    exp_delay = config.base_delay * (2 ** (attempt - 1))
    
    if config.jitter == "full":
        # Full jitter: random(0, min(max_delay, exp_delay))
        max_jitter = min(config.max_delay, exp_delay)
        return random.uniform(0, max_jitter)
    
    elif config.jitter == "equal":
        # Equal jitter: temp/2 + random(0, temp/2)
        temp = min(config.max_delay, exp_delay)
        return temp / 2 + random.uniform(0, temp / 2)
    
    elif config.jitter == "decorrelated":
        # Decorrelated: random(base_delay, prev_delay * 3)
        delay = random.uniform(config.base_delay, prev_delay * 3)
        return min(config.max_delay, delay)
    
    else:
        raise ValueError(f"Unknown jitter strategy: {config.jitter}")

# Retryable error checker
def is_retryable(error: Exception) -> bool:
    """Determine if an error should be retried."""
    retryable_types = (
        asyncio.TimeoutError,
        ConnectionError,
        # Add your custom exceptions here
    )
    return isinstance(error, retryable_types)

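# Usage example (requires the third-party aiohttp package)
import aiohttp

def should_retry_http(error: Exception) -> bool:
    """Extend is_retryable with aiohttp errors, which it does not cover on its own."""
    if isinstance(error, aiohttp.ClientResponseError):
        # Retry on 5xx server errors and 429 rate limits
        return error.status >= 500 or error.status == 429
    return is_retryable(error) or isinstance(error, aiohttp.ClientConnectionError)

async def fetch_data(url: str) -> dict:
    """Fetch data with automatic retries."""

    async def _fetch():
        async with aiohttp.ClientSession() as session:
            timeout = aiohttp.ClientTimeout(total=5)
            async with session.get(url, timeout=timeout) as response:
                response.raise_for_status()
                return await response.json()

    config = RetryConfig(
        base_delay=0.1,
        max_delay=10.0,
        max_retries=5,
        jitter="full"
    )

    return await retry_with_backoff(_fetch, config, should_retry_http)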

ReactJS Implementation (API Client)

// retry.ts
interface RetryConfig {
  baseDelay: number;
  maxDelay: number;
  maxRetries: number;
  jitter: 'full' | 'equal' | 'decorrelated';
}

type ShouldRetryFn = (error: Error) => boolean;

export class MaxRetriesExceededError extends Error {
  constructor(message: string, public lastError: Error) {
    super(message);
    this.name = 'MaxRetriesExceededError';
  }
}

export async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  config: RetryConfig,
  shouldRetry?: ShouldRetryFn
): Promise<T> {
  let lastError: Error | null = null;
  let prevDelay = config.baseDelay;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      if (attempt > 0) {
        const delay = calculateDelay(config, attempt, prevDelay);
        await sleep(delay);
        prevDelay = delay;
      }

      return await fn();
    } catch (error) {
      lastError = error as Error;

      // Check if retryable
      if (shouldRetry && !shouldRetry(lastError)) {
        throw error;
      }

      // Last attempt
      if (attempt === config.maxRetries) {
        throw new MaxRetriesExceededError(
          `Max retries (${config.maxRetries}) exceeded`,
          lastError
        );
      }
    }
  }

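  // Only reachable if the loop never ran, i.e. maxRetries is negative
  throw new Error('maxRetries must be non-negative');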
}

function calculateDelay(
  config: RetryConfig,
  attempt: number,
  prevDelay: number
): number {
  const expDelay = config.baseDelay * Math.pow(2, attempt - 1);

  switch (config.jitter) {
    case 'full': {
      const maxJitter = Math.min(config.maxDelay, expDelay);
      return Math.random() * maxJitter;
    }
    case 'equal': {
      const temp = Math.min(config.maxDelay, expDelay);
      return temp / 2 + Math.random() * (temp / 2);
    }
    case 'decorrelated': {
      const delay = Math.random() * (prevDelay * 3 - config.baseDelay) + config.baseDelay;
      return Math.min(config.maxDelay, delay);
    }
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Check if HTTP error is retryable
export function isRetryableHttpError(error: Error): boolean {
  if ('status' in error) {
    const status = (error as any).status;
    // Retry on 5xx server errors and 429 rate limits
    return status >= 500 || status === 429;
  }
  // Retry on network errors
  return error.message.includes('fetch') || error.message.includes('network');
}

// React hook for retryable API calls
import { useState, useCallback, useMemo } from 'react';

export function useRetryableApi<T>(config?: Partial<RetryConfig>) {
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState<Error | null>(null);

  // Memoize on the individual fields so `execute` keeps a stable identity even
  // when the caller passes a new config object literal on every render
  const mergedConfig = useMemo<RetryConfig>(
    () => ({
      baseDelay: config?.baseDelay ?? 100,
      maxDelay: config?.maxDelay ?? 10000,
      maxRetries: config?.maxRetries ?? 5,
      jitter: config?.jitter ?? 'full',
    }),
    [config?.baseDelay, config?.maxDelay, config?.maxRetries, config?.jitter]
  );

  const execute = useCallback(
    async (fn: () => Promise<T>): Promise<T | null> => {
      setLoading(true);
      setError(null);

      try {
        return await retryWithBackoff(fn, mergedConfig, isRetryableHttpError);
      } catch (err) {
        setError(err as Error);
        return null;
      } finally {
        setLoading(false);
      }
    },
    [mergedConfig]
  );

  return { execute, loading, error };
}

Usage in React component:

function DataFetcher() {
  const { execute, loading, error } = useRetryableApi({
    maxRetries: 3,
    baseDelay: 200,
  });

  const [data, setData] = useState(null);

  const fetchData = async () => {
    const result = await execute(() =>
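      fetch('/api/data').then(res => {
        // Attach the status so isRetryableHttpError can recognize 5xx and 429
        if (!res.ok) {
          throw Object.assign(new Error(`HTTP ${res.status}`), { status: res.status });
        }
        return res.json();
      })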
    );
    
    if (result) {
      setData(result);
    }
  };

  return (
    <div>
      <button onClick={fetchData} disabled={loading}>
        {loading ? 'Loading...' : 'Fetch Data'}
      </button>
      {error && <div>Error: {error.message}</div>}
      {data && <pre>{JSON.stringify(data, null, 2)}</pre>}
    </div>
  );
}

When to Use

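Use exponential backoff with jitter when:

  1. Failures are likely transient: network hiccups, temporary overload, rate limiting, connection timeouts
  2. Many clients may retry against the same downstream service at once
  3. The operation is idempotent, so repeating it is safe

Don’t use when:

  1. The error is not retryable, so repeating the request cannot succeed
  2. The operation is not idempotent and a retry could duplicate side effects
  3. The remaining request timeout budget cannot absorb the added delay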

Trade-offs

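Advantages:

  1. Gives an overloaded service time to recover
  2. Spreads retries over time instead of sending them in synchronized waves
  3. Needs only a handful of parameters to configure

Disadvantages:

  1. Adds latency to requests that eventually succeed
  2. Is only safe when operations are idempotent
  3. Does not by itself stop retries against a service that stays down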

Best Practices

  1. Set reasonable max delays: A cap of 30-60 seconds keeps individual waits bounded
  2. Limit total retries: 3-5 attempts is usually sufficient
  3. Use circuit breakers: Combine retries with circuit breakers to stop retrying services that stay down
  4. Log retry attempts: Monitor retry rates to detect systemic issues
  5. Make operations idempotent: Ensure retries don’t create duplicate side effects
  6. Respect retry budgets: Don’t let retries consume the entire request timeout budget
  7. Honor server backoff signals: If the server provides a Retry-After header, use it
Conclusion

Exponential backoff with jitter is a simple yet powerful pattern for building resilient distributed systems. By spreading retries over time and adding randomness, you protect downstream services while maximizing success rates. Implement this pattern consistently across your service mesh, and you’ll see fewer cascading failures and faster recovery from transient issues.