Saga Pattern for Distributed Transactions
Overview
The Saga pattern is a design pattern for managing distributed transactions across multiple microservices without relying on traditional two-phase commit protocols. Instead of a single atomic transaction, a saga breaks down a business transaction into a sequence of local transactions, each with a corresponding compensating transaction to undo changes if something goes wrong.
In modern distributed systems, maintaining ACID properties across service boundaries is difficult and often impractical. The Saga pattern provides an alternative that embraces eventual consistency while preserving the integrity of the overall business operation.
When to Use
The Saga pattern is ideal when:
- Multiple microservices need to participate in a single business transaction
- Strong consistency across services is not feasible or would harm performance
- Long-running transactions spanning seconds or minutes are common
- Service autonomy is prioritized over strict data consistency
- Compensating actions can be defined for each step in the process
Avoid this pattern when:
- Transactions complete in milliseconds and touch a single database
- Strong consistency is a hard business requirement
- Compensating transactions are impossible to define
- The overhead of saga coordination outweighs the benefits
Two Approaches: Choreography vs Orchestration
Choreography-Based Saga
Services emit events and subscribe to events from other services, coordinating through a message bus without a central controller; a minimal sketch follows the pros and cons below.
Pros:
- No single point of failure
- Loose coupling between services
- Scales well with many services
Cons:
- Harder to understand and debug the end-to-end flow
- Risk of cyclic dependencies between services
- Difficult to track overall saga state
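To make the difference concrete, here is a minimal, illustrative sketch of one choreographed participant in Go. The Event, EventBus, and chargeCustomer names are assumptions for the example, not a specific messaging library:

package choreography

import "context"

// Event and EventBus stand in for whatever message broker client the
// services actually use (Kafka, NATS, etc.).
type Event struct {
	Type    string
	OrderID string
}

type EventBus interface {
	Publish(ctx context.Context, e Event) error
	Subscribe(eventType string, handler func(ctx context.Context, e Event))
}

// chargeCustomer is a hypothetical call into the payment provider.
func chargeCustomer(ctx context.Context, orderID string) error { return nil }

// RegisterPaymentHandlers wires the payment service into the saga:
// it reacts to inventory events and emits its own, with no central
// coordinator involved.
func RegisterPaymentHandlers(bus EventBus) {
	bus.Subscribe("InventoryReserved", func(ctx context.Context, e Event) {
		if err := chargeCustomer(ctx, e.OrderID); err != nil {
			// The inventory service subscribes to PaymentFailed and
			// releases its reservation (the compensating action).
			_ = bus.Publish(ctx, Event{Type: "PaymentFailed", OrderID: e.OrderID})
			return
		}
		_ = bus.Publish(ctx, Event{Type: "PaymentCompleted", OrderID: e.OrderID})
	})
}

Each service only knows which events it consumes and emits; the overall order flow emerges from these local rules, which is what makes choreography scale well but also makes it harder to trace.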
Orchestration-Based Saga
A central orchestrator explicitly calls services and manages the saga flow.
Pros:
- Clearer business logic flow
- Easier to test and debug
- Centralized saga state management
Cons:
- Orchestrator is a single point of failure
- Tighter coupling between orchestrator and services
- Orchestrator can become a bottleneck
Implementation Example: Order Processing with Go
Let’s implement an orchestration-based saga for an e-commerce order that involves:
- Reserve inventory
- Process payment
- Create shipment
package saga
import (
"context"
"fmt"
)
// SagaStep represents a single step in the saga
type SagaStep struct {
Name string
Action func(ctx context.Context, data interface{}) error
Compensate func(ctx context.Context, data interface{}) error
}
// Saga orchestrates a series of steps with compensation
type Saga struct {
steps []SagaStep
}
// NewSaga creates a new saga instance
func NewSaga() *Saga {
return &Saga{
steps: make([]SagaStep, 0),
}
}
// AddStep adds a step to the saga
func (s *Saga) AddStep(step SagaStep) *Saga {
s.steps = append(s.steps, step)
return s
}
// Execute runs the saga, compensating on failure
func (s *Saga) Execute(ctx context.Context, data interface{}) error {
completedSteps := make([]int, 0)
// Execute forward flow
for i, step := range s.steps {
fmt.Printf("Executing step: %s\n", step.Name)
if err := step.Action(ctx, data); err != nil {
fmt.Printf("Step %s failed: %v\n", step.Name, err)
// Compensate in reverse order
s.compensate(ctx, data, completedSteps)
return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
}
completedSteps = append(completedSteps, i)
}
fmt.Println("Saga completed successfully")
return nil
}
// compensate runs compensation in reverse order
func (s *Saga) compensate(ctx context.Context, data interface{}, completedSteps []int) {
fmt.Println("Starting compensation...")
// Compensate in reverse order
for i := len(completedSteps) - 1; i >= 0; i-- {
step := s.steps[completedSteps[i]]
fmt.Printf("Compensating step: %s\n", step.Name)
if err := step.Compensate(ctx, data); err != nil {
// Log compensation failure - this is critical
fmt.Printf("CRITICAL: Compensation failed for %s: %v\n", step.Name, err)
// In production, alert operations team
}
}
}
// OrderData represents the saga data
type OrderData struct {
OrderID string
UserID string
Items []string
PaymentID string
ShipmentID string
InventoryReserved bool
}
// Example usage with order processing
func ProcessOrder(ctx context.Context, order *OrderData) error {
saga := NewSaga().
AddStep(SagaStep{
Name: "ReserveInventory",
Action: func(ctx context.Context, data interface{}) error {
order := data.(*OrderData)
// Call inventory service
fmt.Printf("Reserving inventory for order %s\n", order.OrderID)
// Simulate success
order.InventoryReserved = true
return nil
},
Compensate: func(ctx context.Context, data interface{}) error {
order := data.(*OrderData)
fmt.Printf("Releasing inventory for order %s\n", order.OrderID)
// Call inventory service to release
order.InventoryReserved = false
return nil
},
}).
AddStep(SagaStep{
Name: "ProcessPayment",
Action: func(ctx context.Context, data interface{}) error {
order := data.(*OrderData)
fmt.Printf("Processing payment for order %s\n", order.OrderID)
// Call payment service
order.PaymentID = "pay_12345"
return nil
},
Compensate: func(ctx context.Context, data interface{}) error {
order := data.(*OrderData)
fmt.Printf("Refunding payment %s\n", order.PaymentID)
// Call payment service to refund
return nil
},
}).
AddStep(SagaStep{
Name: "CreateShipment",
Action: func(ctx context.Context, data interface{}) error {
order := data.(*OrderData)
fmt.Printf("Creating shipment for order %s\n", order.OrderID)
// Call shipping service
order.ShipmentID = "ship_67890"
return nil
},
Compensate: func(ctx context.Context, data interface{}) error {
order := data.(*OrderData)
fmt.Printf("Cancelling shipment %s\n", order.ShipmentID)
// Call shipping service to cancel
return nil
},
})
return saga.Execute(ctx, order)
}
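For completeness, a caller might drive the saga like this; runOrderSaga and the order values are illustrative, not part of the pattern itself:

// runOrderSaga is a hypothetical caller that builds an order and runs the
// saga, relying on Execute to compensate any completed steps on failure.
func runOrderSaga(ctx context.Context) {
	order := &OrderData{
		OrderID: "ord_123",
		UserID:  "user_456",
		Items:   []string{"sku_1", "sku_2"},
	}
	if err := ProcessOrder(ctx, order); err != nil {
		fmt.Printf("order failed and was compensated: %v\n", err)
		return
	}
	fmt.Printf("order %s completed: payment=%s shipment=%s\n",
		order.OrderID, order.PaymentID, order.ShipmentID)
}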
Python Implementation with Async/Await
For Python-based microservices, here’s an async saga implementation:
from typing import Callable, List, Any
from dataclasses import dataclass
import asyncio
@dataclass
class SagaStep:
name: str
action: Callable
compensate: Callable
class AsyncSaga:
def __init__(self):
self.steps: List[SagaStep] = []
def add_step(self, step: SagaStep) -> 'AsyncSaga':
self.steps.append(step)
return self
async def execute(self, context: dict) -> bool:
completed_steps = []
try:
for step in self.steps:
print(f"Executing: {step.name}")
await step.action(context)
completed_steps.append(step)
print("Saga completed successfully")
return True
except Exception as e:
print(f"Saga failed: {e}")
await self._compensate(context, completed_steps)
return False
async def _compensate(self, context: dict, completed_steps: List[SagaStep]):
print("Starting compensation...")
for step in reversed(completed_steps):
try:
print(f"Compensating: {step.name}")
await step.compensate(context)
except Exception as e:
print(f"CRITICAL: Compensation failed for {step.name}: {e}")
# Alert operations team
# Example usage
async def reserve_inventory(context: dict):
print(f"Reserving inventory for order {context['order_id']}")
# Call inventory service
context['inventory_reserved'] = True
async def release_inventory(context: dict):
print(f"Releasing inventory for order {context['order_id']}")
context['inventory_reserved'] = False
async def process_payment(context: dict):
print(f"Processing payment for order {context['order_id']}")
context['payment_id'] = 'pay_12345'
async def refund_payment(context: dict):
    print(f"Refunding payment {context['payment_id']}")
    # Call payment service to refund

# Wiring the steps together mirrors the Go example:
async def main():
    saga = (
        AsyncSaga()
        .add_step(SagaStep("ReserveInventory", reserve_inventory, release_inventory))
        .add_step(SagaStep("ProcessPayment", process_payment, refund_payment))
    )
    await saga.execute({"order_id": "ord_123"})

asyncio.run(main())
Key Trade-offs
Benefits
- Service autonomy: Each service owns its data and logic
- Scalability: No distributed locks or global transactions
- Resilience: Failure in one service doesn’t block others indefinitely
- Flexibility: Easy to add new steps or modify existing flows
Challenges
- Eventual consistency: System may be temporarily inconsistent
- Complexity: More code and coordination logic required
- Compensating transactions: Must be carefully designed and idempotent
- Debugging: Harder to trace and understand failures across services
- Partial failures: The system must handle inconsistent states gracefully
Best Practices
- Make compensations idempotent: Compensation may be called more than once, so repeated calls must be safe (see the sketch after this list)
- Log everything: Comprehensive logging is essential for debugging distributed sagas
- Use correlation IDs: Track the entire saga flow across services
- Handle compensation failures: Have alerts and manual intervention procedures
- Design for failure: Assume any step can fail at any time
- Timeout management: Set appropriate timeouts for each step (also covered in the sketch below)
- Monitoring: Track saga success rates, duration, and failure points
- State persistence: Store saga state to recover from orchestrator failures
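Two of these practices translate directly into code: per-step timeouts and idempotent compensation. The sketch below extends the Go saga package from earlier; WithTimeout, CompensatePayment, and refundPayment are illustrative names, and the package would additionally need to import the standard time package:

// WithTimeout wraps a step so that its Action and Compensate each run
// under their own deadline.
func WithTimeout(step SagaStep, d time.Duration) SagaStep {
	wrap := func(fn func(context.Context, interface{}) error) func(context.Context, interface{}) error {
		return func(ctx context.Context, data interface{}) error {
			ctx, cancel := context.WithTimeout(ctx, d)
			defer cancel()
			return fn(ctx, data)
		}
	}
	return SagaStep{Name: step.Name, Action: wrap(step.Action), Compensate: wrap(step.Compensate)}
}

// refundPayment is a hypothetical call into the payment service.
func refundPayment(ctx context.Context, paymentID string) error { return nil }

// CompensatePayment is safe to call more than once: a cleared PaymentID
// means there is nothing left to undo. In a real system the payment
// provider would also be given an idempotency key.
func CompensatePayment(ctx context.Context, data interface{}) error {
	order := data.(*OrderData)
	if order.PaymentID == "" {
		return nil // never charged, or already refunded: no-op
	}
	if err := refundPayment(ctx, order.PaymentID); err != nil {
		return err
	}
	order.PaymentID = "" // record that the refund happened
	return nil
}

A step can then be registered as saga.AddStep(WithTimeout(step, 5*time.Second)), and the payment step's Compensate field can point at CompensatePayment.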
Monitoring and Observability
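A thin wrapper around Execute can capture basic timing data for each saga run and report it to whatever monitoring system is in use: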
// SagaMetrics captures basic timing data for a single saga run.
// Note: this assumes the standard "time" package is imported alongside
// the packages shown earlier.
type SagaMetrics struct {
	StartTime      time.Time
	CompletedSteps int    // left unset in this simplified wrapper
	FailedStep     string // left unset in this simplified wrapper
	TotalDuration  time.Duration
}

// ExecuteWithMetrics wraps Execute and records how long the saga took.
func (s *Saga) ExecuteWithMetrics(ctx context.Context, data interface{}) (SagaMetrics, error) {
	metrics := SagaMetrics{
		StartTime: time.Now(),
	}
	err := s.Execute(ctx, data)
	metrics.TotalDuration = time.Since(metrics.StartTime)
	// Record metrics to the monitoring system here
	return metrics, err
}
Conclusion
The Saga pattern is a powerful tool for building reliable distributed systems that need to coordinate business transactions across multiple microservices. While it introduces complexity through compensation logic and eventual consistency, it provides the scalability and resilience required for modern cloud-native applications. Choreography suits simpler flows between a small number of loosely coupled services that prioritize autonomy, while orchestration pays off as workflows grow more complex and benefit from a clear, centralized definition of the business logic.