# Circuit Breaker Protection

## Overview

The Karpenter IBM Cloud Provider includes a circuit breaker implementation to prevent scale-up storms and protect against cascading failures during node provisioning.

## Problem Statement

In production environments, bootstrap failures or API issues can cause Karpenter to continuously create new instances that fail to join the cluster.

## Circuit Breaker Implementation

### Core Components

The circuit breaker is implemented in `/pkg/cloudprovider/circuitbreaker.go` and provides:

1. **State Management**: CLOSED → OPEN → HALF_OPEN transitions
2. **Rate Limiting**: Maximum instances per minute
3. **Concurrency Control**: Maximum concurrent provisioning operations
4. **Failure Detection**: Configurable failure thresholds
5. **Automatic Recovery**: Time-based recovery with testing

### Configuration

```go
type CircuitBreakerConfig struct {
	FailureThreshold       int           // Failures needed to open the circuit (default 3)
	FailureWindow          time.Duration // Window in which failures are counted (default 5 minutes)
	RecoveryTimeout        time.Duration // Time spent OPEN before moving to HALF_OPEN (default 15 minutes)
	HalfOpenMaxRequests    int           // Test requests allowed while HALF_OPEN (default 2)
	RateLimitPerMinute     int           // Maximum instances per minute (default 2)
	MaxConcurrentInstances int           // Maximum concurrent provisioning operations (default 5)
}
```

### Default Configuration

```go
func DefaultCircuitBreakerConfig() *CircuitBreakerConfig {
	return &CircuitBreakerConfig{
		FailureThreshold:       3,                // 3 consecutive failures
		FailureWindow:          5 * time.Minute,  // Within 5 minutes
		RecoveryTimeout:        15 * time.Minute, // Wait 15 minutes before retry
		HalfOpenMaxRequests:    2,                // Allow 2 test requests
		RateLimitPerMinute:     2,                // Max 2 instances/minute
		MaxConcurrentInstances: 5,                // Max 5 concurrent provisions
	}
}
```

## Circuit Breaker States

### CLOSED (Normal Operation)

- All provisioning requests allowed
- Failures are tracked but don't block requests
- Rate limiting and concurrency limits still apply

### OPEN (Failing Fast)

- All provisioning requests blocked
- Returns `CircuitBreakerError` immediately
- Automatically transitions to HALF_OPEN after `RecoveryTimeout`

### HALF_OPEN (Testing Recovery)

- Limited number of test requests allowed (`HalfOpenMaxRequests`)
- Success → transitions to CLOSED
- Failure → transitions back to OPEN

## Protection Mechanisms

### 1. Rate Limiting

```go
// Prevents more than N instances per minute
if cb.instancesThisMinute >= cb.config.RateLimitPerMinute {
	return &RateLimitError{
		Limit:       cb.config.RateLimitPerMinute,
		Current:     cb.instancesThisMinute,
		TimeToReset: time.Minute - time.Since(cb.lastMinuteReset),
	}
}
```

### 2. Concurrency Control

```go
// Prevents too many simultaneous provisioning operations
if cb.concurrentInstances >= cb.config.MaxConcurrentInstances {
	return &ConcurrencyLimitError{
		Limit:   cb.config.MaxConcurrentInstances,
		Current: cb.concurrentInstances,
	}
}
```

### 3. Failure Detection

```go
// Opens circuit after threshold failures within window
if recentFailures >= cb.config.FailureThreshold {
	cb.transitionToOpen()
}
```

## Integration with CloudProvider

The circuit breaker is integrated into the main provisioning flow:

```go
// Check circuit breaker before provisioning
if err := c.circuitBreaker.CanProvision(ctx, nodeClass.Name, nodeClass.Spec.Region, 0); err != nil {
	return nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("circuit breaker blocked provisioning: %w", err))
}

// ... provisioning call elided in this excerpt; err below holds its result ...

// Record success/failure after provisioning
if err != nil {
	c.circuitBreaker.RecordFailure(nodeClass.Name, nodeClass.Spec.Region, err)
} else {
	c.circuitBreaker.RecordSuccess(nodeClass.Name, nodeClass.Spec.Region)
}
```

## Error Types

### CircuitBreakerError

```go
type CircuitBreakerError struct {
	State      CircuitBreakerState
	Message    string
	TimeToWait time.Duration
}
```

### RateLimitError

```go
type RateLimitError struct {
	Limit       int
	Current     int
	TimeToReset time.Duration
}
```

### ConcurrencyLimitError

```go
type ConcurrencyLimitError struct {
	Limit   int
	Current int
}
```

## Monitoring and Observability

### Circuit Breaker Status

```go
type CircuitBreakerStatus struct {
	State               CircuitBreakerState
	RecentFailures      int
	FailureThreshold    int
	InstancesThisMinute int
	RateLimit           int
	ConcurrentInstances int
	MaxConcurrent       int
	LastStateChange     time.Time
	TimeToRecovery      time.Duration
}
```

### Logging

The circuit breaker provides logging for:

- State transitions
- Provisioning attempts (allowed/blocked)
- Success/failure recording
- Rate limit and concurrency violations

## Testing

### Test Coverage

- 15 comprehensive test cases in `/pkg/cloudprovider/circuitbreaker_test.go`
- Scale-up storm prevention simulation
- State transition testing
- Concurrent access testing
- Error scenario coverage

### Example Test: Scale-Up Storm Prevention

```go
func TestCircuitBreaker_ScaleUpStormPrevention(t *testing.T) {
	config := &CircuitBreakerConfig{
		FailureThreshold:       3,
		FailureWindow:          5 * time.Minute,
		RecoveryTimeout:        15 * time.Minute,
		HalfOpenMaxRequests:    2,
		RateLimitPerMinute:     2, // This prevents 90 instances in 15 minutes
		MaxConcurrentInstances: 5,
	}

	// ... construction of cb from config and creation of ctx elided in this excerpt ...

	// Simulate rapid provisioning attempts
	successfulProvisions := 0
	for i := 0; i < 10; i++ {
		err := cb.CanProvision(ctx, "test-nodeclass", "us-south", 0)
		if err == nil {
			successfulProvisions++
			// Simulate bootstrap failures
			cb.RecordFailure("test-nodeclass", "us-south", fmt.Errorf("bootstrap failure %d", i))
		}
	}

	// Should be limited to 2 successful provisions per minute
	assert.Equal(t, 2, successfulProvisions)
}
```

## Configuration Examples

### Conservative (High Availability)

```go
config := &CircuitBreakerConfig{
	FailureThreshold:       2,                // Open after 2 failures
	FailureWindow:          3 * time.Minute,  // Within 3 minutes
	RecoveryTimeout:        30 * time.Minute, // Wait 30 minutes
	HalfOpenMaxRequests:    1,                // Only 1 test request
	RateLimitPerMinute:     1,                // Max 1 instance/minute
	MaxConcurrentInstances: 3,                // Max 3 concurrent
}
```

### Aggressive (Development)

```go
config := &CircuitBreakerConfig{
	FailureThreshold:       5,                // Open after 5 failures
	FailureWindow:          10 * time.Minute, // Within 10 minutes
	RecoveryTimeout:        5 * time.Minute,  // Wait 5 minutes
	HalfOpenMaxRequests:    5,                // Allow 5 test requests
	RateLimitPerMinute:     5,                // Max 5 instances/minute
	MaxConcurrentInstances: 10,               // Max 10 concurrent
}
```

## Best Practices

1. **Monitor Circuit Breaker State**: Track state transitions and failure patterns
2. **Configure for Environment**: Use conservative settings for production
3. **Alert on Circuit Open**: Set up alerts when circuit breaker opens
4. **Review Failure Patterns**: Analyze failure records to identify root causes
5. **Test Recovery**: Verify circuit breaker recovery works as expected

## Troubleshooting

### Circuit Breaker Stuck Open

- Check failure threshold configuration
- Verify recovery timeout is appropriate
- Review recent failure records
- Ensure underlying issues are resolved

### Rate Limiting Too Restrictive

- Adjust `RateLimitPerMinute` based on workload requirements
- Consider burst provisioning needs
- Monitor provisioning patterns

### Concurrency Issues

- Adjust `MaxConcurrentInstances` based on IBM Cloud quotas
- Consider VPC limits and API rate limits
- Monitor concurrent provisioning operations

## Related Documentation

- [Troubleshooting](troubleshooting.md) - General troubleshooting guide
- [Limitations](limitations.md) - IBM Cloud platform limitations
- [VPC Integration](vpc-integration.md) - VPC-specific configuration
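
## Appendix: State Transition Sketch

The CLOSED → OPEN → HALF_OPEN cycle described under Circuit Breaker States can be illustrated with a small standalone program. This is a minimal sketch, not the provider's code: `sketchBreaker`, its methods, and `runDemo` are hypothetical names invented for illustration, and the sketch leaves out rate limiting, concurrency control, and the `HalfOpenMaxRequests` cap. It uses the default values from this document (3 failures, 5-minute window, 15-minute recovery) and passes timestamps explicitly so the behavior does not depend on the wall clock.

```go
package main

import (
	"fmt"
	"time"
)

// State mirrors the three states named in this document.
type State string

const (
	Closed   State = "CLOSED"
	Open     State = "OPEN"
	HalfOpen State = "HALF_OPEN"
)

// sketchBreaker is a hypothetical, simplified breaker. It is not the
// provider's CircuitBreaker type.
type sketchBreaker struct {
	state            State
	failures         []time.Time // timestamps of failures inside the window
	openedAt         time.Time
	failureThreshold int
	failureWindow    time.Duration
	recoveryTimeout  time.Duration
}

// allow reports whether a request may proceed at time now. It moves the
// breaker from OPEN to HALF_OPEN once the recovery timeout has elapsed.
func (b *sketchBreaker) allow(now time.Time) bool {
	if b.state == Open && now.Sub(b.openedAt) >= b.recoveryTimeout {
		b.state = HalfOpen
	}
	return b.state != Open
}

// recordFailure drops failures older than the window, records the new one,
// and opens the circuit on a HALF_OPEN failure or when the threshold is hit.
func (b *sketchBreaker) recordFailure(now time.Time) {
	recent := b.failures[:0]
	for _, t := range b.failures {
		if now.Sub(t) < b.failureWindow {
			recent = append(recent, t)
		}
	}
	b.failures = append(recent, now)
	if b.state == HalfOpen || len(b.failures) >= b.failureThreshold {
		b.state = Open
		b.openedAt = now
	}
}

// recordSuccess closes the circuit after a successful HALF_OPEN probe.
func (b *sketchBreaker) recordSuccess() {
	if b.state == HalfOpen {
		b.state = Closed
		b.failures = nil
	}
}

// runDemo walks through one full cycle and returns the observed states.
func runDemo() []State {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	b := &sketchBreaker{
		state:            Closed,
		failureThreshold: 3,
		failureWindow:    5 * time.Minute,
		recoveryTimeout:  15 * time.Minute,
	}
	var trace []State

	// Three failures one minute apart reach the threshold.
	for i := 0; i < 3; i++ {
		b.recordFailure(start.Add(time.Duration(i) * time.Minute))
	}
	trace = append(trace, b.state)

	// Eight minutes after opening, requests are still blocked.
	b.allow(start.Add(10 * time.Minute))
	trace = append(trace, b.state)

	// Sixteen minutes after opening, a probe request is admitted.
	b.allow(start.Add(18 * time.Minute))
	trace = append(trace, b.state)

	// The probe succeeds, so the circuit closes again.
	b.recordSuccess()
	trace = append(trace, b.state)
	return trace
}

func main() {
	fmt.Println(runDemo()) // prints [OPEN OPEN HALF_OPEN CLOSED]
}
```

Passing `now` as a parameter instead of calling `time.Now()` inside the methods is a design choice of the sketch: it makes the recovery timeout testable without sleeping.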