# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is the Karpenter Provider for IBM Cloud, which implements Karpenter's node provisioning functionality for IBM Cloud VPC infrastructure. The project is written in Go and follows Kubernetes operator patterns. It supports both IKS (IBM Kubernetes Service) and VPC deployment models.

## Development Tools

- kubebuilder for RBAC management
- controller-gen for CRD and code generation
- golangci-lint for code linting
- ginkgo/gomega as the testing framework
- ko for container image building

## Build and Testing

- Use `make build` to build the controller binary
- Use `make test` or `make unit` to run unit tests
- Use `make ci` to run the full CI pipeline (tests + linting)
- Use `make lint` to run golangci-lint
- Use `make generate` to generate CRDs and code
- Use `make vendor` to update dependencies

## SSH and Node Access

- SSH into a node by attaching a floating IP and using `~/.ssh/eb`
- Use @./scripts/troubleshoot-node.sh to help debug nodes over SSH

## Cluster-Specific Notes

- We have a function to detect whether the cluster is IKS or VPC
- IKS doesn't need to join nodes like other clusters, since we just scale the node group
- Zone validation is implemented for subnets to ensure proper placement

## Kubectl Commands

- You NEED to use @kubeconfig when running kubectl commands

## Troubleshooting

- Always make sure the latest operator version is deployed when troubleshooting live
- Use @./scripts/must-gather.sh to collect diagnostic information for support requests (automatically sanitizes sensitive data)
- Additional utility scripts:
  - cleanup-unused-vnis.sh: clean up unused virtual network interfaces
  - generate-k8s-support-table.sh: generate the Kubernetes support matrix

## HTTP Client

- We have a centralized HTTP client in pkg/httpclient for IBM Cloud API operations
- This provides consistent authentication and request handling across IBM Cloud services

## Endpoints

- remember the internal endpoint

## Code Comments

- When adding comments, avoid being too verbose
- Comments should be helpful to developers but not include:
  - Historic change context (e.g., "this now uses x")
  - Prompt artifacts (e.g., "instead of y")
  - For example, "Tests for previously uncovered functions" is a bad comment

## Dependencies

- Current Go version: 1.24.6 (runtime: 1.25.2)
- Key dependencies:
  - github.com/IBM/vpc-go-sdk v0.74.0
  - github.com/IBM/platform-services-go-sdk v0.88.0
  - sigs.k8s.io/karpenter v1.7.1
  - k8s.io/api v0.35.0-alpha.0

## Recent Features

- Zone subnet validation for proper node placement
- Improved test coverage and CI/CD pipeline (87 unit tests, 11 integration tests)
- HTTP client abstraction for IBM Cloud APIs
- Enhanced bootstrap methods for different deployment scenarios
- Must-gather diagnostic script with automatic data sanitization
- Refined RBAC permissions following security best practices
- Updated dependencies (Karpenter v1.7.1, VPC SDK v0.74.0)

## Current Code Quality Status (October 2025)

### Resolved Issues (from previous reviews)

- All panic statements removed from production code
- Structured logging implemented (mostly; see outstanding issues)
- Architecture detection now uses the VPC API dynamically (supports amd64/s390x)
- context.TODO() usage replaced with proper context propagation
- Godoc coverage at 95%+ for exported functions

### Outstanding Issues Requiring Attention

1. ~~**Debug logging in vpc.go**: fmt.Printf statements for VPC-DEBUG (lines 264-282) should use structured logging~~ ✅ FIXED (September 2025)
2. ~~**RBAC permissions**: ClusterRoleBinding management permissions overly broad~~ ✅ FIXED (October 2025): limited to necessary verbs only
3. **Magic values**: `30*time.Minute` appears 8 times across the codebase and needs constant extraction
4. **TODO comment**: Subnet tag extraction pending IBM SDK support (pkg/providers/vpc/subnet/provider.go:430)

### Test Coverage

- 87 unit test files in pkg/
- 11 integration test files in test/
- Comprehensive mock infrastructure
- All test suites enabled and passing

## Code Design Principles

- Do not add unnecessary fallback mechanisms that will reasonably never be used, and do not use hardcoded values instead of fetching from a live API
- Use the centralized HTTP client for IBM Cloud API calls
- Follow existing patterns for controller implementations

## Testing Principles

- Never simplify tests just to make them pass; the goal is to verify that the underlying functionality is solid
- Use ginkgo/gomega for test structure
- Maintain high test coverage

## Function Updates

- Never add a NewFunction if Function exists; update the existing one
- Don't use "New" in function names unless it really is justified (e.g., NewDatabase, which creates a new database), and never when merely migrating to a new approach

## Git Commit Guidelines

- Don't sign off commits as Claude
- Use @kubeconfig_test for the br cluster
- The br cluster already has test workloads on it to trigger scaling

## Karpenter Core Features - AWS Provider Comparison (September 2025)

### Features the IBM Provider Should Better Utilize

Based on a comparison with the AWS provider reference, the following Karpenter core features could be better leveraged:

#### 1. **API Call Batching** ❌ Missing

- AWS has a sophisticated `pkg/batcher` for batching identical API calls
- Reduces API throttling and improves performance
- IBM could benefit from batching VPC API calls (instance lists, subnet queries)

#### 2. **Advanced Drift Detection** ⚠️ Basic Implementation

- IBM has basic drift detection via NodeClass hash comparison
- AWS implements more sophisticated drift checks, including:
  - AMI drift detection
  - Security group changes
  - Subnet changes
  - Instance profile modifications
- Consider expanding drift detection capabilities

#### 3. **Disruption Reasons** ❌ Not Implemented

- IBM's `DisruptionReasons()` returns nil
- AWS provides detailed disruption reasons for better observability
- Should implement specific disruption reasons for IBM Cloud scenarios

#### 4. **Repair Policies** ✅ Implemented

- IBM has a good implementation matching AWS patterns
- Covers NodeReady, NodeMemoryPressure, and NodeDiskPressure conditions

#### 5. **Interruption Handling** ✅ Implemented

- Both have interruption controllers
- IBM could enhance with more IBM-specific interruption events

#### 6. **Capacity Reservation Support** ❌ Missing

- AWS has a dedicated capacity reservation provider
- IBM could benefit from reserved instance/dedicated host support

#### 7. **Metrics Collection** ⚠️ Basic

- IBM has basic metrics, but AWS has more comprehensive coverage
- Consider adding metrics for:
  - Batched API call efficiency
  - Drift detection events
  - Disruption reason breakdown

### Recommendations Priority

1. **High**: Implement API call batching to reduce VPC API load
2. **High**: Enhance drift detection for better node lifecycle management
3. **Medium**: Add disruption reasons for improved debugging
4. **Medium**: Add capacity reservation support if IBM Cloud offers it
5. **Low**: Expand metrics collection for better observability

## Additional Notes

- Waiting for circuit-breaker recovery is conceptually wrong because it is triggered by provisioning failures; there is plenty of telemetry available to troubleshoot the root cause of provisioning failures
- Absolutely don't try to migrate away from VNIs; primary network attachments are deprecated
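The centralized HTTP client described in the HTTP Client section could look roughly like the sketch below: one wrapper that injects authentication so individual IBM Cloud call sites never handle it themselves. This is a minimal illustration, not the actual pkg/httpclient API; the `client` type, its fields, and the token value are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// client is an illustrative stand-in for a centralized HTTP client:
// every request gets consistent auth handling in one place.
// The type and field names are assumptions, not the real pkg/httpclient API.
type client struct {
	token string
	inner *http.Client
}

// Do injects the bearer token before delegating to the underlying client,
// so call sites never deal with authentication directly.
func (c *client) Do(req *http.Request) (*http.Response, error) {
	req.Header.Set("Authorization", "Bearer "+c.token)
	return c.inner.Do(req)
}

func main() {
	// A local test server that echoes back the Authorization header it saw.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("Authorization"))
	}))
	defer srv.Close()

	c := &client{token: "example-token", inner: srv.Client()}
	req, _ := http.NewRequest(http.MethodGet, srv.URL, nil)
	resp, err := c.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // Bearer example-token
}
```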
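The magic-value cleanup listed under Outstanding Issues (the repeated `30*time.Minute`) amounts to extracting a named constant and using it at every call site. A minimal sketch, with `defaultOperationTimeout` and `contextDeadline` as hypothetical names, not identifiers from this codebase:

```go
package main

import (
	"fmt"
	"time"
)

// defaultOperationTimeout replaces the 30*time.Minute literal repeated
// across the codebase. The constant name here is illustrative only.
const defaultOperationTimeout = 30 * time.Minute

// contextDeadline shows one hypothetical call site using the named
// constant instead of an inline literal.
func contextDeadline(start time.Time) time.Time {
	return start.Add(defaultOperationTimeout)
}

func main() {
	start := time.Date(2025, 10, 1, 12, 0, 0, 0, time.UTC)
	fmt.Println(contextDeadline(start)) // prints the start time plus 30 minutes
}
```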
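The API-call batching idea from the AWS comparison (point 1) can be sketched as a coalescer: identical calls made within a short window share one backend request. This is a toy pattern under assumed names, not the AWS `pkg/batcher` implementation or any existing code in this repository.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batcher coalesces identical calls made within a short window into a
// single backend request. All names here are illustrative.
type batcher struct {
	mu      sync.Mutex
	window  time.Duration
	pending map[string][]chan string
	fetch   func(key string) string // the real API call
}

func newBatcher(window time.Duration, fetch func(string) string) *batcher {
	return &batcher{window: window, pending: map[string][]chan string{}, fetch: fetch}
}

// Get registers interest in key; the first caller in a window schedules one
// fetch after the window elapses, and every waiter receives the same result.
func (b *batcher) Get(key string) string {
	ch := make(chan string, 1)
	b.mu.Lock()
	b.pending[key] = append(b.pending[key], ch)
	first := len(b.pending[key]) == 1
	b.mu.Unlock()

	if first {
		time.AfterFunc(b.window, func() {
			b.mu.Lock()
			waiters := b.pending[key]
			delete(b.pending, key)
			b.mu.Unlock()
			result := b.fetch(key) // one backend call serves all waiters
			for _, w := range waiters {
				w <- result
			}
		})
	}
	return <-ch
}

func main() {
	calls := 0
	b := newBatcher(100*time.Millisecond, func(key string) string {
		calls++ // runs once per batch, in the single timer goroutine
		return "instances-for-" + key
	})

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.Get("zone-1") // five concurrent callers share one backend call
		}()
	}
	wg.Wait()
	fmt.Println("backend calls:", calls)
}
```

In a real provider the window would be tuned against VPC API rate limits, and the fetch would return results plus an error to each waiter.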
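The "drift detection via NodeClass hash comparison" mentioned in point 2 can be illustrated as follows: hash the spec at provisioning time, then compare the recorded hash against the current spec's hash. The `nodeClassSpec` fields below are invented for the example and do not reflect the actual CRD schema.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// nodeClassSpec is a stand-in for the provider's NodeClass spec;
// its fields are illustrative, not the CRD's actual schema.
type nodeClassSpec struct {
	Image           string   `json:"image"`
	InstanceProfile string   `json:"instanceProfile"`
	Subnets         []string `json:"subnets"`
}

// specHash returns a stable hash of the spec. Struct field order fixes the
// JSON key order, so identical specs always hash identically.
func specHash(s nodeClassSpec) string {
	raw, _ := json.Marshal(s)
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}

// drifted reports whether the current spec no longer matches the hash
// recorded when the node was provisioned.
func drifted(recorded string, current nodeClassSpec) bool {
	return recorded != specHash(current)
}

func main() {
	orig := nodeClassSpec{Image: "ibm-ubuntu-24-04", InstanceProfile: "bx2-4x16", Subnets: []string{"subnet-a"}}
	recorded := specHash(orig)

	changed := orig
	changed.InstanceProfile = "bx2-8x32"

	fmt.Println(drifted(recorded, orig))    // false: nothing changed
	fmt.Println(drifted(recorded, changed)) // true: profile drifted
}
```

Expanding drift detection, as the comparison suggests, would mean checking individual dimensions (image, security groups, subnets) separately rather than one opaque hash, so the drift reason can be reported.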