canary-deployment

aj-geddes
Updated Today

About

This Claude Skill enables gradual canary deployments by rolling out new versions to a subset of users while monitoring metrics. It automatically triggers rollbacks when monitored metrics breach predefined thresholds, minimizing user impact. Use it for low-risk rollouts, real-world testing with live traffic, and metrics-driven deployment strategies.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/aj-geddes/useful-ai-prompts
Git Clone (Alternative)
git clone https://github.com/aj-geddes/useful-ai-prompts.git ~/.claude/skills/canary-deployment

Copy and paste this command into Claude Code to install this skill.

Documentation

Canary Deployment

Overview

Deploy new versions gradually to a small percentage of users, monitor metrics for issues, and automatically rollback or proceed based on predefined thresholds.

When to Use

  • Low-risk gradual rollouts
  • Real-world testing with live traffic
  • Automatic rollback on errors
  • User impact minimization
  • A/B testing integration
  • Metrics-driven deployments
  • High-traffic services

Implementation Examples

1. Istio-based Canary Deployment

# canary-deployment-istio.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v1
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: v1
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:1.0.0
          ports:
            - containerPort: 8080

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v2
  namespace: production
spec:
  replicas: 1  # Start with minimal replicas for canary
  selector:
    matchLabels:
      app: myapp
      version: v2
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:2.0.0
          ports:
            - containerPort: 8080

---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
  namespace: production
spec:
  hosts:
    - myapp
  http:
    # Header-based rule: matching requests go entirely to v2
    # (e.g. to route internal testers to the canary)
    - match:
        - headers:
            user-agent:
              regex: ".*Chrome.*"
      route:
        - destination:
            host: myapp
            subset: v2
          weight: 100
      timeout: 10s

    # Default route: 95% of traffic to v1, 5% to the v2 canary
    - route:
        - destination:
            host: myapp
            subset: v1
          weight: 95
        - destination:
            host: myapp
            subset: v2
          weight: 5
      timeout: 10s

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
  namespace: production
spec:
  host: myapp
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2

  subsets:
    - name: v1
      labels:
        version: v1

    - name: v2
      labels:
        version: v2
      trafficPolicy:
        outlierDetection:
          consecutive5xxErrors: 3
          interval: 30s
          baseEjectionTime: 30s

---
# Note: Flagger creates and manages its own VirtualService for the target
# deployment, so when Flagger drives the rollout the hand-written
# VirtualService above is typically omitted.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 300
  service:
    port: 80

  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    stepWeightPromotion: 10

  metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m

    - name: request-duration
      thresholdRange:
        max: 500
      interval: 30s

  webhooks:
    - name: acceptance-test
      url: http://flagger-loadtester/
      timeout: 30s
      metadata:
        type: smoke
        cmd: "curl -sd 'test' http://myapp-canary/api/test"

    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
        logCmdOutput: "true"

  # Automatic rollback on failure
  skipAnalysis: false
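The analysis settings above determine the promotion schedule. A minimal sketch (plain shell, purely illustrative, not part of Flagger) of the traffic ramp implied by stepWeight: 5, maxWeight: 50, and interval: 1m:

```shell
#!/bin/sh
# Sketch of the traffic ramp implied by the Flagger analysis spec above.
step=5          # analysis.stepWeight
max=50          # analysis.maxWeight
interval_min=1  # analysis.interval, in minutes

weight=$step
steps=0
ramp=""
while [ "$weight" -le "$max" ]; do
  ramp="$ramp $weight"
  steps=$((steps + 1))
  weight=$((weight + step))
done

echo "canary weights (%):$ramp"
echo "minimum time before promotion: $((steps * interval_min))m"
```

Each step only advances if request-success-rate and request-duration stay within their thresholdRange for the interval; the threshold: 5 setting allows five failed checks before rollback.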

2. Kubernetes Native Canary Script

#!/bin/bash
# canary-rollout.sh - Canary deployment with k8s native tools

set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
NEW_VERSION="${3:-latest}"
CANARY_WEIGHT=10
MAX_WEIGHT=100
STEP_WEIGHT=10
CHECK_INTERVAL=60
MAX_ERROR_RATE=0.05

echo "Starting canary deployment for $DEPLOYMENT with version $NEW_VERSION"

# Get current replicas
CURRENT_REPLICAS=$(kubectl get deployment $DEPLOYMENT -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')
CANARY_REPLICAS=$((CURRENT_REPLICAS / 10 + 1))

echo "Current replicas: $CURRENT_REPLICAS, Canary replicas: $CANARY_REPLICAS"

# Point the canary deployment at the new image and size it
kubectl set image deployment/${DEPLOYMENT}-canary \
  ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \
  -n "$NAMESPACE"
kubectl scale deployment/${DEPLOYMENT}-canary \
  --replicas="$CANARY_REPLICAS" -n "$NAMESPACE"

# Scale up canary gradually
CURRENT_WEIGHT=$CANARY_WEIGHT
while [ $CURRENT_WEIGHT -le $MAX_WEIGHT ]; do
  echo "Setting traffic to canary: ${CURRENT_WEIGHT}%"

  # Update the VirtualService to split traffic between stable and canary
  PATCH=$(cat <<EOF
{"spec":{"http":[{"route":[
  {"destination":{"host":"${DEPLOYMENT}-stable"},"weight":$((100 - CURRENT_WEIGHT))},
  {"destination":{"host":"${DEPLOYMENT}-canary"},"weight":${CURRENT_WEIGHT}}
]}]}}
EOF
)
  kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge -p "$PATCH"

  # Wait and check metrics
  echo "Monitoring metrics for ${CHECK_INTERVAL}s..."
  sleep $CHECK_INTERVAL

  # Compute a 5xx error ratio from the app's Prometheus-format /metrics
  # endpoint (assumes an http_requests_total counter labelled by status)
  ERROR_RATE=$(kubectl exec deployment/${DEPLOYMENT}-canary -n "$NAMESPACE" -- \
    curl -s http://localhost:8080/metrics | \
    awk '/^http_requests_total\{status="5/ { err += $2 }
         /^http_requests_total/            { total += $2 }
         END { if (total > 0) print err / total; else print 0 }' || echo "0")

  if (( $(echo "$ERROR_RATE > $MAX_ERROR_RATE" | bc -l) )); then
    echo "ERROR: Error rate exceeded threshold: $ERROR_RATE"
    echo "Rolling back canary deployment..."
    kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge \
      -p '{"spec":{"http":[{"route":[{"destination":{"host":"'${DEPLOYMENT}-stable'"},"weight":100}]}]}}'
    exit 1
  fi

  CURRENT_WEIGHT=$((CURRENT_WEIGHT + STEP_WEIGHT))
done

# Promote canary to stable
echo "Canary deployment successful! Promoting to stable..."
kubectl set image deployment/${DEPLOYMENT} \
  ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \
  -n "$NAMESPACE"

kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m

echo "Canary deployment complete!"
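The in-script health check depends on what the app's /metrics endpoint exposes. The ratio computation can be exercised offline: a self-contained sketch that derives a 5xx error ratio from Prometheus text-format counters, with sample data standing in for the live endpoint (the http_requests_total name and status label are assumptions about the app):

```shell
#!/bin/sh
# Sample /metrics output: 1000 requests total, 50 of them 5xx.
metrics='http_requests_total{status="200"} 950
http_requests_total{status="500"} 30
http_requests_total{status="503"} 20'

# Sum 5xx counters and all counters, then print the ratio.
ERROR_RATE=$(printf '%s\n' "$metrics" | awk '
  /^http_requests_total\{status="5/ { err += $2 }
  /^http_requests_total/            { total += $2 }
  END { if (total > 0) printf "%.2f\n", err / total; else print 0 }')

echo "error rate: $ERROR_RATE"
```

With these sample counters the ratio is exactly the 0.05 threshold used by the script, i.e. the canary would be on the edge of rollback.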

3. Metrics-Based Canary Analysis

# canary-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: canary-analysis
  namespace: production
data:
  analyze.sh: |
    #!/bin/bash
    set -euo pipefail

    CANARY_DEPLOYMENT="${1:-myapp-canary}"
    STABLE_DEPLOYMENT="${2:-myapp-stable}"
    THRESHOLD="${3:-0.05}"  # 5% error rate threshold
    NAMESPACE="production"

    echo "Analyzing canary metrics..."

    # Query Prometheus for metrics
    CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${CANARY_DEPLOYMENT}'"}[5m])' | \
      jq -r '.data.result[0].value[1]' || echo "0")

    STABLE_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${STABLE_DEPLOYMENT}'"}[5m])' | \
      jq -r '.data.result[0].value[1]' || echo "0")

    CANARY_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1]' || echo "0")

    STABLE_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1]' || echo "0")

    echo "Canary Error Rate: $CANARY_ERROR_RATE"
    echo "Stable Error Rate: $STABLE_ERROR_RATE"
    echo "Canary P95 Latency: ${CANARY_LATENCY}s"
    echo "Stable P95 Latency: ${STABLE_LATENCY}s"

    # Check if canary is within acceptable range
    if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
      echo "FAIL: Canary error rate exceeds threshold"
      exit 1
    fi

    if (( $(echo "$CANARY_LATENCY > $STABLE_LATENCY * 1.2" | bc -l) )); then
      echo "FAIL: Canary latency is 20% higher than stable"
      exit 1
    fi

    echo "PASS: Canary meets quality criteria"
    exit 0

---
apiVersion: batch/v1
kind: Job
metadata:
  name: canary-analysis
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: analyzer
          # plain alpine runs as root, so bash and the analysis deps can be
          # installed at startup (curlimages/curl is non-root and lacks bash)
          image: alpine:3.19
          command:
            - sh
            - -c
            - |
              apk add --no-cache bash curl bc jq
              bash /scripts/analyze.sh
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: canary-analysis
      restartPolicy: Never
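The latency gate in analyze.sh compares the canary's p95 against 1.2x the stable p95. Replayed here with sample numbers (awk stands in for bc so the sketch has no extra dependency; the latencies are made up):

```shell
#!/bin/sh
# Replaying analyze.sh's latency-regression gate with sample values.
CANARY_LATENCY=0.30   # canary p95 in seconds (sample)
STABLE_LATENCY=0.24   # stable p95 in seconds (sample)

# FAIL if the canary p95 exceeds the stable p95 by more than 20%
RESULT=$(awk -v c="$CANARY_LATENCY" -v s="$STABLE_LATENCY" \
  'BEGIN { print (c > s * 1.2) ? "FAIL" : "PASS" }')

echo "$RESULT: canary ${CANARY_LATENCY}s vs stable ${STABLE_LATENCY}s"
```

Here 0.30s is above the 0.288s limit (0.24s x 1.2), so the gate fails and the Job exits non-zero, which is what a caller would key the rollback on.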

4. Automated Canary Promotion

#!/bin/bash
# promote-canary.sh - Automatically promote successful canary

set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
MAX_DURATION="${3:-600}"  # Max 10 minutes for canary

start_time=$(date +%s)

echo "Starting automated canary promotion for $DEPLOYMENT"

while true; do
  current_time=$(date +%s)
  elapsed=$((current_time - start_time))

  if [ $elapsed -gt $MAX_DURATION ]; then
    echo "ERROR: Canary exceeded max duration"
    exit 1
  fi

  # Check canary health
  CANARY_REPLICAS=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
    -o jsonpath='{.status.readyReplicas}')
  CANARY_DESIRED=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')

  # readyReplicas is absent from status until a pod is ready, so default to 0
  if [ "${CANARY_REPLICAS:-0}" -ne "$CANARY_DESIRED" ]; then
    echo "Waiting for canary pods to be ready..."
    sleep 10
    continue
  fi

  # Run analysis
  if bash /scripts/analyze.sh "$DEPLOYMENT-canary" "$DEPLOYMENT-stable"; then
    echo "Canary analysis passed! Promoting to stable..."

    # Merge canary into stable
    kubectl set image deployment/${DEPLOYMENT} \
      ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
        -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2) \
      -n "$NAMESPACE"

    kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m

    echo "Canary promoted successfully!"
    exit 0
  else
    echo "Canary analysis failed. Rolling back..."
    exit 1
  fi
done
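One fragile spot in the promotion step above: the image tag is extracted with cut -d: -f2, which grabs the wrong field when the registry host carries a port. A quick demonstration (the registry name is made up):

```shell
#!/bin/sh
# A registry with a port adds a second colon to the image reference,
# so "field 2 after splitting on :" is no longer the tag.
IMAGE="registry.example.com:5000/myapp:2.0.0"

CUT_TAG=$(echo "$IMAGE" | cut -d: -f2)   # first field after a colon
SAFE_TAG="${IMAGE##*:}"                  # everything after the LAST colon

echo "cut -d: -f2  -> $CUT_TAG"
echo "\${IMAGE##*:} -> $SAFE_TAG"
```

Using the parameter expansion `${IMAGE##*:}` in promote-canary.sh avoids the problem with no extra process.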

Canary Best Practices

✅ DO

  • Start with small traffic percentage (5-10%)
  • Monitor key metrics continuously
  • Increase gradually based on metrics
  • Implement automatic rollback
  • Run load tests on canary
  • Test with real user traffic
  • Set appropriate thresholds
  • Document rollback procedures

❌ DON'T

  • Rush through canary phases
  • Ignore metrics
  • Mix canary and stable versions
  • Deploy to all users at once
  • Skip rollback testing
  • Use artificial load only
  • Set unrealistic thresholds
  • Deploy unvalidated changes

Metrics to Monitor

  • Error Rate: increase in 5xx responses
  • Latency: P95/P99 response time
  • Throughput: Requests per second
  • Resource Usage: CPU, memory
  • Business Metrics: Conversion rate, revenue
  • User Experience: Session duration, bounce rate
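The checks above can be rolled into a single pass/fail gate. A sketch that scores sample metric values against illustrative limits (every name and number is a placeholder for your own SLOs):

```shell
#!/bin/sh
# Each line: metric name, observed canary value, upper limit (sample data).
checks='error_rate 0.031 0.050
p95_latency_ms 420 500
cpu_ratio 1.35 1.20'

# Score each metric and count failures; a non-zero count should block promotion.
REPORT=$(printf '%s\n' "$checks" | awk '
  { status = ($2 + 0 <= $3 + 0) ? "PASS" : "FAIL"
    if (status == "FAIL") bad++
    printf "%-16s %s (value %s, limit %s)\n", $1, status, $2, $3 }
  END { printf "failed checks: %d\n", bad + 0 }')

echo "$REPORT"
```

In this sample only cpu_ratio breaches its limit, so the report ends with one failed check; a wrapper script would treat any non-zero count as a rollback signal.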

Resources

GitHub Repository

aj-geddes/useful-ai-prompts
Path: skills/canary-deployment
