container-apps-gpu-2025
About
This skill provides comprehensive documentation on Azure Container Apps' 2025 GPU capabilities for developers building AI/ML workloads. It covers key features like serverless GPU with scale-to-zero billing and Dapr integration for simplified microservices. Use this when implementing cost-efficient, scalable containerized applications requiring GPU acceleration.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/majiayu000/claude-skill-registry
Or clone directly: git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/container-apps-gpu-2025
Copy and paste either command in Claude Code to install this skill.
Documentation
Azure Container Apps GPU Support - 2025 Features
Complete knowledge base for Azure Container Apps with GPU support, serverless capabilities, and Dapr integration (2025 GA features).
Overview
Azure Container Apps is a serverless container platform with native GPU support, Dapr integration, and scale-to-zero capabilities for cost-efficient AI/ML workloads.
Key 2025 Features (Build Announcements)
1. Serverless GPU (GA)
- Automatic scaling: Scale GPU workloads based on demand
- Scale-to-zero: Pay only when GPU is actively used
- Per-second billing: Granular cost control
- Optimized cold start: Fast initialization for AI models
- Reduced operational overhead: No infrastructure management
2. Dedicated GPU (GA)
- Consistent performance: Dedicated GPU resources
- Simplified AI deployment: Easy model hosting
- Long-running workloads: Ideal for training and continuous inference
- Multiple GPU types: NVIDIA A100, T4, and more
3. Dynamic Sessions with GPU (Early Access)
- Sandboxed execution: Run untrusted AI-generated code
- Hyper-V isolation: Enhanced security
- GPU-powered Python interpreter: Handle compute-intensive AI workloads
- Scale at runtime: Dynamic resource allocation
4. Foundry Models Integration
- Deploy AI models directly: During container app creation
- Ready-to-use models: Pre-configured inference endpoints
- Azure AI Foundry: Seamless integration
5. Workflow with Durable Task Scheduler (Preview)
- Long-running workflows: Reliable orchestration
- State management: Automatic persistence
- Event-driven: Trigger workflows from events
6. Native Azure Functions Support
- Functions runtime: Run Azure Functions in Container Apps
- Consistent development: Same code, serverless execution
- Event triggers: All Functions triggers supported
7. Dapr Integration (GA)
- Service discovery: Built-in DNS-based discovery
- State management: Distributed state stores
- Pub/sub messaging: Reliable messaging patterns
- Service invocation: Resilient service-to-service calls
- Observability: Integrated tracing and metrics
Creating Container Apps with GPU
Basic Container App with Serverless GPU
# Create Container Apps environment
az containerapp env create \
--name myenv \
--resource-group MyRG \
--location eastus \
--logs-workspace-id <workspace-id> \
--logs-workspace-key <workspace-key>
# Create Container App with GPU
az containerapp create \
--name myapp-gpu \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/ai-model:latest \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 10 \
--ingress external \
--target-port 8080
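Once the revision is ready, you can look up the app's ingress FQDN and send a test request. This is a quick sketch; the /health path is only an assumed endpoint on the ai-model image.
# Fetch the ingress FQDN and probe the app (the /health route is an assumed example)
FQDN=$(az containerapp show \
--name myapp-gpu \
--resource-group MyRG \
--query properties.configuration.ingress.fqdn -o tsv)
curl "https://$FQDN/health"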
Production-Ready Container App with GPU
# Flags are grouped in order: container config, resources, scaling,
# networking/security, Dapr, and identity.
az containerapp create \
--name myapp-gpu-prod \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/ai-model:latest \
--registry-server myregistry.azurecr.io \
--registry-identity system \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 20 \
--scale-rule-name http-scaling \
--scale-rule-type http \
--scale-rule-http-concurrency 10 \
--ingress external \
--target-port 8080 \
--transport http2 \
--env-vars "AZURE_CLIENT_ID=secretref:client-id" \
--enable-dapr \
--dapr-app-id myapp \
--dapr-app-port 8080 \
--dapr-app-protocol http \
--system-assigned
Container Apps Environment Configuration
Environment with Zone Redundancy
az containerapp env create \
--name myenv-prod \
--resource-group MyRG \
--location eastus \
--logs-workspace-id <workspace-id> \
--logs-workspace-key <workspace-key> \
--zone-redundant true \
--enable-workload-profiles true
Workload Profiles (Dedicated GPU)
# Create environment with workload profiles
az containerapp env create \
--name myenv-gpu \
--resource-group MyRG \
--location eastus \
--enable-workload-profiles true
# Add GPU workload profile
az containerapp env workload-profile add \
--name myenv-gpu \
--resource-group MyRG \
--workload-profile-name gpu-profile \
--workload-profile-type GPU-A100 \
--min-nodes 0 \
--max-nodes 10
# Create container app with GPU profile
az containerapp create \
--name myapp-dedicated-gpu \
--resource-group MyRG \
--environment myenv-gpu \
--workload-profile-name gpu-profile \
--image myregistry.azurecr.io/training-job:latest \
--cpu 8 \
--memory 16Gi \
--min-replicas 1 \
--max-replicas 5
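Before adding a GPU profile, it can help to confirm which workload profile types a region actually offers, since GPU SKU names vary by region. A hedged sketch using the CLI:
# List workload profile types supported in a region (available GPU SKUs vary by region)
az containerapp env workload-profile list-supported \
--location eastus \
--output table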
GPU Scaling Rules
Custom Prometheus Scaling
az containerapp create \
--name myapp-gpu-prometheus \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/ai-model:latest \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 10 \
--scale-rule-name gpu-utilization \
--scale-rule-type custom \
--scale-rule-custom-type prometheus \
--scale-rule-metadata \
serverAddress=http://prometheus.monitoring.svc.cluster.local:9090 \
metricName=gpu_utilization \
threshold=80 \
query="avg(nvidia_gpu_utilization{app='myapp'})"
Queue-Based Scaling (Azure Service Bus)
az containerapp create \
--name myapp-queue-processor \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/batch-processor:latest \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-t4 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 50 \
--scale-rule-name queue-scaling \
--scale-rule-type azure-servicebus \
--scale-rule-metadata \
queueName=ai-jobs \
namespace=myservicebus \
messageCount=5 \
--scale-rule-auth connection=servicebus-connection
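The scale rule above authenticates with a secret named servicebus-connection. A minimal sketch of creating that secret on the app, assuming you substitute your own Service Bus connection string:
# Store the Service Bus connection string as the secret referenced by --scale-rule-auth
az containerapp secret set \
--name myapp-queue-processor \
--resource-group MyRG \
--secrets servicebus-connection="<service-bus-connection-string>"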
Dapr Integration
Enable Dapr on Container App
az containerapp create \
--name myapp-dapr \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/myapp:latest \
--enable-dapr \
--dapr-app-id myapp \
--dapr-app-port 8080 \
--dapr-app-protocol http \
--dapr-http-max-request-size 4 \
--dapr-http-read-buffer-size 4 \
--dapr-log-level info \
--dapr-enable-api-logging true
Dapr State Store (Azure Cosmos DB)
# Create Dapr component for state store
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore
spec:
  type: state.azure.cosmosdb
  version: v1
  metadata:
  - name: url
    value: "https://mycosmosdb.documents.azure.com:443/"
  - name: masterKey
    secretRef: cosmosdb-key
  - name: database
    value: "mydb"
  - name: collection
    value: "state"
# Create the component
az containerapp env dapr-component set \
--name myenv \
--resource-group MyRG \
--dapr-component-name statestore \
--yaml component.yaml
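After setting the component, you can confirm it is registered on the environment:
# Verify the Dapr component is registered on the environment
az containerapp env dapr-component list \
--name myenv \
--resource-group MyRG \
--output table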
Dapr Pub/Sub (Azure Service Bus)
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: pubsub
spec:
  type: pubsub.azure.servicebus.topics
  version: v1
  metadata:
  - name: connectionString
    secretRef: servicebus-connection
  - name: consumerID
    value: "myapp"
Service-to-Service Invocation
# Python example using the Dapr SDK (pip install dapr)
from dapr.clients import DaprClient

with DaprClient() as client:
    # Invoke another service
    response = client.invoke_method(
        app_id='other-service',
        method_name='process',
        data='{"input": "data"}'
    )

    # Save state
    client.save_state(
        store_name='statestore',
        key='mykey',
        value='myvalue'
    )

    # Publish message
    client.publish_event(
        pubsub_name='pubsub',
        topic_name='orders',
        data='{"orderId": "123"}'
    )
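The same operations are also available over the Dapr sidecar's HTTP API, which is handy for quick checks without the SDK. A sketch assuming the sidecar's default HTTP port of 3500:
# Invoke another service and save state via the Dapr sidecar HTTP API (default port 3500)
curl -X POST "http://localhost:3500/v1.0/invoke/other-service/method/process" \
-H "Content-Type: application/json" \
-d '{"input": "data"}'
curl -X POST "http://localhost:3500/v1.0/state/statestore" \
-H "Content-Type: application/json" \
-d '[{"key": "mykey", "value": "myvalue"}]'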
AI Model Deployment Patterns
OpenAI-Compatible Endpoint
# Dockerfile for vLLM model serving
FROM vllm/vllm-openai:latest
# Exec-form CMD does not expand environment variables,
# so the model settings are passed as literal arguments here.
CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "4096", \
     "--port", "8080"]
# Deploy vLLM model
az containerapp create \
--name llama-inference \
--resource-group MyRG \
--environment myenv \
--image vllm/vllm-openai:latest \
--cpu 8 \
--memory 32Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 1 \
--max-replicas 5 \
--target-port 8080 \
--ingress external \
--env-vars \
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
GPU_MEMORY_UTILIZATION="0.9" \
HF_TOKEN=secretref:huggingface-token
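Once the app is running, vLLM exposes OpenAI-compatible routes. A hedged example of calling the chat completions endpoint through the app's ingress FQDN:
# Call the OpenAI-compatible chat completions endpoint exposed by vLLM
FQDN=$(az containerapp show -g MyRG -n llama-inference \
--query properties.configuration.ingress.fqdn -o tsv)
curl "https://$FQDN/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'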
Stable Diffusion Image Generation
az containerapp create \
--name stable-diffusion \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/stable-diffusion:latest \
--cpu 4 \
--memory 16Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 10 \
--target-port 7860 \
--ingress external \
--scale-rule-name http-scaling \
--scale-rule-type http \
--scale-rule-http-concurrency 1
Batch Processing Job
az containerapp job create \
--name batch-training-job \
--resource-group MyRG \
--environment myenv \
--trigger-type Manual \
--image myregistry.azurecr.io/training:latest \
--cpu 8 \
--memory 32Gi \
--gpu-type nvidia-a100 \
--gpu-count 2 \
--parallelism 1 \
--replica-timeout 7200 \
--replica-retry-limit 3 \
--env-vars \
DATASET_URL="https://mystorage.blob.core.windows.net/datasets/train.csv" \
MODEL_OUTPUT="https://mystorage.blob.core.windows.net/models/" \
EPOCHS="100"
# Execute job
az containerapp job start \
--name batch-training-job \
--resource-group MyRG
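Executions of the job can then be listed to track progress and status:
# List executions of the job and check their status
az containerapp job execution list \
--name batch-training-job \
--resource-group MyRG \
--output table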
Monitoring and Observability
Application Insights Integration
az containerapp create \
--name myapp-monitored \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/myapp:latest \
--env-vars \
APPLICATIONINSIGHTS_CONNECTION_STRING=secretref:appinsights-connection
Query Logs
# Stream logs
az containerapp logs show \
--name myapp-gpu \
--resource-group MyRG \
--follow
# Query with Log Analytics
az monitor log-analytics query \
--workspace <workspace-id> \
--analytics-query "ContainerAppConsoleLogs_CL | where ContainerAppName_s == 'myapp-gpu' | take 100"
Metrics and Alerts
# Create a metric alert on request volume (a proxy for GPU load; adjust to the metrics available for your app)
az monitor metrics alert create \
--name high-gpu-usage \
--resource-group MyRG \
--scopes $(az containerapp show -g MyRG -n myapp-gpu --query id -o tsv) \
--condition "avg Requests > 100" \
--window-size 5m \
--evaluation-frequency 1m \
--action <action-group-id>
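The alert needs an action group ID for --action. A sketch of creating a simple email action group and capturing its ID; the group name, short name, and receiver are assumptions:
# Create an email action group (names and receiver are illustrative) and capture its ID
ACTION_GROUP_ID=$(az monitor action-group create \
--name gpu-alerts \
--short-name gpualerts \
--resource-group MyRG \
--action email ops ops@example.com \
--query id -o tsv)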
Security Best Practices
Managed Identity
# Create with system-assigned identity
az containerapp create \
--name myapp-identity \
--resource-group MyRG \
--environment myenv \
--system-assigned \
--image myregistry.azurecr.io/myapp:latest
# Get identity principal ID
IDENTITY_ID=$(az containerapp show -g MyRG -n myapp-identity --query identity.principalId -o tsv)
# Assign role to access Key Vault
az role assignment create \
--assignee $IDENTITY_ID \
--role "Key Vault Secrets User" \
--scope /subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.KeyVault/vaults/mykeyvault
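If the app pulls images with --registry-identity, the same identity also needs the AcrPull role on the registry. A hedged sketch, with the registry name as an assumption:
# Grant the app's identity pull access to the container registry (registry name is illustrative)
ACR_ID=$(az acr show --name myregistry --resource-group MyRG --query id -o tsv)
az role assignment create \
--assignee $IDENTITY_ID \
--role AcrPull \
--scope $ACR_ID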
# Use user-assigned identity
az identity create --name myapp-identity --resource-group MyRG
IDENTITY_RESOURCE_ID=$(az identity show -g MyRG -n myapp-identity --query id -o tsv)
az containerapp create \
--name myapp-user-identity \
--resource-group MyRG \
--environment myenv \
--user-assigned $IDENTITY_RESOURCE_ID \
--image myregistry.azurecr.io/myapp:latest
Secret Management
# Add secrets
az containerapp secret set \
--name myapp-gpu \
--resource-group MyRG \
--secrets \
huggingface-token="<token>" \
api-key="<key>"
# Reference secrets in environment variables
az containerapp update \
--name myapp-gpu \
--resource-group MyRG \
--set-env-vars \
HF_TOKEN=secretref:huggingface-token \
API_KEY=secretref:api-key
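Secrets can also be backed by Azure Key Vault rather than stored inline, using the keyvaultref syntax resolved through the app's managed identity. A sketch with an illustrative vault URI:
# Reference a Key Vault secret instead of storing the value inline (vault URI is illustrative)
az containerapp secret set \
--name myapp-gpu \
--resource-group MyRG \
--secrets "huggingface-token=keyvaultref:https://mykeyvault.vault.azure.net/secrets/hf-token,identityref:system"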
Cost Optimization
Scale-to-Zero Configuration
az containerapp create \
--name myapp-scale-zero \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/myapp:latest \
--min-replicas 0 \
--max-replicas 10 \
--scale-rule-name http-scaling \
--scale-rule-type http \
--scale-rule-http-concurrency 10
Cost savings: Pay only when requests are being processed. GPU costs are per-second when active.
Right-Sizing Resources
# Start with minimal resources
--cpu 2 --memory 4Gi --gpu-count 1
# Monitor and adjust based on actual usage
az monitor metrics list \
--resource $(az containerapp show -g MyRG -n myapp-gpu --query id -o tsv) \
--metric "CpuPercentage,MemoryPercentage"
Use Spot/Preemptible GPUs (Future Feature)
When available, configure spot instances for non-critical workloads to save up to 80% on GPU costs.
Troubleshooting
Check Revision Status
az containerapp revision list \
--name myapp-gpu \
--resource-group MyRG \
--output table
View Revision Details
az containerapp revision show \
--name myapp-gpu \
--revision <revision-name> \
--resource-group MyRG
Restart Container App
az containerapp revision restart \
--name myapp-gpu \
--resource-group MyRG \
--revision <revision-name>
GPU Not Available
If GPU is not provisioning:
- Check region availability: Not all regions support GPU
- Verify quota: Request quota increase if needed
- Check workload profile: Ensure GPU workload profile is created
Best Practices
✓ Use scale-to-zero for intermittent workloads
✓ Implement health probes (liveness and readiness)
✓ Use managed identities for authentication
✓ Store secrets in Azure Key Vault
✓ Enable Dapr for microservices patterns
✓ Configure appropriate scaling rules
✓ Monitor GPU utilization and adjust resources
✓ Use Container Apps jobs for batch processing
✓ Implement retry logic for transient failures
✓ Use Application Insights for observability
Azure Container Apps with GPU support provides a fully managed, serverless platform for cost-efficient AI/ML workloads.
Related Skills
sglang
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
langchain
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
llamaguard
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
