AWS Quota Management & Availability Handling Guide
Status: Planned for v0.6.0 (Q2 2026)
Priority: High
GitHub Issue: #57
Overview
AWS imposes service quotas (formerly called "limits") on resources to protect both AWS infrastructure and customer accounts. Researchers often encounter quota-related launch failures without understanding why or how to resolve them. CloudWorkStation v0.6.0 will provide intelligent quota management and automatic failover capabilities.
Problem Statement
Current Pain Points
- Opaque Quota Errors: Generic error messages like "The requested configuration is currently not supported" don't explain the underlying quota issue
- No Proactive Awareness: Users don't know they're approaching quota limits until launches fail
- Capacity Failures: InsufficientInstanceCapacity errors are not retried automatically
- Regional Outages: No awareness of AWS Health events affecting launches
Common Quota Types
| Quota Type | Common Default | What It Limits | 
|---|---|---|
| Running On-Demand Standard Instances | 32 vCPUs | Total vCPUs across A, C, D, H, I, M, R, T, Z instance families | 
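| Running On-Demand G and VT Instances | 8 vCPUs | G (GPU) and VT (video transcoding) instance families; the P, Inf, DL, and Trn families each have their own separate quota |
| Running On-Demand F Instances | 8 vCPUs | FPGA instances (F family) |
| Running On-Demand X Instances | 8 vCPUs | Memory-optimized X-family instances; the U family falls under a separate High Memory quota |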
| EBS General Purpose SSD (gp3) storage | 50 TiB | Total gp3 volume storage per region | 
| EBS Provisioned IOPS SSD (io2) storage | 50 TiB | Total io2 volume storage per region | 
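Example Scenario: A researcher has 24 P-family vCPUs running under a 32 vCPU quota for P instances and tries to launch a p3.8xlarge (32 vCPUs). This would require 56 total vCPUs, 24 over the limit, so the launch fails. P-family instances count against their own quota, not the Standard quota.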
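A pre-launch check has two parts: work out which quota class an instance type counts against, then do the arithmetic from the scenario. The sketch below is illustrative only. The names `QuotaClassFor`, `CheckVCPUQuota`, and `QuotaCheck` are hypothetical, and the prefix table is a simplified assumption about how EC2 groups families into On-Demand quotas (Mac and HPC families are deliberately left unhandled).

```go
package main

import "strings"

// quotaClasses maps instance-family prefixes to the On-Demand vCPU quota
// class they count against. Longer prefixes come first so that "inf1" is
// not read as "i". An empty class marks families this sketch does not handle.
var quotaClasses = []struct{ prefix, class string }{
	{"inf", "Inf"}, {"trn", "Trn"}, {"hpc", ""}, {"mac", ""},
	{"dl", "DL"}, {"vt", "G and VT"},
	{"p", "P"}, {"g", "G and VT"}, {"f", "F"}, {"x", "X"}, {"u", "High Memory"},
	{"a", "Standard"}, {"c", "Standard"}, {"d", "Standard"},
	{"h", "Standard"}, {"i", "Standard"}, {"m", "Standard"},
	{"r", "Standard"}, {"t", "Standard"}, {"z", "Standard"},
}

// QuotaClassFor returns the quota class for an instance type such as
// "p3.8xlarge". ok is false for malformed or unhandled types.
func QuotaClassFor(instanceType string) (class string, ok bool) {
	family, _, found := strings.Cut(strings.ToLower(instanceType), ".")
	if !found {
		return "", false
	}
	for _, q := range quotaClasses {
		if strings.HasPrefix(family, q.prefix) {
			return q.class, q.class != ""
		}
	}
	return "", false
}

// QuotaCheck is the outcome of a pre-launch vCPU check (hypothetical type).
type QuotaCheck struct {
	After int  // vCPUs in use if the launch succeeds
	Over  int  // vCPUs above the limit; 0 when the launch fits
	Fits  bool // true when After <= limit
}

// CheckVCPUQuota adds the requested vCPUs to current usage and compares
// the total with the limit for a single quota class in a single region.
func CheckVCPUQuota(inUse, requested, limit int) QuotaCheck {
	after := inUse + requested
	over := after - limit
	if over < 0 {
		over = 0
	}
	return QuotaCheck{After: after, Over: over, Fits: after <= limit}
}
```

With the scenario's numbers, `CheckVCPUQuota(24, 32, 32)` reports 56 vCPUs after launch, 24 over the limit.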
Planned Features (v0.6.0)
1. Quota Awareness System
Module: pkg/aws/quota_manager.go
Query and track AWS Service Quotas in real time.
CLI Commands
# Show current quota status for default region
cws admin quota show
# Show quota status for specific region
cws admin quota show --region us-west-2
# Show quota status across all regions
cws admin quota show --all-regions
# Show quota history and trends
cws admin quota history --days 30
Example Output
$ cws admin quota show --region us-west-2
📊 AWS Service Quotas - us-west-2
vCPU Limits:
  Standard (A, C, D, H, I, M, R, T, Z): 24/32 (75% used) ⚠️
  GPU (P, G, Inf, DL, Trn):             0/8 (0% used) ✅
  High Memory (X, U):                   0/8 (0% used) ✅
Instance Type Limits:
  p3.2xlarge:  0/2 instances in use ✅
  r5.xlarge:   3/5 instances in use ⚠️ (approaching limit)
  t3.medium:   8/20 instances in use ✅
Storage Quotas:
  EBS General Purpose (gp3):      3.2 TiB / 50 TiB ✅
  EBS Provisioned IOPS (io2):     0 TiB / 50 TiB ✅
  EFS Storage:                    73 GB (no regional limit) ✅
Recommendations:
  ⚠️  Standard vCPU usage at 75% - consider requesting increase
  ⚠️  r5.xlarge approaching instance limit (3/5 used)
  ✅ GPU quota sufficient for current workload
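The percentages and status markers in the output above could be derived as follows. This is a minimal sketch, not the planned implementation: the function names are hypothetical, and the 75% warning and 90% critical cut-offs used in the usage note are assumed from the alert thresholds mentioned in the FAQ.

```go
package main

// Status levels for one quota line.
const (
	StatusOK       = "ok"
	StatusWarning  = "warning"
	StatusCritical = "critical"
)

// PercentUsed returns usage as a whole percentage, rounded to nearest.
// A non-positive limit is reported as 0 to avoid dividing by zero.
func PercentUsed(used, limit int) int {
	if limit <= 0 {
		return 0
	}
	return (used*100 + limit/2) / limit
}

// Classify compares the exact ratio (not the rounded percentage) with the
// warning and critical thresholds, both given in percent.
func Classify(used, limit, warnPct, critPct int) string {
	switch {
	case limit <= 0:
		return StatusOK
	case used*100 >= critPct*limit:
		return StatusCritical
	case used*100 >= warnPct*limit:
		return StatusWarning
	default:
		return StatusOK
	}
}
```

For the Standard line above, `PercentUsed(24, 32)` is 75 and `Classify(24, 32, 75, 90)` is `"warning"`, which corresponds to the ⚠️ marker.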
Pre-Launch Quota Validation
CloudWorkStation will check quotas before attempting launch:
$ cws launch gpu-ml-workstation protein-folding --size XL
⚠️  Quota Check Failed
    Instance type: p3.8xlarge (32 vCPUs, 4 GPUs)
    Current usage: 24/32 vCPUs in us-west-2
    After launch: 56/32 vCPUs ❌ (24 vCPUs over limit)
    You need to request a vCPU quota increase:
    1. Visit AWS Service Quotas Console:
       https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-417A185B
    2. Request new limit: 64 vCPUs
       (Allows 2 simultaneous p3.8xlarge instances)
    3. Typical approval time: 24-48 hours
    Alternative Options:
    1. Launch p3.2xlarge instead (8 vCPUs, 1 GPU)
    2. Stop existing instances to free quota
    3. Cancel launch
Choice [1-3]:
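One way to produce the "launch a smaller instance instead" suggestion is to pick the largest size that still fits in the remaining quota. The sketch below assumes a caller-supplied list of candidate sizes; `Candidate` and `LargestFit` are hypothetical names, not part of CloudWorkStation.

```go
package main

// Candidate is a hypothetical catalog entry for one instance size.
type Candidate struct {
	Type  string
	VCPUs int
}

// LargestFit returns the candidate with the most vCPUs that still fits in
// the remaining quota (limit - inUse). ok is false when nothing fits.
func LargestFit(inUse, limit int, candidates []Candidate) (best Candidate, ok bool) {
	headroom := limit - inUse
	for _, c := range candidates {
		if c.VCPUs <= headroom && (!ok || c.VCPUs > best.VCPUs) {
			best, ok = c, true
		}
	}
	return best, ok
}
```

With 24 of 32 vCPUs in use and candidates p3.2xlarge (8), p3.8xlarge (32), and p3.16xlarge (64), the function selects p3.2xlarge, as in the prompt above.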
2. Quota Increase Assistance
Module: pkg/aws/quota_requests.go
Help users navigate the quota increase request process.
CLI Commands
# Request quota increase with guided workflow
cws admin quota request --instance-type p3.2xlarge \
  --reason "ML research for NIH-funded genomics project" \
  --desired-limit 16
# Check status of pending quota requests
cws admin quota requests list
# View quota request history
cws admin quota requests history
Guided Workflow
$ cws admin quota request --instance-type p3.8xlarge
🔍 Analyzing current usage...
Current Quota: 32 vCPUs (P instances)
Current Usage: 24 vCPUs
Requested Instance: p3.8xlarge (32 vCPUs, 4 GPUs)
📋 Quota Increase Request Wizard
1. How many p3.8xlarge instances do you need to run simultaneously?
   [ 2 ]
2. What is the use case? (helps AWS approve faster)
   [ ] Production workload
   [x] Research / Education
   [ ] Development / Testing
   [ ] Disaster recovery
3. Brief description (shown to AWS):
   [ Cancer genomics research using deep learning for tumor classification.
     NIH R01-funded project requiring GPU compute for PyTorch model training. ]
4. Duration of need:
   [x] Ongoing (default)
   [ ] Temporary (specify end date)
✅ Request Summary:
   Current Limit: 32 vCPUs
   Requested Limit: 64 vCPUs
   Justification: Research workload, NIH-funded cancer genomics project
   This request will be submitted to AWS Service Quotas.
   Typical approval time: 24-48 hours
   You will receive email notification when approved.
Submit request? [Y/n]: y
✅ Quota increase request submitted!
   Request ID: quota-12345678
   Track status: cws admin quota requests list
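The requested limit in the wizard is the number of simultaneous instances multiplied by the vCPUs per instance (2 × 32 = 64). A minimal sketch of that step, using the hypothetical name `DesiredLimit`:

```go
package main

import "fmt"

// DesiredLimit returns the vCPU limit needed to run `instances` copies of an
// instance type with `vcpusEach` vCPUs at the same time. It returns an error
// when the result would not raise the current limit, because no request is
// needed in that case.
func DesiredLimit(currentLimit, instances, vcpusEach int) (int, error) {
	if instances <= 0 || vcpusEach <= 0 {
		return 0, fmt.Errorf("instances and vcpusEach must be positive")
	}
	want := instances * vcpusEach
	if want <= currentLimit {
		return 0, fmt.Errorf("current limit %d already covers %d vCPUs", currentLimit, want)
	}
	return want, nil
}
```

Like the wizard, this sketch does not add vCPUs that are already in use; a stricter version could include current usage so that existing instances keep running alongside the new ones.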
3. Intelligent AZ Failover
Module: pkg/aws/availability_manager.go
Automatic retry in different Availability Zones when capacity is unavailable.
How It Works
- Detect InsufficientInstanceCapacity error from EC2
- Automatically retry in different AZ within same region
- Track AZ health per instance type (success rate)
- Prefer AZs with recent successful launches
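The retry part of the steps above can be sketched as a loop over an ordered list of AZs. This is an illustration under stated assumptions: `ErrInsufficientCapacity` is a stand-in sentinel (real code would inspect the error code returned by the EC2 SDK), and `LaunchFunc` and `LaunchWithFailover` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientCapacity stands in for EC2's InsufficientInstanceCapacity
// error code in this sketch.
var ErrInsufficientCapacity = errors.New("InsufficientInstanceCapacity")

// LaunchFunc attempts a launch in one Availability Zone.
type LaunchFunc func(az string) error

// LaunchWithFailover tries the AZs in the given order (most preferred first).
// It moves on to the next AZ only after a capacity error, makes at most
// maxRetries retries after the first attempt, and returns the AZ that worked.
func LaunchWithFailover(azs []string, maxRetries int, launch LaunchFunc) (string, error) {
	if len(azs) == 0 {
		return "", errors.New("no availability zones to try")
	}
	if maxRetries < 0 {
		maxRetries = 0
	}
	attempts := maxRetries + 1
	if attempts > len(azs) {
		attempts = len(azs)
	}
	for _, az := range azs[:attempts] {
		err := launch(az)
		if err == nil {
			return az, nil
		}
		if !errors.Is(err, ErrInsufficientCapacity) {
			// Not a capacity problem, so another AZ is unlikely to help.
			return "", fmt.Errorf("launch in %s: %w", az, err)
		}
	}
	return "", fmt.Errorf("no capacity in %d AZ(s): %w", attempts, ErrInsufficientCapacity)
}
```

Errors other than capacity shortages stop the loop immediately, since a quota or permission failure would repeat in every AZ.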
User Experience
$ cws launch bioinformatics-suite genome-analysis
✅ Launching r5.4xlarge in us-west-2a...
⚠️  InsufficientInstanceCapacity in us-west-2a
    AWS reports this instance type is temporarily unavailable in us-west-2a
🔄 Retrying in us-west-2b...
✅ Successfully launched in us-west-2b!
🔗 SSH ready in ~90 seconds...
💡 Note: Future launches will prefer us-west-2b for r5.4xlarge
   (Recent success rate: us-west-2b: 95%, us-west-2a: 60%)
Configuration
# Configure AZ failover behavior
cws admin config set az-failover.max-retries 3
cws admin config set az-failover.prefer-successful-azs true
# View AZ health statistics
cws admin availability stats --region us-west-2
# Output:
# 📊 Availability Zone Health - us-west-2
#
# r5.4xlarge:
#   us-west-2a: 12/20 launches successful (60%) ⚠️
#   us-west-2b: 19/20 launches successful (95%) ✅
#   us-west-2c: 18/20 launches successful (90%) ✅
#   us-west-2d: 15/20 launches successful (75%) ⚠️
#
# Recommendation: Prefer us-west-2b or us-west-2c for r5.4xlarge
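The preference order behind a recommendation like the one above can be computed by sorting AZs on their recorded success rate. A minimal sketch, with hypothetical names `AZStats` and `RankAZs`:

```go
package main

import "sort"

// AZStats records launch outcomes for one instance type in one AZ.
type AZStats struct {
	AZ      string
	Success int
	Total   int
}

// Rate returns the success rate in [0, 1]; an AZ with no history scores 0.
func (s AZStats) Rate() float64 {
	if s.Total == 0 {
		return 0
	}
	return float64(s.Success) / float64(s.Total)
}

// RankAZs returns AZ names ordered from highest to lowest success rate.
// The sort is stable, so ties keep their input order.
func RankAZs(stats []AZStats) []string {
	sorted := append([]AZStats(nil), stats...) // copy; leave the caller's slice alone
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Rate() > sorted[j].Rate()
	})
	names := make([]string, len(sorted))
	for i, s := range sorted {
		names[i] = s.AZ
	}
	return names
}
```

Fed the r5.4xlarge numbers from the output above, the order is us-west-2b, us-west-2c, us-west-2d, us-west-2a. Scoring untried AZs as 0 is a design choice of this sketch; an implementation might instead give them a neutral prior so they are still explored.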
4. AWS Health Dashboard Integration
Module: pkg/aws/health_monitor.go
Monitor AWS Health API for service events affecting launches.
Features
- Detect regional outages, degraded performance, scheduled maintenance
- Proactive notifications before launch attempts
- Block launches to affected regions with clear explanations
- Auto-suggest alternative healthy regions
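The block-and-suggest behaviour listed above could look like the following. `HealthEvent` is a simplified, hypothetical view of an AWS Health event, and `PreLaunchCheck` is an illustrative name; how events are fetched is outside this sketch.

```go
package main

// HealthEvent is a simplified, hypothetical view of an AWS Health event.
type HealthEvent struct {
	Region string
	Code   string // event type code shown to the user
	Open   bool   // false once the event has been closed
}

// PreLaunchCheck reports whether the target region has an open event and,
// if so, which of the candidate regions are currently free of open events.
func PreLaunchCheck(target string, candidates []string, events []HealthEvent) (blocked bool, alternatives []string) {
	affected := make(map[string]bool)
	for _, e := range events {
		if e.Open {
			affected[e.Region] = true
		}
	}
	if !affected[target] {
		return false, nil
	}
	for _, r := range candidates {
		if r != target && !affected[r] {
			alternatives = append(alternatives, r)
		}
	}
	return true, alternatives
}
```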
CLI Commands
# Check AWS health status for all regions
cws admin aws-health
# Check specific region
cws admin aws-health --region us-east-1
# Subscribe to health alerts
cws admin aws-health subscribe --email devops@university.edu
Pre-Launch Health Check
$ cws launch python-ml earthquake-prediction --region us-east-1
⚠️  AWS Health Alert: Degraded EC2 Performance in us-east-1
    Event ID: AWS_EC2_INSTANCE_LAUNCH_FAILURE
    Status: Open (AWS engineers investigating)
    Started: 15 minutes ago
    Impact: Elevated instance launch failures
    Affected AZs: us-east-1a, us-east-1b
    Details: Increased error rates for On-Demand instance launches.
    AWS is actively working to resolve this issue.
    Recommendations:
    1. Use us-west-2 (healthy) ✅
    2. Use eu-west-1 (healthy) ✅
    3. Wait ~30 minutes for resolution ⏱️
    4. Launch anyway (may experience delays) ⚠️
Choice [1-4]:
Important: AWS Health API Requirements
Programmatic access to the AWS Health API requires a Business Support plan or higher.
| Support Tier | Health API Access | Minimum Cost |
|---|---|---|
| Basic | Console only | Free |
| Developer | Console only | From $29/month |
| Business | Full API access | From $100/month |
| Enterprise | Full API access | From $15,000/month |
The listed prices are plan minimums; AWS bills paid tiers as the greater of the minimum or a percentage of monthly AWS usage.
CloudWorkStation will gracefully degrade if Health API is unavailable (Basic/Developer support).
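Graceful degradation here means treating "no Health API access" as "status unknown" rather than as a launch failure. In the sketch below, `ErrSubscriptionRequired` is a stand-in for the SubscriptionRequiredException that the AWS Health API returns on accounts without a qualifying support plan; `RegionHealth`, `HealthLookup`, and `CheckRegion` are hypothetical names.

```go
package main

import "errors"

// ErrSubscriptionRequired stands in for the SubscriptionRequiredException
// returned by the AWS Health API on accounts without a qualifying plan.
var ErrSubscriptionRequired = errors.New("SubscriptionRequiredException")

// RegionHealth is the result of a health lookup. Known is false when the
// Health API could not be consulted, so Healthy is only an assumption.
type RegionHealth struct {
	Known   bool
	Healthy bool
}

// HealthLookup queries health for a region; real code would call AWS Health.
type HealthLookup func(region string) (healthy bool, err error)

// CheckRegion degrades gracefully: a missing subscription yields an
// "unknown, assume healthy" result instead of an error, so launches proceed.
func CheckRegion(region string, lookup HealthLookup) (RegionHealth, error) {
	healthy, err := lookup(region)
	switch {
	case err == nil:
		return RegionHealth{Known: true, Healthy: healthy}, nil
	case errors.Is(err, ErrSubscriptionRequired):
		return RegionHealth{Known: false, Healthy: true}, nil
	default:
		return RegionHealth{}, err
	}
}
```

Keeping `Known` separate from `Healthy` lets the CLI tell the user that health data is unavailable instead of silently reporting a region as healthy.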
5. Capacity Planning
Module: pkg/aws/capacity_planner.go
Analyze historical launch patterns and recommend optimal regions/AZs.
Features
- Track launch success rates per region/AZ/instance-type
- Recommend regions with best availability
- Warn about high-demand instance types
- Suggest Spot instances when On-Demand capacity constrained
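Tracking success rates per region and instance type, as listed above, amounts to aggregating raw launch records and sorting the result. A minimal sketch with hypothetical types `LaunchRecord` and `RegionRate`; where the records are stored is not specified here.

```go
package main

import "sort"

// LaunchRecord is one historical launch attempt (hypothetical schema).
type LaunchRecord struct {
	Region       string
	InstanceType string
	Succeeded    bool
}

// RegionRate summarizes launch outcomes for one region.
type RegionRate struct {
	Region    string
	Succeeded int
	Total     int
}

// Percent returns the success rate as a whole percentage, rounded to nearest.
func (r RegionRate) Percent() int {
	if r.Total == 0 {
		return 0
	}
	return (r.Succeeded*100 + r.Total/2) / r.Total
}

// RecommendRegions aggregates records for one instance type and returns
// regions ordered by success rate, best first; ties are ordered by name.
func RecommendRegions(records []LaunchRecord, instanceType string) []RegionRate {
	byRegion := make(map[string]*RegionRate)
	for _, rec := range records {
		if rec.InstanceType != instanceType {
			continue
		}
		rr, ok := byRegion[rec.Region]
		if !ok {
			rr = &RegionRate{Region: rec.Region}
			byRegion[rec.Region] = rr
		}
		rr.Total++
		if rec.Succeeded {
			rr.Succeeded++
		}
	}
	out := make([]RegionRate, 0, len(byRegion))
	for _, rr := range byRegion {
		out = append(out, *rr)
	}
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		// Cross-multiply to compare rates without floating point.
		if l, r := a.Succeeded*b.Total, b.Succeeded*a.Total; l != r {
			return l > r
		}
		return a.Region < b.Region
	})
	return out
}
```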
CLI Commands
# Get capacity recommendations for instance type
cws admin capacity recommend --instance-type p3.8xlarge
# View historical capacity data
cws admin capacity history --instance-type p3.8xlarge --days 30
Example Output
$ cws admin capacity recommend --instance-type p3.8xlarge
📊 Capacity Recommendations: p3.8xlarge
Best Regions (Last 30 days):
  1. us-west-2:  98% success rate (287/292 launches) ✅
  2. us-east-1:  94% success rate (245/261 launches) ✅
  3. eu-west-1:  91% success rate (156/171 launches) ✅
High-Demand Instance Type: ⚠️
  - p3.8xlarge is frequently capacity-constrained
  - Success rate varies significantly by AZ
  - Consider using Spot instances (60-80% cost savings)
Alternative Options:
  - p3.16xlarge: 92% success rate (more availability)
  - g5.12xlarge: 97% success rate (newer generation, better availability)
Spot Instance Recommendation: ✅
  - Spot availability: 95% (rarely interrupted)
  - Cost savings: $17.10/hr → $5.13/hr (70% off)
  - Recommended for workloads that can tolerate interruption
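The "70% off" figure in the Spot recommendation is the relative difference between the two hourly prices. As a worked example, with the hypothetical helper `SavingsPercent`:

```go
package main

import "math"

// SavingsPercent returns the discount of the Spot price relative to the
// On-Demand price as a whole percentage. It returns 0 when the On-Demand
// price is not positive.
func SavingsPercent(onDemand, spot float64) int {
	if onDemand <= 0 {
		return 0
	}
	return int(math.Round((onDemand - spot) / onDemand * 100))
}
```

For the prices shown above, (17.10 − 5.13) / 17.10 = 0.70, so `SavingsPercent(17.10, 5.13)` returns 70.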
Integration with Persona Workflows
Solo Researcher (Persona 01)
Benefit: Pre-launch quota validation prevents failed launches and wasted time
- Check quota before launching expensive GPU instance
- Guided quota request for ML workload
- Automatic AZ failover for high-availability
Lab Environment (Persona 02)
Benefit: Multi-user quota management across lab projects
- Lab-wide quota tracking (all team members' usage)
- Proactive alerts when lab approaches quota limits
- Coordinated quota increase requests
University Class (Persona 03)
Benefit: Prevent student launch failures during class
- Pre-class quota validation (50 students launching simultaneously)
- Request quota increase for semester before classes start
- Real-time AZ failover during high-demand periods
Conference Workshop (Persona 04)
Benefit: Ensure 60-participant workshop launches reliably
- Pre-event quota validation and increase requests
- AWS Health monitoring to detect regional issues
- Automatic AZ failover for workshop instances
Cross-Institutional (Persona 05)
Benefit: Multi-region quota management for distributed collaborators
- Quota tracking across all collaborator regions
- Regional health monitoring for optimal placement
- Capacity planning for large-scale multi-institution launches
NIH CUI/PHI Compliance (Personas 06-07)
Benefit: Compliance-aware quota management
- Ensure compliant regions have sufficient quota
- Health monitoring for compliance-critical regions
- Documented quota requests for audit trails
Institutional IT (Persona 08)
Benefit: Institution-wide quota monitoring and management
- Centralized quota dashboard for all researchers
- Automated quota increase requests with institutional justification
- Cost-optimized capacity planning across departments
Administrator Features
Dashboard View
$ cws admin quota dashboard
╔══════════════════════════════════════════════════════════════════════╗
║             CloudWorkStation Quota Dashboard - us-west-2            ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  Overall Health: ✅ Healthy                                          ║
║  Active Researchers: 47                                              ║
║  Running Instances: 123                                              ║
║                                                                      ║
╠══════════════════════════════════════════════════════════════════════╣
║  Quota Status                                                        ║
║                                                                      ║
║  vCPU Quotas:                                                        ║
║    Standard: ████████████░░░░░░░░  1,247/2,048 (61%) ✅              ║
║    GPU:      ███░░░░░░░░░░░░░░░░░    32/256 (13%) ✅                 ║
║                                                                      ║
║  At-Risk Researchers:                                                ║
║    dr-johnson: 28/32 vCPUs (88%) ⚠️                                  ║
║    lab-team-3: 62/64 vCPUs (97%) 🚨                                  ║
║                                                                      ║
║  Pending Quota Requests:                                             ║
║    GPU vCPUs: 256 → 512 (requested 2 days ago) ⏳                    ║
║    Standard vCPUs: 2,048 → 4,096 (approved!) ✅                      ║
║                                                                      ║
╠══════════════════════════════════════════════════════════════════════╣
║  Regional Health                                                     ║
║                                                                      ║
║    us-west-2: ✅ Healthy                                             ║
║    us-east-1: ⚠️  Degraded (API_ISSUE, started 45m ago)              ║
║    eu-west-1: ✅ Healthy                                             ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝
Press 'r' to refresh | Press 'q' to quit
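The usage bars in a dashboard like this can be drawn by scaling used/limit to a fixed number of cells. A minimal sketch, assuming a 20-cell bar and the hypothetical name `UsageBar`:

```go
package main

import "strings"

// UsageBar draws a fixed-width bar whose filled part is proportional to
// used/limit, rounded to the nearest cell and clamped to [0, width].
func UsageBar(used, limit, width int) string {
	if width <= 0 {
		return ""
	}
	filled := 0
	if limit > 0 {
		filled = (used*width + limit/2) / limit
	}
	if filled < 0 {
		filled = 0
	}
	if filled > width {
		filled = width
	}
	return strings.Repeat("█", filled) + strings.Repeat("░", width-filled)
}
```

Under this rounding, 1,247/2,048 fills 12 of 20 cells and 32/256 fills 3 of 20. Usage above the limit is clamped to a full bar.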
Implementation Timeline
| Component | Target | Estimate | Priority | 
|---|---|---|---|
| Quota Awareness | v0.6.0 Sprint 1 | 4-5 days | High | 
| AZ Failover | v0.6.0 Sprint 1 | 3-4 days | High | 
| Quota Increase Assistance | v0.6.0 Sprint 2 | 3-4 days | Medium | 
| AWS Health Integration | v0.6.0 Sprint 2 | 3-4 days | Medium | 
| Capacity Planning | v0.6.0 Sprint 3 | 4-5 days | Low | 
Total Effort: 2-3 weeks
Target Release: v0.6.0 (Q2 2026)
Related Documentation
- Technical Debt Backlog: TECHNICAL_DEBT_BACKLOG.md (Item #2)
- GitHub Issues: #57, #58, #59, #60
- AWS IAM Permissions: AWS_IAM_PERMISSIONS.md - Required permissions for quota APIs
- Administrator Guide: ADMINISTRATOR_GUIDE.md - General administration
FAQ
Q: Will this work with AWS Budgets?
A: Yes. Quota management complements AWS Budgets: quotas limit what you can launch, while budgets limit how much you spend. CloudWorkStation integrates both.
Q: Can I request quota increases automatically?
A: Not without your confirmation. CloudWorkStation pre-fills the request and submits it to AWS Service Quotas only after you approve it. AWS approves some smaller increases automatically, while larger ones go to AWS Support for review.
Q: What if I don't have Business Support for the Health API?
A: CloudWorkStation will degrade gracefully. Basic quota management and AZ failover will still work; only health monitoring requires Business or Enterprise Support.
Q: How often are quotas checked?
A: Quotas are looked up before every launch, and results are cached for 5 minutes. You can force a refresh with cws admin quota show --refresh.
Q: Can I set custom quota thresholds for alerts?
A: Yes. Configure them via cws admin config set quota.warn-threshold 75 (default alert levels: 75% and 90%).
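The 5-minute quota cache and the --refresh override mentioned in the FAQ could be structured as a small TTL cache. This is a sketch under assumptions: `QuotaCache`, `FetchFunc`, and `DefaultTTL` are hypothetical names, and the fetch function stands in for a real Service Quotas lookup.

```go
package main

import (
	"sync"
	"time"
)

// DefaultTTL matches the 5-minute cache lifetime mentioned in the FAQ.
const DefaultTTL = 5 * time.Minute

// FetchFunc loads the current quota value for a region; real code would
// query the Service Quotas API here.
type FetchFunc func(region string) (int, error)

type entry struct {
	value     int
	fetchedAt time.Time
}

// QuotaCache caches per-region quota values for a fixed TTL.
type QuotaCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   FetchFunc
	entries map[string]entry
}

// NewQuotaCache builds an empty cache around the given fetch function.
func NewQuotaCache(ttl time.Duration, fetch FetchFunc) *QuotaCache {
	return &QuotaCache{ttl: ttl, fetch: fetch, entries: make(map[string]entry)}
}

// Get returns the cached value when it is younger than the TTL. Passing
// refresh=true (the --refresh case) always fetches a fresh value.
func (c *QuotaCache) Get(region string, refresh bool) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[region]; ok && !refresh && time.Since(e.fetchedAt) < c.ttl {
		return e.value, nil
	}
	v, err := c.fetch(region)
	if err != nil {
		return 0, err // keep any stale entry; the caller decides how to proceed
	}
	c.entries[region] = entry{value: v, fetchedAt: time.Now()}
	return v, nil
}
```

Holding the mutex during the fetch keeps the sketch simple and prevents duplicate lookups for the same region, at the cost of serializing lookups across regions.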
Last Updated: October 20, 2025
Status: Planned
Next Review: v0.6.0 Implementation Kickoff