Azure Scenario
Soak Test with Chaos Studio Resilience Drills
A multi-hour load test runs against AKS while Chaos Studio injects pod kills, added latency, and an availability-zone failure; a Monitor SLO decides pass/fail.
Architecture
Azure Load Testing (sustained 500 RPS for 6h)
─► AKS workload under traffic
Chaos Studio experiment (parallel branches)
├─► kill 30% pods in cart-svc
├─► add 200ms latency to checkout
├─► simulate AZ-2 loss
└─► restore
Monitor SLO (success>99%, p95<1s) ─► alert if breached ─► Service Bus ─► on-call
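The SLO gate in the diagram (success above 99%, p95 below 1 s) can be sketched as a small check over request samples. This is only an illustration of the pass/fail rule, not how Azure Monitor evaluates it: the `Sample` type, the `evaluate_slo` name, and the nearest-rank percentile method are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observed request (hypothetical shape, not an Azure type)."""
    latency_ms: float
    ok: bool


def p95(values):
    """Nearest-rank 95th percentile; integer math avoids float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def evaluate_slo(samples, min_success=0.99, max_p95_ms=1000.0):
    """Apply the scenario's thresholds: success > 99% and p95 < 1 s."""
    if not samples:
        raise ValueError("no samples to evaluate")
    success_rate = sum(s.ok for s in samples) / len(samples)
    latency_p95 = p95([s.latency_ms for s in samples])
    return {
        "success_rate": success_rate,
        "p95_ms": latency_p95,
        "passed": success_rate > min_success and latency_p95 < max_p95_ms,
    }
```

Both comparisons are strict, matching the `>` and `<` in the diagram, so a window that lands exactly on 99% success counts as a breach.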
Steps
- 1. Baseline
Load test sustains 500 RPS for 30 minutes to establish baseline metrics.
- 2. Inject faults under load
Chaos Studio runs the experiment, with its fault branches in parallel, during the soak window.
- 3. Observe
App Insights traces show the degraded calls; Monitor SLO tracks the error-budget burn rate in real time.
- 4. Recover
When the experiment ends, the system should self-heal within the RTO; if the SLO is breached, an automatic rollback is triggered.
- 5. Report
A Function publishes the results to Service Bus; the dashboard updates with a pass/fail result per fault.
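Steps 3 and 5 can be illustrated with a short sketch: a burn-rate calculation and a per-fault pass/fail report serialized as JSON. The function names, the report fields, and the input shape are all hypothetical, and the 99% target and 1 s limit are taken from the architecture above. Actually sending the message to Service Bus is left out.

```python
import json


def burn_rate(failed, total, slo_target=0.99):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the error budget exactly at the allowed pace;
    anything above 1.0 spends it faster.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    budget = 1.0 - slo_target
    return (failed / total) / budget


def build_report(fault_windows, slo_target=0.99, max_p95_ms=1000.0):
    """Build a JSON report with one pass/fail entry per injected fault.

    Each window is a dict with 'fault', 'total', 'failed' and 'p95_ms'
    (an assumed shape for this sketch).
    """
    faults = []
    for w in fault_windows:
        success_rate = (w["total"] - w["failed"]) / w["total"]
        faults.append({
            "fault": w["fault"],
            "burn_rate": burn_rate(w["failed"], w["total"], slo_target),
            "p95_ms": w["p95_ms"],
            "passed": success_rate > slo_target and w["p95_ms"] < max_p95_ms,
        })
    overall = "pass" if all(f["passed"] for f in faults) else "fail"
    return json.dumps({"overall": overall, "faults": faults})
```

With made-up inputs, a window of 1,000 requests and 20 failures gives a burn rate of about 2, meaning the 1% budget is being spent at twice the allowed pace. The returned string is the kind of body a Function could publish for the dashboard to consume.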
Takeaways
- Resilience is measured, not assumed.
- Short load tests miss leaks; a soak test with chaos is the more realistic test.
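One way to make the leak point concrete is to fit a trend line to memory samples collected over the soak window. The sketch below uses an ordinary least-squares slope. The function names and the 5 MiB/hour threshold are assumptions for illustration, not values from the scenario.

```python
def slope_per_hour(samples):
    """Least-squares slope of (elapsed_hours, memory_mib) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def looks_like_leak(samples, max_growth_mib_per_hour=5.0):
    """Flag steady memory growth above an assumed threshold."""
    return slope_per_hour(samples) > max_growth_mib_per_hour
```

A leak growing at 20 MiB per hour adds only 10 MiB during a 30-minute run, which is easy to mistake for noise, but 120 MiB over a 6-hour soak.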
