End-to-End QA Scenario
Soak + Stress Testing with FIS Resilience Drills
A 48-hour soak test surfaces memory leaks while AWS Fault Injection Service (FIS) injects faults such as AZ loss and CPU pressure; an SLO burn triggers automatic rollback.
Architecture
k6 sustained 60% capacity for 48h ─► EKS services
│
├─ hour 6 : FIS — kill 25% pods in svc-A
├─ hour 12 : FIS — saturate CPU on RDS reader
├─ hour 24 : FIS — simulate AZ-b loss
└─ hour 36 : FIS — add 200ms p99 latency to dep-X
CloudWatch composite alarm on SLO burn ─► SNS ─► PagerDuty + auto-revert helm release
Workflow steps
- 1. Baseline soak: k6 sustains 60% of peak capacity for 48 hours while CloudWatch tracks heap usage, GC pauses, and RDS connections.
- 2. Scripted faults: an FIS experiment template fires four faults at preset times to test resilience under load.
- 3. Verify SLO: a composite alarm on success-rate and p95 burn rate determines whether the system survived each fault window.
- 4. Auto-revert: on SLO burn, SNS triggers a Lambda that runs `helm rollback` and pages on-call.
- 5. Post-mortem: an Athena report merges k6 results, X-Ray traces, and the event history into a single timeline for the review.
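Steps 3 and 4 can be sketched together in Python. This is a minimal illustration under stated assumptions, not the scenario's actual implementation: the 99.9% SLO target, the release name `svc-a`, and the namespace `prod` are hypothetical, and the handler assumes the Lambda package ships a `helm` binary with cluster credentials. It also assumes the SNS message carries the `NewStateValue` field that CloudWatch alarm notifications normally include.

```python
import json
import subprocess

# Assumed SLO target; the scenario does not state one.
SLO_TARGET = 0.999


def burn_rate(failed, total, slo_target=SLO_TARGET):
    """Observed error rate divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly on schedule;
    higher values mean it will run out early.
    """
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


def helm_rollback_cmd(release, namespace):
    """Build a `helm rollback` command.

    Omitting the revision argument rolls back to the previous release.
    """
    return ["helm", "rollback", release, "--namespace", namespace, "--wait"]


def handler(event, context=None, run=subprocess.run):
    """Lambda entry point for an SNS-delivered CloudWatch alarm notification.

    `run` is injectable so the handler can be exercised without helm installed.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        # Ignore OK and INSUFFICIENT_DATA transitions.
        return {"rolled_back": False}
    cmd = helm_rollback_cmd("svc-a", "prod")  # hypothetical release and namespace
    run(cmd, check=True)  # raises if helm exits non-zero
    return {"rolled_back": True, "command": cmd}
```

With these assumptions, 5 failures in 1,000 requests is a burn rate of about 5, meaning the error budget is being consumed roughly five times faster than the SLO allows. Passing a fake `run` callable lets the rollback path be verified in a unit test before the soak starts.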
Key takeaways
- Short load tests miss leaks — soak finds what a 30-minute run can't.
- Inject faults during load, not in a quiet environment, to see realistic blast radius.
- Automate rollback so the test itself proves the recovery path works.
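To make the first takeaway concrete, one simple way to flag a leak is to fit a straight line to heap readings taken across the soak and check its slope. The sketch below is an assumption about how that check might be done, not a method the scenario prescribes; the sample values and the 1 MB/hour threshold are hypothetical.

```python
def heap_slope_mb_per_hour(samples):
    """Least-squares slope of (elapsed_hours, heap_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    cov = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var


def looks_like_leak(samples, threshold_mb_per_hour=1.0):
    """Flag sustained heap growth above the (assumed) threshold."""
    return heap_slope_mb_per_hour(samples) > threshold_mb_per_hour


# Hypothetical post-GC heap readings in MB, one every 12 hours of the soak.
leaky = [(0, 512.0), (12, 548.0), (24, 584.0), (36, 620.0), (48, 656.0)]
steady = [(0, 512.0), (12, 515.0), (24, 510.0), (36, 514.0), (48, 512.0)]

print(looks_like_leak(leaky), looks_like_leak(steady))
```

In the hypothetical leaky series the heap grows by 3 MB per hour. Over a 30-minute run that is only about 1.5 MB, which is easy to mistake for noise; over 48 hours it adds up to 144 MB.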
