End-to-End QA Scenario

Soak + Stress Testing with FIS Resilience Drills

A 48-hour soak test surfaces memory leaks while FIS injects pod kills, CPU saturation, an AZ loss, and dependency latency; the Helm release is rolled back automatically on SLO burn.

Architecture

k6 sustained 60% capacity for 48h ─► EKS services
           │
           ├─ hour 6   : FIS — kill 25% pods in svc-A
           ├─ hour 12  : FIS — saturate CPU on RDS reader
           ├─ hour 24  : FIS — simulate AZ-b loss
           └─ hour 36  : FIS — add 200ms p99 latency to dep-X
CloudWatch composite alarm on SLO burn ─► SNS ─► PagerDuty + auto-revert helm release
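
The fault timeline in the diagram can be written down as plain data, which makes it easy to ask which injections have already fired at a given point in the soak. This is a minimal sketch, not an FIS template: the `Fault` class and the `faults_fired` and `next_fault` helpers are hypothetical names introduced here, and only the hours, targets, and actions come from the diagram.

```python
from dataclasses import dataclass
from typing import List, Optional

SOAK_HOURS = 48  # total soak duration from the diagram


@dataclass(frozen=True)
class Fault:
    start_hour: int  # hours after the soak begins
    target: str
    action: str


# The four injections from the diagram, in firing order.
SCHEDULE = [
    Fault(6, "svc-A", "kill 25% of pods"),
    Fault(12, "RDS reader", "saturate CPU"),
    Fault(24, "AZ-b", "simulate AZ loss"),
    Fault(36, "dep-X", "add 200 ms p99 latency"),
]


def faults_fired(elapsed_hours: float) -> List[Fault]:
    """Return every fault whose start time has already passed."""
    return [f for f in SCHEDULE if f.start_hour <= elapsed_hours]


def next_fault(elapsed_hours: float) -> Optional[Fault]:
    """Return the next fault still to fire, or None once all have fired."""
    upcoming = [f for f in SCHEDULE if f.start_hour > elapsed_hours]
    return min(upcoming, key=lambda f: f.start_hour) if upcoming else None
```

Keeping the schedule as data also lets the post-mortem timeline label each fault window without re-reading the experiment definition.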

Workflow steps

1. Baseline soak: k6 sustains 60% of peak capacity for 48 hours while CloudWatch tracks heap usage, GC pauses, and RDS connections.
2. Scripted faults: an FIS experiment template fires four faults at preset times to test resilience under load.
3. Verify SLO: a composite alarm on success-rate and p95 burn rate determines whether the system survived each fault window.
4. Auto-revert: on SLO burn, SNS triggers a Lambda that runs `helm rollback` and pages on-call.
5. Post-mortem: an Athena report merges k6 results, X-Ray traces, and the event log into a single timeline for the review.
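
The SLO check and the revert decision in the steps above can be sketched as follows. Burn rate is taken here in its usual sense: the observed bad-event rate divided by the error budget (1 minus the SLO target). The targets, the threshold of 10, and the choice to trip when either signal burns are assumptions for illustration; the source does not state them. `rollback_command` only builds the argument list for `helm rollback` (omitting the revision reverts to the previous release) and assumes a Helm binary is available wherever the handler runs.

```python
from typing import List


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Ratio of the observed bad-event rate to the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly as fast as the SLO allows.
    """
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    budget = 1.0 - slo_target
    return (bad_events / total_events) / budget


def slo_burning(failed: int, slow: int, total: int,
                success_slo: float = 0.999,  # assumed target, not from the source
                latency_slo: float = 0.95,   # assumed: 95% of requests under the latency limit
                threshold: float = 10.0) -> bool:  # assumed burn-rate threshold
    """Composite check: trips when either burn rate reaches the threshold."""
    return (burn_rate(failed, total, success_slo) >= threshold
            or burn_rate(slow, total, latency_slo) >= threshold)


def rollback_command(release: str, namespace: str) -> List[str]:
    """Build the argv that reverts a release to its previous revision."""
    return ["helm", "rollback", release, "--namespace", namespace, "--wait"]
```

With these assumed numbers, 20 failures in 1,000 requests is a burn rate of about 20 against a 99.9% target, so the check trips; one failure in 1,000 is a burn rate of about 1 and does not.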

Key takeaways

Short load tests miss leaks — soak finds what a 30-minute run can't.
Inject faults during load, not in a quiet environment, to see realistic blast radius.
Automate rollback so the test itself proves the recovery path works.
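
To make the first takeaway concrete, one simple way to flag a leak from soak data is to fit a straight line to heap samples and look at the slope. The sketch below uses an ordinary least-squares fit; the function names and the 1 MB/hour tolerance are assumptions for illustration, not values from the scenario.

```python
from typing import Sequence


def heap_slope_mb_per_hour(hours: Sequence[float], heap_mb: Sequence[float]) -> float:
    """Ordinary least-squares slope of heap usage against elapsed time."""
    n = len(hours)
    if n < 2 or n != len(heap_mb):
        raise ValueError("need at least two paired samples")
    mean_t = sum(hours) / n
    mean_h = sum(heap_mb) / n
    cov = sum((t - mean_t) * (h - mean_h) for t, h in zip(hours, heap_mb))
    var = sum((t - mean_t) ** 2 for t in hours)
    if var == 0:
        raise ValueError("all samples share one timestamp")
    return cov / var


def looks_like_leak(hours: Sequence[float], heap_mb: Sequence[float],
                    max_slope: float = 1.0) -> bool:  # assumed tolerance in MB/hour
    """Flag a leak when the heap grows faster than the tolerated slope."""
    return heap_slope_mb_per_hour(hours, heap_mb) > max_slope
```

A heap that grows 2.5 MB per hour adds only about 1.25 MB during a 30-minute run, which is easy to mistake for GC noise, but 120 MB over 48 hours, which is hard to miss.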
