End-to-End QA Scenario

Soak + Stress Testing with FIS Resilience Drills

A multi-day (48-hour) soak test exposes memory leaks while FIS injects pod kills, CPU saturation, an AZ loss, and added dependency latency; an SLO burn triggers an automatic rollback.

Architecture

k6 sustained 60% capacity for 48h ─► EKS services
           │
           ├─ hour 6   : FIS — kill 25% pods in svc-A
           ├─ hour 12  : FIS — saturate CPU on RDS reader
           ├─ hour 24  : FIS — simulate AZ-b loss
           └─ hour 36  : FIS — add 200ms p99 latency to dep-X
CloudWatch composite alarm on SLO burn ─► SNS ─► PagerDuty + auto-revert helm release
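
The fault timeline in the diagram can be modeled as plain data, which helps when lining up metrics against fault windows later. The following is a minimal sketch, not an FIS experiment template: the `ScheduledFault` type and both helper functions are hypothetical names introduced for illustration, and only the hour offsets and fault descriptions come from the diagram.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScheduledFault:
    """One fault from the diagram (hypothetical helper type)."""
    offset_hours: int   # hours after the soak test starts
    description: str


# Offsets and descriptions mirror the architecture diagram.
FAULTS = [
    ScheduledFault(6, "kill 25% of pods in svc-A"),
    ScheduledFault(12, "saturate CPU on RDS reader"),
    ScheduledFault(24, "simulate AZ-b loss"),
    ScheduledFault(36, "add 200ms p99 latency to dep-X"),
]


def fault_times(soak_start):
    """Return (fire_time, fault) pairs in chronological order."""
    ordered = sorted(FAULTS, key=lambda f: f.offset_hours)
    return [(soak_start + timedelta(hours=f.offset_hours), f) for f in ordered]


def last_fault_before(elapsed_hours):
    """Return the most recent fault fired at or before elapsed_hours, or None."""
    fired = [f for f in FAULTS if f.offset_hours <= elapsed_hours]
    return max(fired, key=lambda f: f.offset_hours) if fired else None


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary example start
    for when, fault in fault_times(start):
        print(when.isoformat(), fault.description)
```

Keeping the schedule in one place means the same offsets can drive both the fault injection and the later correlation of SLO dips with the fault that caused them.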

Workflow steps

  1. Baseline soak

    k6 sustains 60% of peak capacity for 48 hours; CloudWatch tracks heap usage, GC pauses, and RDS connections.

  2. Scripted faults

    An FIS experiment template fires four faults at preset times (hours 6, 12, 24, and 36) to test resilience under load.

  3. Verify SLO

    Composite alarm on success-rate + p95 burn-rate determines if the system survived each fault window.

  4. Auto-revert

    On SLO burn, SNS pages on-call through PagerDuty and triggers a Lambda that runs `helm rollback`.

  5. Post-mortem

    An Athena report combines k6 results, X-Ray traces, and the fault-event log into a single timeline for the review.
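
The SLO check and the revert decision in the steps above can be sketched as follows. This is an illustration under stated assumptions rather than the scenario's actual Lambda: the burn-rate definition (observed bad-event rate divided by the error budget) is the common convention, the choice to treat either burning SLI as a failure is an assumption, and the threshold, release name, and namespace in the example are hypothetical. The SNS envelope shape (`Records[0].Sns.Message` carrying a JSON alarm body with `NewStateValue`) is what Lambda receives for CloudWatch alarm notifications. The sketch only builds the `helm rollback` command; running it would also require the Helm binary and cluster credentials inside the function.

```python
import json


def burn_rate(bad_events, total_events, slo_target):
    """Observed bad-event rate divided by the error budget (1 - slo_target)."""
    if total_events == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (bad_events / total_events) / budget


def slo_burning(success_burn, latency_burn, threshold):
    """Assumed composite rule: either SLI burning above threshold counts."""
    return success_burn > threshold or latency_burn > threshold


def rollback_command(sns_event, release, namespace):
    """Return the helm rollback argv if the alarm is in ALARM, else None."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        return None
    # With no revision argument, helm rolls back to the previous revision.
    return ["helm", "rollback", release, "--namespace", namespace, "--wait"]


if __name__ == "__main__":
    # Hypothetical numbers: 50 failed of 10,000 requests against a 99.9% SLO.
    success_burn = burn_rate(50, 10_000, 0.999)
    # Hypothetical: 300 of 10,000 requests slower than the p95 latency target.
    latency_burn = burn_rate(300, 10_000, 0.95)
    print(slo_burning(success_burn, latency_burn, threshold=2.0))

    event = {"Records": [{"Sns": {"Message": json.dumps(
        {"AlarmName": "slo-burn", "NewStateValue": "ALARM"})}}]}
    print(rollback_command(event, release="svc-a", namespace="prod"))
```

Separating the decision (`rollback_command`) from the execution keeps the revert logic testable without a cluster.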

Key takeaways

  • Short load tests miss leaks — soak finds what a 30-minute run can't.
  • Inject faults during load, not in a quiet environment, to see realistic blast radius.
  • Automate rollback so the test itself proves the recovery path works.
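
To make the first takeaway concrete, a slow leak can be estimated by fitting a line through heap samples taken during the soak. The sketch below uses an ordinary least-squares slope; the function names and the sample values (a 512 MB baseline growing 2 MB per hour) are hypothetical and chosen only to illustrate the arithmetic.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hour, heap_mb) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def projected_growth_mb(samples, horizon_hours):
    """Heap growth implied by the fitted slope over horizon_hours."""
    return slope_per_hour(samples) * horizon_hours


if __name__ == "__main__":
    # Hypothetical heap readings every 6 hours across a 48-hour soak.
    samples = [(h, 512 + 2 * h) for h in range(0, 49, 6)]
    print(projected_growth_mb(samples, 0.5))   # growth over a 30-minute run
    print(projected_growth_mb(samples, 48))    # growth over the full soak
```

At 2 MB per hour, a 30-minute run would see roughly 1 MB of growth, which is easy to lose in normal GC noise, while the 48-hour soak accumulates roughly 96 MB.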