End-to-End QA Scenario

Kubernetes Test Runner on EKS with Results in RDS

An EKS-native test platform: SQS queues test jobs, KEDA creates one Kubernetes Job per message to run Playwright, k6, or Newman pods, raw artifacts land in S3, and structured results stream into RDS PostgreSQL for dashboards and trend analysis.

Architecture

EventBridge / API ─► SQS (test-jobs)
                              │
                  KEDA scaler on EKS reads queue depth
                              │
                  K8s Job per message (Playwright / k6 / Newman pod)
                  ├─ initContainer: pull config from Secrets Manager (DB creds, base URL)
                  ├─ main: runs suite, writes JUnit + traces to /artifacts
                  └─ sidecar: results-reporter
                         ├─► S3 (raw artifacts: junit.xml, video, trace.zip)
                         └─► RDS PostgreSQL
                                ├─ test_runs(run_id, suite, commit, started_at, status, p95_ms)
                                ├─ test_cases(run_id, name, status, duration_ms, error)
                                └─ artifacts(run_id, s3_uri, kind)
                  K8s TTL-after-finished controller deletes finished Jobs and their pods after 1h
                              │
                  CloudWatch Container Insights ─► pod CPU/mem, OOM alerts
                  FIS (optional) ─► inject node drains to validate runner resilience

Workflow steps

  1. Package runners

    Build Playwright, k6 and Newman runner images; push to ECR with scan-on-push. Each image embeds the results-reporter sidecar binary.

  2. Provision DB schema

    Run Flyway/Liquibase migration on RDS to create `test_runs`, `test_cases`, `artifacts` tables with indexes on `(suite, started_at)` and `(run_id)`.
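A minimal sketch of the schema this migration might create, wrapped in Python so the DDL can be smoke-tested against an in-memory SQLite database before it is ported to a Flyway or Liquibase changeset. Only the table names, column names, and indexed columns come from this scenario. The column types, primary keys, foreign keys, and index names are assumptions.

```python
import sqlite3

# Assumed DDL: types, keys and index names are illustrative, not from the scenario.
# "commit" is quoted because COMMIT is an SQL keyword.
SCHEMA_DDL = """
CREATE TABLE IF NOT EXISTS test_runs (
    run_id     TEXT PRIMARY KEY,
    suite      TEXT NOT NULL,
    "commit"   TEXT NOT NULL,
    started_at TIMESTAMP NOT NULL,
    status     TEXT NOT NULL,
    p95_ms     INTEGER
);
CREATE TABLE IF NOT EXISTS test_cases (
    run_id      TEXT NOT NULL REFERENCES test_runs (run_id),
    name        TEXT NOT NULL,
    status      TEXT NOT NULL,
    duration_ms INTEGER,
    error       TEXT,
    PRIMARY KEY (run_id, name)  -- also serves lookups by run_id
);
CREATE TABLE IF NOT EXISTS artifacts (
    run_id TEXT NOT NULL REFERENCES test_runs (run_id),
    s3_uri TEXT NOT NULL,
    kind   TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_test_runs_suite_started
    ON test_runs (suite, started_at);
CREATE INDEX IF NOT EXISTS idx_artifacts_run
    ON artifacts (run_id);
"""


def apply_schema(conn: sqlite3.Connection) -> list[str]:
    """Create the tables (idempotently) and return their names, sorted."""
    conn.executescript(SCHEMA_DDL)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

The composite primary key on `test_cases` is what later makes an idempotent insert possible. On PostgreSQL, `TIMESTAMPTZ` is likely a better fit for `started_at` than the portable `TIMESTAMP` used here.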

  3. Enqueue work

    CI (CodePipeline) or EventBridge schedule pushes a JSON job to SQS: `{ runId, suite, shard, commit, targetUrl }`.
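As a rough illustration of the producer side, the sketch below builds the job payload with the field names from this step and hands it to an SQS client. The function names, the integer `shard` type, and the URL check are assumptions. The client is passed in so the code runs without AWS credentials, and it is expected to behave like `boto3.client("sqs")`.

```python
import json
from typing import Any


def build_job_message(
    run_id: str, suite: str, shard: int, commit: str, target_url: str
) -> str:
    """Serialize one test job using the field names from the scenario."""
    # Hypothetical guard: reject obviously malformed targets before queueing.
    if not target_url.startswith(("http://", "https://")):
        raise ValueError(f"targetUrl must be an http(s) URL, got {target_url!r}")
    job = {
        "runId": run_id,
        "suite": suite,
        "shard": shard,
        "commit": commit,
        "targetUrl": target_url,
    }
    return json.dumps(job, separators=(",", ":"))


def enqueue(sqs_client: Any, queue_url: str, body: str) -> str:
    """Send the job and return the SQS message ID."""
    response = sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]
```

Validating the payload at the producer keeps malformed jobs out of the queue, where they would otherwise fail only after a pod has been scheduled.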

  4. Scale on demand

    KEDA `ScaledJob` on EKS watches SQS queue length and creates one Kubernetes Job per message, capped by `maxReplicaCount`. Pods run on Fargate or a managed nodegroup; the TTL-after-finished controller deletes finished Jobs and their pods after 1h.
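To make the scaling step concrete, here is a hedged sketch that builds a `ScaledJob` manifest as a Python dictionary. The field names follow my reading of the KEDA `ScaledJob` and `aws-sqs-queue` scaler documentation and should be checked against the KEDA version you deploy. The resource name, polling interval, container name, and default cap are assumptions, and trigger authentication (for example a `TriggerAuthentication` backed by an IAM role) is deliberately left out.

```python
import json


def scaled_job_manifest(
    queue_url: str, region: str, image: str, max_jobs: int = 20
) -> dict:
    """Build a KEDA ScaledJob that creates one Job per queued message."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledJob",
        "metadata": {"name": "test-runner"},  # hypothetical name
        "spec": {
            "pollingInterval": 15,  # seconds between queue checks (assumed)
            "maxReplicaCount": max_jobs,  # upper bound on concurrent Jobs
            "jobTargetRef": {
                # Finished Jobs and their pods are removed after 1h.
                "ttlSecondsAfterFinished": 3600,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [{"name": "runner", "image": image}],
                    }
                },
            },
            "triggers": [
                {
                    "type": "aws-sqs-queue",
                    # Scaler metadata values are strings.
                    "metadata": {
                        "queueURL": queue_url,
                        "queueLength": "1",  # target of one message per Job
                        "awsRegion": region,
                    },
                }
            ],
        },
    }
```

Because `kubectl apply -f -` accepts JSON as well as YAML, the output of `json.dumps(scaled_job_manifest(...), indent=2)` can be piped straight into it.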

  5. Run + report

    The main container executes the suite. The sidecar tails the JUnit output, uploads raw artifacts to S3 (`s3://qa-artifacts/{runId}/...`), and inserts one row per case into RDS using IAM DB auth, so no long-lived database passwords are stored.
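The reporter's core loop could look roughly like the sketch below: parse the JUnit XML into one dictionary per case, then upsert those rows. It is an illustration only. The status values (`passed`, `failed`, `skipped`), the `classname.name` naming, and the function names are assumptions. The conflict key uses the `name` column from the schema, which the scenario elsewhere writes as `case_name`. SQLite stands in for PostgreSQL so the code runs locally; a PostgreSQL driver such as psycopg would need `%(run_id)s`-style placeholders instead of `:run_id`.

```python
import sqlite3
import xml.etree.ElementTree as ET

UPSERT_SQL = """
INSERT INTO test_cases (run_id, name, status, duration_ms, error)
VALUES (:run_id, :name, :status, :duration_ms, :error)
ON CONFLICT (run_id, name) DO UPDATE SET
    status = excluded.status,
    duration_ms = excluded.duration_ms,
    error = excluded.error
"""


def parse_junit(xml_text: str) -> list[dict]:
    """Turn JUnit XML into one dict per <testcase>."""
    cases = []
    for case in ET.fromstring(xml_text).iter("testcase"):
        # Compare against None: an Element without children is falsy.
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is not None:
            status = "failed"
            error = problem.get("message") or (problem.text or "").strip()
        elif case.find("skipped") is not None:
            status, error = "skipped", None
        else:
            status, error = "passed", None
        classname = case.get("classname", "")
        name = case.get("name", "")
        cases.append(
            {
                "name": f"{classname}.{name}" if classname else name,
                "status": status,
                # JUnit reports seconds; the schema stores milliseconds.
                "duration_ms": round(float(case.get("time", "0")) * 1000),
                "error": error,
            }
        )
    return cases


def record_cases(conn: sqlite3.Connection, run_id: str, cases: list[dict]) -> None:
    """Upsert all cases in one transaction; safe to call twice for a run."""
    with conn:  # commits on success, rolls back on exception
        conn.executemany(UPSERT_SQL, [{"run_id": run_id, **c} for c in cases])
```

Calling `record_cases` a second time for the same run updates the existing rows instead of adding new ones, which is the behaviour a restarted sidecar needs.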

  6. Observe

    CloudWatch Container Insights tracks pod CPU/memory and OOMKills; alarms page the on-call engineer when the failure rate or pod restart count breaches its threshold.

  7. Query + dashboard

    Grafana (or QuickSight) connects to RDS for live dashboards: pass-rate trend, top-10 flaky tests, p95 duration per suite, regressions vs. last green commit.
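One way the "top-10 flaky tests" panel could be backed is sketched below. It assumes a particular definition of flaky, which the scenario does not state: a test that both passed and failed on the same commit. The status strings and function name are also assumptions. The query is written to run on SQLite for local testing and should port to PostgreSQL with only the placeholder syntax changed.

```python
import sqlite3

# Inner query: (test, commit) pairs that saw both outcomes.
# Outer query: rank tests by how many commits they flaked on.
TOP_FLAKY_SQL = """
SELECT name, COUNT(*) AS flaky_commits
FROM (
    SELECT c.name AS name, r."commit" AS commit_sha
    FROM test_cases AS c
    JOIN test_runs AS r ON r.run_id = c.run_id
    WHERE c.status IN ('passed', 'failed')
    GROUP BY c.name, r."commit"
    HAVING COUNT(DISTINCT c.status) = 2
) AS per_commit
GROUP BY name
ORDER BY flaky_commits DESC, name
LIMIT ?
"""


def top_flaky(conn: sqlite3.Connection, limit: int = 10) -> list[tuple[str, int]]:
    """Return (test name, number of commits it flaked on), worst first."""
    return conn.execute(TOP_FLAKY_SQL, (limit,)).fetchall()
```

Grouping by commit separates genuine flakiness from regressions: a test that fails consistently after a code change never shows both outcomes on one commit, so it stays out of this list.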

  8. Chaos validation

    AWS FIS experiment drains a nodegroup mid-run; verify Jobs reschedule, SQS visibility timeout protects in-flight work, and no duplicate result rows appear (idempotent insert on `(run_id, case_name)`).
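The visibility-timeout protection depends on the consumer deleting a message only after its work has been committed. The sketch below shows that pattern under the assumption that the runner pod itself receives the message. The function name, the `handle` callback, and the timeout values are hypothetical; the client is expected to behave like `boto3.client("sqs")`.

```python
from typing import Any, Callable


def process_one(
    sqs_client: Any,
    queue_url: str,
    handle: Callable[[str], None],
    visibility_timeout: int = 900,
) -> bool:
    """Receive one job, run it, and delete the message only on success.

    Returns True if a message was handled and deleted, False if the
    queue was empty. If handle() raises (or the pod is killed), the
    message is never deleted and reappears after the visibility timeout.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=10,  # long polling
        VisibilityTimeout=visibility_timeout,  # should exceed the suite runtime
    )
    messages = response.get("Messages", [])
    if not messages:
        return False
    message = messages[0]
    handle(message["Body"])  # run the suite and write results
    sqs_client.delete_message(
        QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
    )
    return True
```

Redelivery after a drain means the same job may run twice, which is why the result insert has to be idempotent: the second run overwrites rows rather than duplicating them.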

Key takeaways

  • Kubernetes Jobs + KEDA run each queued job in its own isolated pod and scale to zero between runs.
  • Splitting raw artifacts (S3) from structured results (RDS) keeps the DB cheap and queryable.
  • Idempotent inserts keyed on `(run_id, case_name)` survive pod restarts and node drains.
  • Container Insights + FIS turn the test platform itself into something you can SLO and stress-test.