End-to-End QA Scenario
Kubernetes Test Runner on EKS with Results in RDS
An EKS-native test platform: SQS dispatches jobs, the Kubernetes Job controller spins up Playwright, k6, or Newman pods, raw artifacts land in S3, and structured results stream into RDS PostgreSQL for dashboards and trend analysis.
Architecture
EventBridge / API ─► SQS (test-jobs)
        │
KEDA scaler on EKS reads queue depth
        │
K8s Job per message (Playwright / k6 / Newman pod)
  ├─ initContainer: pull config from Secrets Manager (DB creds, base URL)
  ├─ main: runs suite, writes JUnit + traces to /artifacts
  └─ sidecar: results-reporter
       ├─► S3 (raw artifacts: junit.xml, video, trace.zip)
       └─► RDS PostgreSQL
            ├─ test_runs(run_id, suite, commit, started_at, status, p95_ms)
            ├─ test_cases(run_id, name, status, duration_ms, error)
            └─ artifacts(run_id, s3_uri, kind)
K8s TTL-after-finished controller deletes completed Jobs and their pods after 1h
        │
CloudWatch Container Insights ─► pod CPU/mem, OOM alerts
FIS (optional) ─► inject node drains to validate runner resilience
Workflow steps
1. Package runners
Build Playwright, k6 and Newman runner images; push to ECR with scan-on-push. Each image embeds the results-reporter sidecar binary.
2. Provision DB schema
Run a Flyway or Liquibase migration against RDS to create the `test_runs`, `test_cases`, and `artifacts` tables, with indexes on `(suite, started_at)` and `(run_id)`.
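The migration described above might look roughly like the following sketch. It is not the actual Flyway or Liquibase script: the column types, the `UNIQUE (run_id, name)` constraint, and the index names are assumptions, and Python's built-in `sqlite3` stands in for PostgreSQL so the example runs anywhere.

```python
import sqlite3

# DDL mirroring the three tables from the architecture sketch.
# "commit" is quoted because COMMIT is an SQL keyword.
# Column types are assumptions; PostgreSQL would use TIMESTAMPTZ for started_at.
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_runs (
    run_id      TEXT PRIMARY KEY,
    suite       TEXT NOT NULL,
    "commit"    TEXT NOT NULL,
    started_at  TEXT NOT NULL,
    status      TEXT NOT NULL,
    p95_ms      INTEGER
);
CREATE TABLE IF NOT EXISTS test_cases (
    run_id      TEXT NOT NULL REFERENCES test_runs(run_id),
    name        TEXT NOT NULL,
    status      TEXT NOT NULL,
    duration_ms INTEGER,
    error       TEXT,
    -- Assumed constraint: backs idempotent inserts and indexes run_id.
    UNIQUE (run_id, name)
);
CREATE TABLE IF NOT EXISTS artifacts (
    run_id      TEXT NOT NULL REFERENCES test_runs(run_id),
    s3_uri      TEXT NOT NULL,
    kind        TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_runs_suite_started ON test_runs (suite, started_at);
CREATE INDEX IF NOT EXISTS idx_artifacts_run ON artifacts (run_id);
"""


def apply_schema(conn):
    """Create tables and indexes; safe to run more than once."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
apply_schema(conn)
```

The `IF NOT EXISTS` clauses keep the script re-runnable, which matters less under a versioned migration tool but is convenient for local testing.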
3. Enqueue work
CI (CodePipeline) or EventBridge schedule pushes a JSON job to SQS: `{ runId, suite, shard, commit, targetUrl }`.
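A minimal sketch of the producer side, assuming the job shape shown above. The helper names `build_job_message` and `enqueue` are hypothetical; `sqs_client` is expected to be a `boto3.client("sqs")`, whose `send_message(QueueUrl=..., MessageBody=...)` call is the only AWS API used.

```python
import json

REQUIRED_FIELDS = ("runId", "suite", "shard", "commit", "targetUrl")


def build_job_message(run_id, suite, shard, commit, target_url):
    """Serialize one test job in the shape { runId, suite, shard, commit, targetUrl }."""
    job = {
        "runId": run_id,
        "suite": suite,
        "shard": shard,
        "commit": commit,
        "targetUrl": target_url,
    }
    # Reject empty values early so a malformed job never reaches a runner pod.
    missing = [key for key in REQUIRED_FIELDS if job[key] is None or job[key] == ""]
    if missing:
        raise ValueError(f"missing job fields: {missing}")
    return json.dumps(job)


def enqueue(sqs_client, queue_url, body):
    """Send the job to SQS and return the message ID from the response."""
    response = sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]
```

Passing the client in as an argument keeps the function testable with a stub and leaves credential handling to the caller.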
4. Scale on demand
KEDA `ScaledJob` on EKS watches SQS queue length and creates one Kubernetes Job per message, capped by `maxReplicaCount`. Pods run on Fargate or a managed nodegroup; the TTL-after-finished controller deletes completed Jobs and their pods after 1h (`ttlSecondsAfterFinished: 3600`).
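The scaling setup above could be expressed as a KEDA `ScaledJob` manifest along these lines. This is a sketch built as a Python dict and printed as JSON (which `kubectl apply -f` accepts); the resource name `test-runner`, the polling interval, and the default cap of 20 are assumptions, and authentication (for example a `TriggerAuthentication` or IAM role for the KEDA operator) is omitted.

```python
import json


def scaled_job_manifest(queue_url, region, image, max_jobs=20, ttl_seconds=3600):
    """Build a KEDA ScaledJob that starts one Job per queued SQS message."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledJob",
        "metadata": {"name": "test-runner"},  # hypothetical name
        "spec": {
            "pollingInterval": 15,        # seconds between queue checks (assumed)
            "maxReplicaCount": max_jobs,  # upper bound on concurrent Jobs
            "jobTargetRef": {
                # Job-level TTL: the finished Job and its pods are deleted together.
                "ttlSecondsAfterFinished": ttl_seconds,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [{"name": "runner", "image": image}],
                    }
                },
            },
            "triggers": [
                {
                    "type": "aws-sqs-queue",
                    "metadata": {
                        "queueURL": queue_url,
                        "queueLength": "1",  # target messages per Job
                        "awsRegion": region,
                    },
                }
            ],
        },
    }


manifest = scaled_job_manifest(
    "https://sqs.us-east-1.amazonaws.com/123456789012/test-jobs",  # placeholder URL
    "us-east-1",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/playwright-runner:latest",  # placeholder image
)
print(json.dumps(manifest, indent=2))
```

Setting `queueLength` to `"1"` is what yields roughly one Job per message; a larger value would let each Job drain several messages.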
5. Run + report
The main container executes the suite. The sidecar tails the JUnit output, uploads raw artifacts to S3 (`s3://qa-artifacts/{runId}/...`), and inserts one row per test case into RDS using IAM DB auth, so no long-lived passwords are stored.
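What the results-reporter sidecar does with the JUnit file can be sketched as follows. The function names are hypothetical and the status mapping (`failure` or `error` means failed) is an assumption; `rds_client` is expected to be a `boto3.client("rds")`, whose `generate_db_auth_token` call returns the short-lived token used as the database password under IAM DB auth.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text, run_id):
    """Turn JUnit XML into one dict per <testcase>, shaped like a test_cases row."""
    rows = []
    for case in ET.fromstring(xml_text).iter("testcase"):
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is not None:
            status = "failed"
            error = problem.get("message") or (problem.text or "").strip() or None
        elif case.find("skipped") is not None:
            status, error = "skipped", None
        else:
            status, error = "passed", None
        rows.append({
            "run_id": run_id,
            "name": case.get("name", ""),
            "status": status,
            # JUnit reports seconds; the table stores milliseconds.
            "duration_ms": round(float(case.get("time", "0")) * 1000),
            "error": error,
        })
    return rows


def artifact_uri(run_id, filename):
    """S3 location for a raw artifact, following s3://qa-artifacts/{runId}/..."""
    return f"s3://qa-artifacts/{run_id}/{filename}"


def db_auth_token(rds_client, host, port, user):
    """Fetch a short-lived IAM auth token to use in place of a stored password."""
    return rds_client.generate_db_auth_token(DBHostname=host, Port=port, DBUsername=user)
```

Because the token expires after a short time, the sidecar would request a fresh one each time it opens a connection rather than caching it.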
6. Observe
CloudWatch Container Insights tracks pod CPU/memory and OOMKills; alarms page on-call when the failure rate or pod restart count breaches its threshold.
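Container Insights covers pod-level metrics, but a test failure rate has to be published by the platform itself before an alarm can watch it. The sketch below shows one way to do that; the namespace `QA/TestRunner`, the metric name, and the `Suite` dimension are all assumptions, and `cloudwatch_client` is expected to be a `boto3.client("cloudwatch")`.

```python
def failure_rate(statuses):
    """Percentage of failed cases among those that passed or failed (skips ignored)."""
    counted = [s for s in statuses if s in ("passed", "failed")]
    if not counted:
        return 0.0
    return 100.0 * counted.count("failed") / len(counted)


def publish_failure_rate(cloudwatch_client, suite, statuses):
    """Publish the run's failure rate as a custom CloudWatch metric and return it."""
    rate = failure_rate(statuses)
    cloudwatch_client.put_metric_data(
        Namespace="QA/TestRunner",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "FailureRatePercent",  # hypothetical metric name
                "Dimensions": [{"Name": "Suite", "Value": suite}],
                "Value": rate,
                "Unit": "Percent",
            }
        ],
    )
    return rate
```

A CloudWatch alarm on this metric can then page on-call in the same way as the pod restart alarm.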
7. Query + dashboard
Grafana (or QuickSight) connects to RDS for live dashboards: pass-rate trend, top-10 flaky tests, p95 duration per suite, regressions vs. last green commit.
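As an illustration of the kind of query behind the "top-10 flaky tests" panel, the sketch below treats a test as flaky when it both passed and failed on the same commit. That definition is an assumption, not something the source specifies. Python's `sqlite3` stands in for RDS PostgreSQL here; the query text uses only syntax that both engines accept.

```python
import sqlite3

# A case is counted once for every commit on which it has mixed results.
FLAKY_SQL = """
SELECT name, COUNT(*) AS flaky_commits
FROM (
    SELECT c.name AS name, r."commit" AS commit_sha
    FROM test_cases AS c
    JOIN test_runs AS r ON r.run_id = c.run_id
    WHERE c.status IN ('passed', 'failed')
    GROUP BY c.name, r."commit"
    HAVING COUNT(DISTINCT c.status) = 2
) AS mixed
GROUP BY name
ORDER BY flaky_commits DESC, name
LIMIT 10
"""


def top_flaky(conn):
    """Return (test name, number of commits with mixed results), worst first."""
    return conn.execute(FLAKY_SQL).fetchall()


def demo_connection():
    """In-memory database with a few sample rows (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE test_runs (run_id TEXT PRIMARY KEY, suite TEXT, "commit" TEXT);
        CREATE TABLE test_cases (run_id TEXT, name TEXT, status TEXT);
    """)
    conn.executemany(
        "INSERT INTO test_runs VALUES (?, ?, ?)",
        [("r1", "checkout", "aaa"), ("r2", "checkout", "aaa"), ("r3", "checkout", "bbb")],
    )
    conn.executemany(
        "INSERT INTO test_cases VALUES (?, ?, ?)",
        [
            ("r1", "login", "passed"), ("r2", "login", "failed"), ("r3", "login", "passed"),
            ("r1", "pay", "passed"), ("r2", "pay", "passed"), ("r3", "pay", "failed"),
        ],
    )
    return conn


print(top_flaky(demo_connection()))
```

In the sample data, `login` has mixed results on commit `aaa` and is reported, while `pay` fails only on a different commit and is treated as a possible regression rather than flakiness.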
8. Chaos validation
An AWS FIS experiment drains a nodegroup mid-run. Verify that Jobs reschedule, that unacknowledged SQS messages reappear after the visibility timeout so in-flight work is retried rather than lost, and that no duplicate result rows appear (idempotent insert on `(run_id, case_name)`).
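The idempotent insert can be sketched with an `ON CONFLICT` upsert. The example below uses Python's `sqlite3` as a stand-in for PostgreSQL, which supports the same clause; it keys on `(run_id, name)`, using the column name from the architecture sketch for what the text calls `case_name`. The helper `record_case` and the choice to overwrite on conflict (rather than ignore the replay) are assumptions.

```python
import sqlite3

# A second write for the same (run_id, name) updates the row instead of adding one.
UPSERT_SQL = """
INSERT INTO test_cases (run_id, name, status, duration_ms, error)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (run_id, name) DO UPDATE SET
    status = excluded.status,
    duration_ms = excluded.duration_ms,
    error = excluded.error
"""


def record_case(conn, run_id, name, status, duration_ms, error=None):
    """Write one test result; safe to call again after a pod restart."""
    conn.execute(UPSERT_SQL, (run_id, name, status, duration_ms, error))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test_cases (
        run_id TEXT NOT NULL,
        name TEXT NOT NULL,
        status TEXT NOT NULL,
        duration_ms INTEGER,
        error TEXT,
        UNIQUE (run_id, name)
    )
""")

# First attempt is interrupted by a node drain and reported as failed;
# the rescheduled Job reports the same case again.
record_case(conn, "r-1", "login", "failed", 900, "timeout")
record_case(conn, "r-1", "login", "passed", 450)
```

The upsert depends on the unique constraint over `(run_id, name)`; without it the database has no conflict to detect and duplicates would be inserted.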
Key takeaways
- Kubernetes Jobs + KEDA give per-test isolation and scale-to-zero between runs.
- Splitting raw artifacts (S3) from structured results (RDS) keeps the DB cheap and queryable.
- Idempotent inserts keyed on `(run_id, case_name)` survive pod restarts and node drains.
- Container Insights + FIS turn the test platform itself into something you can SLO and stress-test.
