End-to-End QA Scenario

Kubernetes Test Runner on EKS with Results in RDS

An EKS-native test platform: jobs are queued in SQS, KEDA creates one Kubernetes Job per message to run a Playwright, k6 or Newman pod, raw artifacts land in S3, and structured results are written to RDS PostgreSQL for dashboards and trend analysis.

Architecture

EventBridge / API ─► SQS (test-jobs)
                              │
                  KEDA scaler on EKS reads queue depth
                              │
                  K8s Job per message (Playwright / k6 / Newman pod)
                  ├─ initContainer: pull config from Secrets Manager (DB creds, base URL)
                  ├─ main: runs suite, writes JUnit + traces to /artifacts
                  └─ sidecar: results-reporter
                         ├─► S3 (raw artifacts: junit.xml, video, trace.zip)
                         └─► RDS PostgreSQL
                                ├─ test_runs(run_id, suite, commit, started_at, status, p95_ms)
                                ├─ test_cases(run_id, case_name, status, duration_ms, error)
                                └─ artifacts(run_id, s3_uri, kind)
                  TTL-after-finished controller deletes completed Jobs and their pods after 1h
                              │
                  CloudWatch Container Insights ─► pod CPU/mem, OOM alerts
                  FIS (optional) ─► inject node drains to validate runner resilience

Workflow steps

1. Package runners
Build Playwright, k6 and Newman runner images; push to ECR with scan-on-push. Each image embeds the results-reporter sidecar binary.
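2. Provision DB schema
Run a Flyway or Liquibase migration against RDS to create the `test_runs`, `test_cases` and `artifacts` tables, with indexes on `(suite, started_at)` and `(run_id)` and a unique constraint on `(run_id, case_name)` so that result inserts can be idempotent.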
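As a rough illustration of what such a migration might contain, the sketch below holds the DDL in a Python string and smoke-tests it against an in-memory SQLite database. Only the table and column names come from the architecture above; the column types, the index names, the foreign keys and the use of SQLite as a stand-in for PostgreSQL are assumptions made for the example.

```python
import sqlite3

# Portable DDL that parses in both PostgreSQL and SQLite.
# Types, foreign keys and index names are assumptions for illustration.
SCHEMA_DDL = """
CREATE TABLE test_runs (
    run_id     TEXT PRIMARY KEY,
    suite      TEXT NOT NULL,
    "commit"   TEXT NOT NULL,      -- quoted because COMMIT is an SQL keyword
    started_at TIMESTAMP NOT NULL,
    status     TEXT NOT NULL,
    p95_ms     INTEGER
);
CREATE TABLE test_cases (
    run_id      TEXT NOT NULL REFERENCES test_runs (run_id),
    case_name   TEXT NOT NULL,
    status      TEXT NOT NULL,
    duration_ms INTEGER,
    error       TEXT,
    UNIQUE (run_id, case_name)     -- backs the idempotent insert
);
CREATE TABLE artifacts (
    run_id TEXT NOT NULL REFERENCES test_runs (run_id),
    s3_uri TEXT NOT NULL,
    kind   TEXT NOT NULL
);
CREATE INDEX idx_test_runs_suite_started ON test_runs (suite, started_at);
CREATE INDEX idx_test_cases_run ON test_cases (run_id);
CREATE INDEX idx_artifacts_run ON artifacts (run_id);
"""


def apply_schema(conn):
    """Run the DDL and return the names of the tables that now exist."""
    conn.executescript(SCHEMA_DDL)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


tables = apply_schema(sqlite3.connect(":memory:"))
print(tables)
```

In a real pipeline the same statements would normally live in a versioned Flyway or Liquibase migration file rather than in application code.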
3. Enqueue work
CI (CodePipeline) or an EventBridge schedule sends a JSON job message to SQS: `{ runId, suite, shard, commit, targetUrl }`.
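A minimal sketch of a producer for that message is shown below. The field names follow the JSON shape above; the `JobMessage` class, the helper names, the integer type for `shard` and the example values are hypothetical. The `enqueue` helper assumes the third-party `boto3` package and valid AWS credentials, so it is defined here but not called.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobMessage:
    """One unit of work; field names mirror the JSON job shape."""
    runId: str
    suite: str
    shard: int      # assumption: shards are numbered from 0
    commit: str
    targetUrl: str


def build_job_message(job):
    """Serialize the job to the JSON body placed on the queue."""
    return json.dumps(asdict(job), sort_keys=True)


def enqueue(job, queue_url):
    """Send the job to SQS and return the message ID (requires boto3)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    sqs = boto3.client("sqs")
    response = sqs.send_message(
        QueueUrl=queue_url, MessageBody=build_job_message(job)
    )
    return response["MessageId"]


# Hypothetical example values.
body = build_job_message(
    JobMessage("r-1", "checkout", 0, "abc123", "https://staging.example.test")
)
print(body)
```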
4. Scale on demand
KEDA `ScaledJob` on EKS polls the SQS queue length and creates one Kubernetes Job per message, capped by `maxReplicaCount`. Pods run on Fargate or a managed nodegroup; the TTL-after-finished controller deletes completed Jobs and their pods after 1h (`ttlSecondsAfterFinished: 3600`).
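The shape of such a `ScaledJob` can be sketched as a Python dictionary that serializes to a manifest. This is an illustration under stated assumptions, not a complete resource: the resource name, image, queue URL, polling interval and replica cap are placeholders, and the authentication KEDA needs to read the queue (for example a `TriggerAuthentication` or an IAM role for the operator) is left out.

```python
import json


def scaled_job_manifest(queue_url, region, image, max_replicas=20):
    """Build a KEDA ScaledJob manifest as a plain dict (illustrative values)."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledJob",
        "metadata": {"name": "qa-test-runner"},  # hypothetical name
        "spec": {
            "pollingInterval": 15,               # seconds between queue checks
            "maxReplicaCount": max_replicas,     # cap on concurrent Jobs
            "jobTargetRef": {
                "backoffLimit": 1,
                "ttlSecondsAfterFinished": 3600,  # clean up after 1h
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [{"name": "runner", "image": image}],
                    }
                },
            },
            "triggers": [
                {
                    "type": "aws-sqs-queue",
                    "metadata": {
                        "queueURL": queue_url,
                        "queueLength": "1",      # target: one message per Job
                        "awsRegion": region,
                    },
                }
            ],
        },
    }


# Placeholder account, queue and image names.
manifest = scaled_job_manifest(
    "https://sqs.us-east-1.amazonaws.com/123456789012/test-jobs",
    "us-east-1",
    "example/playwright-runner:latest",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.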
5. Run + report
The main container executes the suite. The sidecar tails the JUnit output, uploads raw artifacts to S3 (`s3://qa-artifacts/{runId}/...`) and inserts one row per test case into RDS. It connects with IAM database authentication, so no long-lived passwords are stored.
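The reporter's parsing step might look roughly like the sketch below, which turns a JUnit XML document into `test_cases` rows. The status vocabulary (`passed`, `failed`, `skipped`), the `classname.name` naming scheme and the sample XML are assumptions. The `iam_auth_token` helper shows how a short-lived IAM database token could be requested with `boto3`; it needs AWS credentials and is not called here.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text, run_id):
    """Return one (run_id, case_name, status, duration_ms, error) row per case."""
    rows = []
    for case in ET.fromstring(xml_text).iter("testcase"):
        # Compare with None explicitly: an Element without children is falsy.
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is not None:
            status = "failed"
            error = problem.get("message") or (problem.text or "").strip()
        elif case.find("skipped") is not None:
            status, error = "skipped", None
        else:
            status, error = "passed", None
        classname, name = case.get("classname"), case.get("name", "")
        case_name = f"{classname}.{name}" if classname else name
        duration_ms = int(round(float(case.get("time", "0")) * 1000))
        rows.append((run_id, case_name, status, duration_ms, error))
    return rows


def iam_auth_token(host, port, user, region):
    """Request a short-lived RDS IAM auth token to use as the DB password."""
    import boto3  # requires boto3 and AWS credentials

    rds = boto3.client("rds", region_name=region)
    return rds.generate_db_auth_token(
        DBHostname=host, Port=port, DBUsername=user
    )


# Hypothetical JUnit output from a two-case suite.
SAMPLE = """<testsuite name="checkout" tests="2">
  <testcase classname="cart" name="adds item" time="1.25"/>
  <testcase classname="cart" name="applies coupon" time="0.5">
    <failure message="expected 10, got 12"/>
  </testcase>
</testsuite>"""

rows = parse_junit(SAMPLE, "r-1")
for row in rows:
    print(row)
```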
6. Observe
CloudWatch Container Insights tracks pod CPU/memory and OOMKills; alarms page on-call when the failure rate or pod restart count breaches its threshold.
7. Query + dashboard
Grafana (or QuickSight) connects to RDS for live dashboards: pass-rate trend, top-10 flaky tests, p95 duration per suite, regressions vs. last green commit.
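Two of these panels can be sketched as plain SQL, exercised here against an in-memory SQLite database with made-up rows. The definition of "flaky" (a case that both passed and failed on the same commit), the run-level pass rate and the trimmed-down tables are assumptions for the example; the queries avoid PostgreSQL-specific functions so they run in either engine.

```python
import sqlite3

# Cases with mixed outcomes on one commit, ranked by how often that happened.
FLAKY_SQL = """
SELECT case_name, COUNT(*) AS flaky_commits
FROM (
    SELECT c.case_name AS case_name, r."commit" AS commit_sha
    FROM test_cases c
    JOIN test_runs r ON r.run_id = c.run_id
    WHERE c.status IN ('passed', 'failed')
    GROUP BY c.case_name, r."commit"
    HAVING COUNT(DISTINCT c.status) = 2
) AS mixed
GROUP BY case_name
ORDER BY flaky_commits DESC, case_name
LIMIT 10
"""

# Share of runs per suite that finished with status 'passed'.
PASS_RATE_SQL = """
SELECT suite,
       AVG(CASE WHEN status = 'passed' THEN 1.0 ELSE 0.0 END) AS pass_rate
FROM test_runs
GROUP BY suite
ORDER BY suite
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test_runs (run_id TEXT, suite TEXT, "commit" TEXT, status TEXT);
CREATE TABLE test_cases (run_id TEXT, case_name TEXT, status TEXT);
""")
# Made-up sample data: r1 and r2 ran the same commit with different outcomes.
conn.executemany("INSERT INTO test_runs VALUES (?, ?, ?, ?)", [
    ("r1", "checkout", "a1", "passed"),
    ("r2", "checkout", "a1", "failed"),
    ("r3", "checkout", "b2", "passed"),
    ("r4", "search", "b2", "passed"),
])
conn.executemany("INSERT INTO test_cases VALUES (?, ?, ?)", [
    ("r1", "cart.adds item", "passed"),
    ("r2", "cart.adds item", "failed"),
    ("r1", "cart.coupon", "passed"),
    ("r2", "cart.coupon", "passed"),
    ("r3", "cart.adds item", "passed"),
    ("r4", "search.query", "passed"),
])

flaky = conn.execute(FLAKY_SQL).fetchall()
pass_rates = dict(conn.execute(PASS_RATE_SQL).fetchall())
print(flaky)
print(pass_rates)
```

A trend panel would additionally group by a time bucket of `started_at`, which is where engine-specific date functions come in.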
8. Chaos validation
An AWS FIS experiment drains a nodegroup mid-run; verify that Jobs reschedule, that the SQS visibility timeout protects in-flight work, and that no duplicate result rows appear (idempotent insert on `(run_id, case_name)`).
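The idempotent insert can be illustrated with an `ON CONFLICT ... DO NOTHING` clause, a syntax shared by PostgreSQL and SQLite. The sketch below uses an in-memory SQLite table as a stand-in for RDS and replays the same batch twice, as a rescheduled pod would; the helper name and sample rows are hypothetical, and a PostgreSQL driver would typically use `%s` placeholders instead of `?`.

```python
import sqlite3

# A repeated (run_id, case_name) pair is silently skipped instead of failing
# or creating a duplicate row.
INSERT_CASE_SQL = """
INSERT INTO test_cases (run_id, case_name, status, duration_ms, error)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (run_id, case_name) DO NOTHING
"""


def record_cases(conn, rows):
    """Insert rows idempotently and return the table's total row count."""
    with conn:  # commit on success, roll back on error
        conn.executemany(INSERT_CASE_SQL, rows)
    return conn.execute("SELECT COUNT(*) FROM test_cases").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE test_cases (
    run_id      TEXT NOT NULL,
    case_name   TEXT NOT NULL,
    status      TEXT NOT NULL,
    duration_ms INTEGER,
    error       TEXT,
    UNIQUE (run_id, case_name)
)
""")

batch = [
    ("r-1", "cart.adds item", "passed", 1250, None),
    ("r-1", "cart.applies coupon", "failed", 500, "expected 10, got 12"),
]
first = record_cases(conn, batch)   # original pod reports its results
second = record_cases(conn, batch)  # rescheduled pod replays the same batch
print(first, second)
```

`DO NOTHING` keeps the first result that was written; if the retried attempt should win instead, `DO UPDATE SET ...` is the alternative.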

Key takeaways

Kubernetes Jobs + KEDA give per-test isolation and scale-to-zero between runs.
Splitting raw artifacts (S3) from structured results (RDS) keeps the DB cheap and queryable.
Idempotent inserts keyed on `(run_id, case_name)` survive pod restarts and node drains.
Container Insights + FIS turn the test platform itself into something you can SLO and stress-test.
