How k8s-ops-toolkit works
A complete tour of the architecture: data flow, subsystems, technology choices, performance, and where the project is heading next.
A small Helm chart deploys your Next.js app with deployment, service, ingress (TLS via cert-manager), HPA, PDB, and a Prometheus ServiceMonitor. A bash bootstrap script installs ingress-nginx, cert-manager, kube-prometheus-stack, and Loki + Promtail with sane defaults. Three Grafana dashboards and a default Alertmanager ruleset ship with the repo. Five commands, about eight minutes, production-grade.
Core data flow
From the moment a request enters the system to the moment a response leaves it.
Internet
│
▼
ingress-nginx (LoadBalancer)
│   TLS terminated here; certs issued by cert-manager (Let's Encrypt)
▼
Service (ClusterIP)
│
▼
Deployment (Next.js pods)
│ /api/health (readiness)
│ /api/metrics (Prometheus scrape)
│
├─── HPA scales replicas on CPU / RPS
│
├─── PDB protects against simultaneous evictions
│
│
┌────┴───────────────────────────────────┐
│ Prometheus (kube-prometheus-stack) │
│ - ServiceMonitor selects pods │
│ - 15-day retention default │
│ │
│ Loki + Promtail │
│ - DaemonSet ships container logs │
│ - Indexed by namespace + app + pod │
│ │
│ Grafana │
│ - 3 pre-baked dashboards │
│ - Loki + Prometheus datasources │
│ │
│ Alertmanager │
│ - Default rules (CrashLoop, 5xx, │
│ latency, cert expiry, disk press.) │
│ - Receivers configured per cluster │
└─────────────────────────────────────────┘
Each subsystem, deep-dived
Every component in the data flow above, opened up and explained.
The Helm chart
The chart at charts/nextjs-app is intentionally readable. Six template files plus a helpers partial. Every template is short. There is no umbrella chart, no library chart. Common knobs live in values.yaml: image, replicaCount, resources, env vars, ingress host and TLS, autoscaling, PDB, ServiceMonitor.
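A values override for a typical app stays short. The sketch below is illustrative only — key names follow common chart conventions and the hostnames and image are placeholders; the chart's own values.yaml is the source of truth.

# values-prod.yaml — illustrative sketch, not the chart's literal schema
image:
  repository: registry.example.com/my-nextjs-app   # placeholder image
  tag: "1.4.2"
replicaCount: 2
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
ingress:
  enabled: true
  host: app.example.com                             # placeholder host
  tls:
    enabled: true
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
pdb:
  enabled: true
  minAvailable: 1
serviceMonitor:
  enabled: true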
The deployment template wires probes (/api/health by default, override with --set probe.path=/health), security context, and rolling-update strategy. The ingress template includes cert-manager annotations so a TLS certificate is requested automatically when you set ingress.tls.enabled=true. The ServiceMonitor template emits the CRD that kube-prometheus-stack uses to discover and scrape your pods.
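Rendered, the two fragments that matter most look roughly like this, trimmed to the relevant fields (host and secret names are placeholders):

# Deployment fragment: the probe path comes from probe.path in values
readinessProbe:
  httpGet:
    path: /api/health
    port: http
  initialDelaySeconds: 5
  periodSeconds: 10
---
# Ingress fragment when ingress.tls.enabled=true
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - app.example.com            # placeholder host
      secretName: app-example-com-tls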
The bootstrap script
scripts/install.sh takes two arguments: the email for Let’s Encrypt registration and the apex domain you control. It installs four upstream charts in order: ingress-nginx (with the LoadBalancer service that becomes your public IP), cert-manager (with a letsencrypt-prod ClusterIssuer using the email you passed), kube-prometheus-stack (Prometheus, Grafana, Alertmanager, node exporters), and Loki + Promtail.
The whole thing takes about 3 minutes on a 3-node cluster. Once it completes, you can install your Next.js app with the chart, and the certificate is issued within about 60 seconds of the DNS record resolving.
cert-manager + Let's Encrypt
cert-manager watches Ingress resources for cert-manager.io/cluster-issuer annotations. The chart sets this annotation when ingress.tls.enabled=true, pointing at the letsencrypt-prod ClusterIssuer that the bootstrap installed. The HTTP-01 challenge is solved through the same ingress, which is why the order matters: ingress-nginx first, cert-manager second, then your apps.
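The ClusterIssuer the bootstrap creates follows the canonical HTTP-01 shape; a sketch, with the email as a placeholder:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com               # the email passed to install.sh
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx                 # challenge solved through ingress-nginx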
Renewals happen automatically 30 days before expiry. The default Alertmanager ruleset includes CertManagerCertificateExpirySoon, which fires if a certificate is within 14 days of expiry — early enough to investigate before customers notice.
Prometheus + Grafana
kube-prometheus-stack does the heavy lifting. Prometheus scrapes any pod whose Service has a matching ServiceMonitor; the chart emits one for your app. Grafana ships with the standard cluster and ingress dashboards, plus a custom Next.js app dashboard that expects the conventional http_requests_total, http_request_duration_seconds, and Node.js process metrics from prom-client.
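The emitted ServiceMonitor is only a few lines. This sketch uses assumed names and labels — the chart's own selectors are the source of truth:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                            # assumed release name
  labels:
    release: kube-prometheus-stack        # so the operator's selector matches
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app      # must match the Service's labels
  endpoints:
    - port: http
      path: /api/metrics
      interval: 30s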
Default retention is 15 days. For longer retention, point Prometheus at remote storage (Mimir, Cortex, or Grafana Cloud) — the toolkit deliberately stops at the local-retention layer.
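If you do outgrow local retention, the hook is Prometheus remote write; with kube-prometheus-stack that is a values override along these lines (the endpoint is a placeholder):

# kube-prometheus-stack values override — ship samples to long-term storage
prometheus:
  prometheusSpec:
    retention: 15d
    remoteWrite:
      - url: https://mimir.example.com/api/v1/push   # placeholder endpoint
        # add basicAuth / headers as your backend requires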
Loki + Promtail
Promtail runs as a DaemonSet, tailing every container log on every node. Logs are shipped to Loki with three indexed labels: namespace, app, pod. Everything else stays in the log line; this is the cost optimisation that makes Loki cheaper than ELK at SME scale.
Querying happens inside Grafana via LogQL: {namespace="default", app="my-app"} |= "error". For request-rate views, LogQL aggregations work the way Prometheus aggregations do: sum by (status) (rate({app="my-app"}[5m])).
Alertmanager rules
The default ruleset covers what actually goes wrong: pod crash loops, ingress 5xx spikes, p99 latency regressions, certificate near-expiry, persistent volume disk pressure, node memory pressure, node not ready. Receivers (Slack, PagerDuty, email) are configured in values-alertmanager.yaml per cluster.
Adding a custom alert means editing one YAML file in manifests/prometheus-rules/ and applying it; the change is picked up and the rules reload within about 30 seconds.
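A custom rule file is an ordinary PrometheusRule object. A sketch of the shape, using the http_requests_total convention mentioned above (names and thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-custom-rules
  labels:
    release: kube-prometheus-stack        # match the stack's rule selector
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "my-app 5xx ratio above 5% for 10 minutes"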
Why this stack
The road not taken matters as much as the road taken. Here is what was picked and what was rejected, and the reasoning behind both.
Helm 3
The lingua franca for Kubernetes app distribution. Every CI/CD platform consumes Helm charts.
Kustomize — more elegant for some problems but ecosystem support is thinner. Raw yaml — unmaintainable across environments.
ingress-nginx
The most-deployed ingress in the wild. Documentation, examples, Stack Overflow are all biased toward it.
Traefik (elegant), HAProxy (fast). Both fine. nginx is what most teams actually run, so support questions resolve faster.
cert-manager + Let's Encrypt
Free TLS at scale, automatic renewal, native Kubernetes integration via CRDs.
A commercial CA — fine if you need EV certs. Most workloads do not.
kube-prometheus-stack
Bundles Prometheus, Grafana, Alertmanager, node exporters, kube-state-metrics, ServiceMonitor CRD. Self-installing all of these correctly takes two days.
A custom Prometheus install — you would re-implement this chart. We did not need to be opinionated; the upstream is correct.
Loki
Indexes labels, not log content. Roughly an order of magnitude cheaper than Elasticsearch at SME log volumes (under 100GB/day).
ELK — wins on ergonomics above 1TB/day; we are not solving that case.
No service mesh
Mesh complexity is rarely worth the cost for Next.js workloads, which are mostly HTTP-in HTTP-out.
Istio, Linkerd — solve real problems (mTLS, traffic split, retries) but add latency, learning, and operational burden.
Plain Helm, no operator
You can read the templates, copy them, fork them. No magic, no controller to debug.
A custom operator — overkill for a single-app chart.
Performance & observability
The chart itself adds no measurable runtime overhead — it is yaml that becomes a Deployment + Service + Ingress + HPA + PDB. Performance is whatever your app and the upstream tools deliver.
kube-prometheus-stack’s Prometheus instance comfortably handles 10k samples/sec on a default 1 vCPU / 2GB allocation, more if you bump resources. Loki at the recommended footprint handles 50–100GB/day of logs without strain. The whole observability bundle uses approximately 2 vCPU and 4GB of cluster RAM at idle, more under load.
Cert issuance runs once per certificate, not per request. ingress-nginx routes are O(1) per request after TLS termination. The HPA polls metrics every 15 seconds by default; tune lower if you need faster scale-up but accept more flapping.
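The 15-second poll itself is a controller-manager setting (--horizontal-pod-autoscaler-sync-period); the per-HPA lever for the speed-versus-flapping trade-off is the autoscaling/v2 behavior block, sketched here with illustrative values:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0       # react to spikes immediately
    scaleDown:
      stabilizationWindowSeconds: 300     # damp flapping on the way down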
The cluster footprint we recommend (3 × 2 vCPU / 4GB nodes on DigitalOcean) sustains around 50 small Next.js apps before becoming a concern. Past that, scale the cluster and consider a separate observability namespace on a dedicated node pool.
Where it is heading
- Argo Rollouts integration for canary and blue-green deploys via values flags.
- A second chart for FastAPI/Python apps with the same shape and the same ServiceMonitor.
- Velero install script for namespaced backup and restore.
- A pre-baked cost dashboard reading kube-state-metrics + node prices to show spend per namespace.
- OpenTelemetry collector chart with sane defaults for trace shipping to Tempo.
Read the full whitepaper for the formal technical write-up.