How k8s-ops-toolkit works
A complete tour of the architecture: data flow, subsystems, technology choices, performance, and where the project is heading next.
A small Helm chart deploys your Next.js app with deployment, service, ingress (TLS via cert-manager), HPA, PDB, and a Prometheus ServiceMonitor. A bash bootstrap script installs ingress-nginx, cert-manager, kube-prometheus-stack, and Loki + Promtail with sane defaults. Three Grafana dashboards and a default Alertmanager ruleset ship with the repo. Five commands, about eight minutes, production-grade.
Core data flow
From the moment a request enters the system to the moment a response leaves it.
Internet
│
▼
ingress-nginx (LoadBalancer)
│   TLS terminated here; certs issued by cert-manager (Let's Encrypt)
▼
Service (ClusterIP)
│
▼
Deployment (Next.js pods)
│ /api/health (readiness)
│ /api/metrics (Prometheus scrape)
│
├─── HPA scales replicas on CPU / RPS
│
├─── PDB protects against simultaneous evictions
│
│
┌────┴───────────────────────────────────┐
│ Prometheus (kube-prometheus-stack) │
│ - ServiceMonitor selects pods │
│ - 15-day retention default │
│ │
│ Loki + Promtail │
│ - DaemonSet ships container logs │
│ - Indexed by namespace + app + pod │
│ │
│ Grafana │
│ - 3 pre-baked dashboards │
│ - Loki + Prometheus datasources │
│ │
│ Alertmanager │
│ - Default rules (CrashLoop, 5xx, │
│ latency, cert expiry, disk press.) │
│ - Receivers configured per cluster │
└─────────────────────────────────────────┘
Each subsystem, deep-dived
Every component in the data flow above, opened up and explained.
The Helm chart
The chart at charts/nextjs-app is intentionally readable. Six template files plus a helpers partial. Every template is short. There is no umbrella chart, no library chart. Common knobs live in values.yaml: image, replicaCount, resources, env vars, ingress host and TLS, autoscaling, PDB, ServiceMonitor.
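A values override for a typical app stays short. The sketch below is illustrative only — key names follow common chart conventions and the hostnames and image are placeholders; the chart's own values.yaml is the source of truth.

# values-prod.yaml — illustrative sketch, not the chart's literal schema
image:
  repository: registry.example.com/my-nextjs-app   # placeholder image
  tag: "1.4.2"
replicaCount: 2
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
ingress:
  enabled: true
  host: app.example.com                             # placeholder host
  tls:
    enabled: true
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
pdb:
  enabled: true
  minAvailable: 1
serviceMonitor:
  enabled: true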
The deployment template wires probes (/api/health by default, override with --set probe.path=/health), security context, and rolling-update strategy. The ingress template includes cert-manager annotations so a TLS certificate is requested automatically when you set ingress.tls.enabled=true. The ServiceMonitor template emits the CRD that kube-prometheus-stack uses to discover and scrape your pods.
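Rendered, the two fragments that matter most look roughly like this, trimmed to the relevant fields (host and secret names are placeholders):

# Deployment fragment: the probe path comes from probe.path in values
readinessProbe:
  httpGet:
    path: /api/health
    port: http
  initialDelaySeconds: 5
  periodSeconds: 10
---
# Ingress fragment when ingress.tls.enabled=true
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - app.example.com            # placeholder host
      secretName: app-example-com-tls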
The bootstrap script
scripts/install.sh takes two arguments: the email for Let’s Encrypt registration and the apex domain you control. It installs four upstream charts in order: ingress-nginx (with the LoadBalancer service that becomes your public IP), cert-manager (with a letsencrypt-prod ClusterIssuer using the email you passed), kube-prometheus-stack (Prometheus, Grafana, Alertmanager, node exporters), and Loki + Promtail.
The whole thing takes about 3 minutes on a 3-node cluster. Once it completes, you can install your Next.js app with the chart, and the certificate is issued within about 60 seconds of the DNS record resolving.
cert-manager + Let's Encrypt
cert-manager watches Ingress resources for cert-manager.io/cluster-issuer annotations. The chart sets this annotation when ingress.tls.enabled=true, pointing at the letsencrypt-prod ClusterIssuer that the bootstrap installed. The HTTP-01 challenge is solved through the same ingress, which is why the order matters: ingress-nginx first, cert-manager second, then your apps.
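The ClusterIssuer the bootstrap creates follows the canonical HTTP-01 shape; a sketch, with the email as a placeholder:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com               # the email passed to install.sh
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx                 # challenge solved through ingress-nginx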
Renewals happen automatically 30 days before expiry. The default Alertmanager ruleset includes CertManagerCertificateExpirySoon, which fires if a certificate is within 14 days of expiry — early enough to investigate before customers notice.
Prometheus + Grafana
kube-prometheus-stack does the heavy lifting. Prometheus scrapes any pod whose Service has a matching ServiceMonitor; the chart emits one for your app. Grafana ships with the standard cluster and ingress dashboards, plus a custom Next.js app dashboard that expects the conventional http_requests_total, http_request_duration_seconds, and Node.js process metrics from prom-client.
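The emitted ServiceMonitor is only a few lines. This sketch uses assumed names and labels — the chart's own selectors are the source of truth:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                            # assumed release name
  labels:
    release: kube-prometheus-stack        # so the operator's selector matches
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app      # must match the Service's labels
  endpoints:
    - port: http
      path: /api/metrics
      interval: 30s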
Default retention is 15 days. For longer retention, point Prometheus at remote storage (Mimir, Cortex, or Grafana Cloud) — the toolkit deliberately stops at the local-retention layer.
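If you do outgrow local retention, the hook is Prometheus remote write; with kube-prometheus-stack that is a values override along these lines (the endpoint is a placeholder):

# kube-prometheus-stack values override — ship samples to long-term storage
prometheus:
  prometheusSpec:
    retention: 15d
    remoteWrite:
      - url: https://mimir.example.com/api/v1/push   # placeholder endpoint
        # add basicAuth / headers as your backend requires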
Loki + Promtail
Promtail runs as a DaemonSet, tailing every container log on every node. Logs are shipped to Loki with three indexed labels: namespace, app, pod. Everything else stays in the log line; this is the cost optimisation that makes Loki cheaper than ELK at SME scale.
Querying happens inside Grafana via LogQL: {namespace="default", app="my-app"} |= "error". For request-rate views, LogQL aggregations work the way Prometheus aggregations do: sum by (status) (rate({app="my-app"}[5m])).
Alertmanager rules
The default ruleset covers what actually goes wrong: pod crash loops, ingress 5xx spikes, p99 latency regressions, certificate near-expiry, persistent volume disk pressure, node memory pressure, node not ready. Receivers (Slack, PagerDuty, email) are configured in values-alertmanager.yaml per cluster.
Adding a custom alert means editing one YAML file in manifests/prometheus-rules/ and applying it; the change is picked up and the rules reload within about 30 seconds.
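A custom rule file is an ordinary PrometheusRule object. A sketch of the shape, using the http_requests_total convention mentioned above (names and thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-custom-rules
  labels:
    release: kube-prometheus-stack        # match the stack's rule selector
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "my-app 5xx ratio above 5% for 10 minutes"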
Why this stack
The road not taken matters as much as the road taken. Here is what was picked and what was rejected, and the reasoning behind both.
Helm 3
The lingua franca for Kubernetes app distribution. Every CI/CD platform consumes Helm charts.
Kustomize — more elegant for some problems but ecosystem support is thinner. Raw yaml — unmaintainable across environments.
ingress-nginx
The most-deployed ingress in the wild. Documentation, examples, Stack Overflow are all biased toward it.
Traefik (elegant), HAProxy (fast). Both fine. nginx is what most teams actually run, so support questions resolve faster.
cert-manager + Let's Encrypt
Free TLS at scale, automatic renewal, native Kubernetes integration via CRDs.
A commercial CA — fine if you need EV certs. Most workloads do not.
kube-prometheus-stack
Bundles Prometheus, Grafana, Alertmanager, node exporters, kube-state-metrics, ServiceMonitor CRD. Self-installing all of these correctly takes two days.
A custom Prometheus install — you would re-implement this chart. We did not need to be opinionated; the upstream is correct.
Loki
Indexes labels, not log content. Roughly an order of magnitude cheaper than Elasticsearch at SME log volumes (under 100GB/day).
ELK — wins on ergonomics above 1TB/day; we are not solving that case.
No service mesh
Mesh complexity is rarely worth the cost for Next.js workloads, which are mostly HTTP-in HTTP-out.
Istio, Linkerd — solve real problems (mTLS, traffic split, retries) but add latency, learning, and operational burden.
Plain Helm, no operator
You can read the templates, copy them, fork them. No magic, no controller to debug.
A custom operator — overkill for a single-app chart.
Performance & observability
The chart itself adds no measurable runtime overhead — it is yaml that becomes a Deployment + Service + Ingress + HPA + PDB. Performance is whatever your app and the upstream tools deliver.
kube-prometheus-stack’s Prometheus instance comfortably handles 10k samples/sec on a default 1 vCPU / 2GB allocation, more if you bump resources. Loki at the recommended footprint handles 50–100GB/day of logs without strain. The whole observability bundle uses approximately 2 vCPU and 4GB of cluster RAM at idle, more under load.
Cert issuance runs once per certificate, not per request. ingress-nginx routes are O(1) per request after TLS termination. The HPA polls metrics every 15 seconds by default; tune lower if you need faster scale-up but accept more flapping.
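The 15-second poll itself is a controller-manager setting (--horizontal-pod-autoscaler-sync-period); the per-HPA lever for the speed-versus-flapping trade-off is the autoscaling/v2 behavior block, sketched here with illustrative values:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0       # react to spikes immediately
    scaleDown:
      stabilizationWindowSeconds: 300     # damp flapping on the way down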
The cluster footprint we recommend (3 × 2 vCPU / 4GB nodes on DigitalOcean) sustains around 50 small Next.js apps before becoming a concern. Past that, scale the cluster and consider a separate observability namespace on a dedicated node pool.
Where it is heading
- Argo Rollouts integration for canary and blue-green deploys via values flags.
- A second chart for FastAPI/Python apps with the same shape and the same ServiceMonitor.
- Velero install script for namespaced backup and restore.
- A pre-baked cost dashboard reading kube-state-metrics + node prices to show spend per namespace.
- OpenTelemetry collector chart with sane defaults for trace shipping to Tempo.
Read the full whitepaper for the formal technical write-up.