Technical Whitepaper · v1.0

k8s-ops-toolkit

A small Helm chart and a bootstrap script. Five commands, about eight minutes, production-grade Next.js on Kubernetes.

MIT Licence · Helm 3.16+ · kube-prometheus-stack · Loki + Promtail · cert-manager · No service mesh

  • 5 commands
  • ~8 min to live
  • ~$70/mo platform
  • < 50 LOC per template

§ Abstract

Most teams running Next.js on Kubernetes solve the same five problems in their first month: ingress with TLS, autoscaling, metrics, log aggregation, and alert routing. Each is a few hours of work. Together they are a week of yak-shaving before anyone is comfortable pushing to production.

The k8s-ops-toolkit is that week, written down. A Helm chart for the application layer and a bootstrap script for the platform layer. The chart deploys your Next.js app with a deployment, service, ingress (TLS via cert-manager), HPA, PDB, and a Prometheus ServiceMonitor. The bootstrap script installs ingress-nginx, cert-manager, kube-prometheus-stack, and Loki + Promtail with sane defaults and pre-baked Grafana dashboards.

This whitepaper documents the architectural decisions, the strict definition of "production-grade" we use, and what the toolkit deliberately does not include.

§ 1 What "production-grade" means here

We use a strict, opinionated definition. Production-grade means all seven of these are present and working:

  1. TLS by default with automatic renewal.
  2. Autoscaling off CPU at minimum, optionally off custom metrics.
  3. Pod disruption budget so cluster maintenance does not take you down.
  4. Metrics scraped, dashboarded, alertable.
  5. Logs centralised, queryable, retainable.
  6. Health probes that catch broken deploys.
  7. Rolling updates with maxSurge and maxUnavailable tuned correctly.

If any one of these is missing, you are not production-grade. The toolkit ensures none of them are missing.
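The seven requirements above surface as chart values; a minimal sketch, assuming hypothetical key names (the chart's own values.yaml is the authority on the real schema):

```yaml
# Illustrative values.yaml sketch — key names are hypothetical;
# check the chart's actual values.yaml for the real schema.
ingress:
  enabled: true
  tls: true                          # cert-manager issues and renews
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
pdb:
  minAvailable: 1                    # survive node drains
probe:
  path: /api/health                  # must return 200 OK
strategy:
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0                # never drop below desired count
```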

§ 2 Architecture decisions

Why Helm, not raw YAML or Kustomize

Helm is the lingua franca for Kubernetes app distribution. Every operator, every CI/CD platform, every cluster-as-a-service product knows how to consume a Helm chart. Kustomize is more elegant for some problems, but ecosystem support is thinner. Raw YAML is unmaintainable once you have more than two environments.

Why ingress-nginx, not Traefik or HAProxy

ingress-nginx is the most-deployed ingress in the wild. Documentation, examples, and Stack Overflow answers are all biased toward it. Performance is fine for any non-Twitter-scale workload. Traefik is elegant; HAProxy is fast; nginx is what most teams actually run.
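In practice, wiring ingress-nginx to cert-manager comes down to one annotation on a standard Ingress; a minimal hand-written equivalent (the host, names, and issuer are placeholders — the annotation assumes a ClusterIssuer named letsencrypt-prod exists):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                                        # placeholder
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # assumed ClusterIssuer
spec:
  ingressClassName: nginx
  tls:
    - hosts: [app.example.com]
      secretName: my-app-tls        # cert-manager creates and renews this Secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 3000      # Next.js default port
```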

Why kube-prometheus-stack, not a custom Prometheus

The chart bundles Prometheus + Grafana + Alertmanager + node exporters + kube-state-metrics + the ServiceMonitor CRD. Hand-rolling all of these correctly takes two days; the kube-prometheus-stack chart does it in 90 seconds. We do not need to be opinionated here; the upstream stack is correct.
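Even with sane defaults, a couple of upstream values are worth pinning explicitly; a hedged example of overrides one might pass to the kube-prometheus-stack chart (retention and storage sizes are illustrative — consult the upstream chart's values.yaml):

```yaml
# Overrides for the upstream kube-prometheus-stack chart.
# Values shown are illustrative, not recommendations.
prometheus:
  prometheusSpec:
    retention: 15d                   # the upstream default, pinned explicitly
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi          # size to ~2x expected TSDB footprint
grafana:
  adminPassword: change-me           # set via a Secret in real deployments
```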

Why Loki, not ELK

Loki indexes labels, not log content. That is the cost optimisation: storage is cheap, indexing is expensive. For SME-scale log volumes (under 100GB/day), Loki is roughly an order of magnitude cheaper than Elasticsearch and the query experience inside Grafana is better than Kibana for most ops tasks. If your log volumes are above 1TB/day, Elasticsearch starts to win on ergonomics — we are not solving that case.
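Retention is the one Loki knob worth setting on day one so that storage stays bounded; a hedged fragment of Loki's configuration (values are illustrative; key names follow Loki's own config format):

```yaml
# Loki config fragment — retention values are illustrative.
limits_config:
  retention_period: 744h     # ~31 days of logs
compactor:
  retention_enabled: true    # the compactor enforces the retention period
```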

Why no service mesh

Service meshes solve real problems: mTLS between services, traffic splitting, retry policies, circuit breaking. They also add operational complexity, latency, and a learning curve. For Next.js apps that are mostly HTTP-in HTTP-out and occasionally talk to a database, a mesh is overkill. If you reach the point where you need one, you are past what this toolkit is solving.

§ 3 Chart shape

The chart is intentionally readable. Every template is short. There is no umbrella chart, no library chart. You can read the whole thing in twenty minutes.

charts/nextjs-app/
├── Chart.yaml
├── values.yaml          # all the knobs in one place
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml     # cert-manager annotations
    ├── hpa.yaml         # CPU/memory autoscaler
    ├── pdb.yaml         # pod disruption budget
    ├── servicemonitor.yaml  # for Prometheus
    └── _helpers.tpl

Common values and their defaults are documented in the Helm Chart reference. The chart expects a /api/health endpoint returning 200 OK; override the path with --set probe.path=/health if you have a different convention. If you want Prometheus to scrape, expose /api/metrics in Prometheus text format and set serviceMonitor.enabled=true.
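Concretely, the health-check convention renders to ordinary Kubernetes probes; a sketch of the shape (port and timings are illustrative, not the chart's exact output):

```yaml
# Rough shape of what the probe convention renders to — timings illustrative.
livenessProbe:
  httpGet:
    path: /api/health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/health
    port: 3000
  periodSeconds: 5
  failureThreshold: 3        # three failed checks before the pod leaves rotation
```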

§ 4 Observability stack

The bootstrap script installs the three things you need before you go live: metrics, logs, and alerts. The fourth — distributed tracing — is deliberately not bundled.

Layer             Tool           Why this one
Metrics scrape    Prometheus     The default. Everything speaks Prometheus.
Dashboards        Grafana        Nothing is faster to point at Prometheus.
Alert routing     Alertmanager   Comes free with kube-prometheus-stack.
Log aggregation   Loki           Cheaper than ELK at SME scale.
Log shipping      Promtail       Loki's official agent. Zero config.
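The glue between the chart and this stack is the ServiceMonitor resource; a minimal hand-written equivalent (names and labels are placeholders, and the release label must match whatever your Prometheus is configured to select):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                       # placeholder
  labels:
    release: kube-prometheus-stack   # so the stack's Prometheus selects it
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app # must match the Service's labels
  endpoints:
    - port: http                     # named port on the Service
      path: /api/metrics
      interval: 30s
```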

Pre-baked dashboards

Three Grafana dashboards ship with the toolkit:

  • Cluster Overview: node CPU, memory, disk; pod restarts; image pull errors.
  • Ingress nginx: RPS, p50/p95/p99 latency, status code distribution, top hosts.
  • Next.js app: HTTP request rate, error rate, p95 latency, Node.js heap usage, event loop lag.
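The stock kube-prometheus-stack Grafana runs a sidecar that loads dashboards from ConfigMaps carrying a well-known label, which is the usual way dashboards like these ship alongside a chart; a sketch (names are placeholders; the label shown is the sidecar's default):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-nextjs-app    # placeholder
  labels:
    grafana_dashboard: "1"      # default label the Grafana sidecar watches
data:
  # The value is the exported dashboard JSON; trimmed to a stub here.
  nextjs-app.json: |
    {"title": "Next.js app (placeholder)", "panels": []}
```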

Pre-baked alerts

Default Alertmanager rules, in addition to the kube-prometheus defaults:

  • KubePodCrashLooping: restart count > 5 in 10 min.
  • KubePersistentVolumeFillingUp: predicted full within 6 hours.
  • IngressNginxHigh5xxRate: > 5% 5xx responses for 5 min.
  • IngressNginxHighLatency: p99 > 2 s for 5 min.
  • CertManagerCertificateExpirySoon: certificate expires in under 14 days.
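As an illustration of how one of these rules can be expressed as a PrometheusRule (the metric name follows ingress-nginx's exporter; the manifest name and labels are placeholders, with the threshold matching the text):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ingress-nginx-5xx            # placeholder
  labels:
    release: kube-prometheus-stack   # so the stack's Prometheus loads it
spec:
  groups:
    - name: ingress.rules
      rules:
        - alert: IngressNginxHigh5xxRate
          expr: |
            sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
              / sum(rate(nginx_ingress_controller_requests[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "More than 5% of ingress requests are returning 5xx"
```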

§ 5 Cost

Self-hosted on a 3-node cluster at DigitalOcean prices:

  • Cluster: ~$36/month (3× s-2vcpu-4gb)
  • Persistent disks for Prometheus + Loki: ~$20/month
  • LoadBalancer for ingress: $12/month

Total: roughly $70/month for the platform stack, hosting an unlimited number of Next.js apps. Compare to a managed equivalent (Render, Fly, Railway), which charges per-app and adds up faster as you grow.

Time saved is the larger value. A senior engineer hand-rolling the equivalent stack spends 3-5 days; spread across a small team, with coordination overhead, that is closer to a week and a half. The toolkit collapses it to an afternoon.

§ 6 What the toolkit does not solve

  • Application-level concerns: code quality, business logic, data integrity.
  • Multi-region failover. Single cluster, single region, by design.
  • Compliance frameworks beyond the basics. SOC2, ISO 27001, and HIPAA are out of scope; the toolkit is foundational, not certifying.
  • Distributed tracing. Add Tempo + OpenTelemetry if you need it; it is not bundled.
  • Long-term metric storage. Prometheus stores 15 days by default. For longer retention, point at Mimir or Cortex.
  • APM (DataDog, New Relic). Pick one if you are already paying for it. Not bundled here.

§ 7 Recommended companion repos

  • terraform-stack provisions the cluster, DNS, and storage that this toolkit deploys onto.
  • agent-orchestrator is one example of a Next.js + worker app that fits the chart cleanly.
  • Sarmalink-ai deploys cleanly with this chart's defaults — same shape, same probes, same metrics endpoints.

k8s-ops-toolkit · Built by Sarma Linux · MIT licensed