Next.js on Kubernetes, production-grade in five commands.
A Helm chart for the app and a version-pinned bootstrap for the platform. Ingress, TLS, autoscaling, metrics, logs, alerts, spend tracking and an optional GitOps path. No YAML to write on your side.
Why this exists
Most teams reach for Kubernetes when they outgrow Vercel or want to cut a hosting bill that has stopped making sense. Then they spend two weeks configuring the same things every other team configures: ingress, cert-manager, monitoring, logging, autoscaling, secrets. The end result is fine; the path to it is a waste.
The big ecosystems solve much larger problems. Argo and Crossplane bring serious machinery for serious orgs. Backstage brings a developer portal. The lighter starters often skip observability entirely and leave the next person to wire metrics by hand.
The toolkit sits in the middle: a small Helm chart you can read in twenty minutes, a single version-pinned bash installer for the platform, an ArgoCD app-of-apps if you prefer GitOps, and dashboards and alert rules already tuned for Next.js workloads. The week you would have spent, given back.
Why this matters
The first week on any new cluster is identical across teams: ingress, TLS, autoscaling, metrics, logs, alerts. Burning that week on every project is a tax. The toolkit pays that tax once, in public, and pins the answers so every cluster after this one inherits them. The time you keep is the entire point.
What is in the box
Everything below is pinned, tested, and wired together by the installer. Nothing here is aspirational.
Helm chart for Next.js
Deployment with tuned rolling update strategy, hardened pod and container security context, ClusterIP service, ingress with cert-manager TLS, HorizontalPodAutoscaler, PodDisruptionBudget, liveness and readiness probes on /api/health, and a Prometheus ServiceMonitor scraping /api/metrics.
cert-manager with Let's Encrypt
The installer creates the Let's Encrypt production ClusterIssuer and wires every ingress to request a real certificate. Automatic renewal. A bundled alert fires when a certificate is within fourteen days of expiry.
kube-prometheus-stack
Prometheus, Grafana, Alertmanager and node-exporter installed from the upstream community chart at a pinned version. The app chart's ServiceMonitor picks up app metrics without extra config.
Loki 3.x with Promtail
Logs go from stdout to Promtail to Loki, queryable from Grafana with the same Explore UI as metrics. Labels keep the index small; the bulk lives on object storage.
Alertmanager rules
Bundled PrometheusRule covers crash-looping pods, ingress-nginx 5xx spikes above five percent, p99 latency above two seconds, certificate expiry inside fourteen days, and PV space predicted to exhaust within six hours. Slack webhook wired by the installer.
HPA on CPU by default
Default min 2, max 10, target 70 percent CPU. A documented pattern in the wiki swaps in custom-metrics HPA via the ServiceMonitor for requests-per-second autoscaling when you need it.
ingress-nginx, documented
The ingress controller most teams already run. The chart documents annotations for body-size limits, websocket support, redirects, and per-host TLS. ingress-nginx is exposed through a Service of type LoadBalancer and is the cluster's entry point.
PodDisruptionBudget on by default
minAvailable: 1 keeps at least one replica running during voluntary disruptions such as node drains and cluster upgrades. Disable it per release when a workload runs a single replica, where this budget would block drains altogether.
No service mesh by design
For a small fleet of Next.js workloads, mesh complexity is rarely worth the operational cost. The toolkit deliberately does not ship one. You add Linkerd or Istio when you have a reason.
Plain Helm, no operator
You can read every template, copy it, fork it. No CRDs to learn beyond what cert-manager and Prometheus already require. No hidden state.
OpenCost spend dashboard
OpenCost is pinned and installed with the rest. A bundled Grafana dashboard breaks down cluster spend by namespace and workload so you can see where the money goes.
ArgoCD app-of-apps for GitOps
Prefer reconciliation from git over a bash installer? The same components are described as a set of ArgoCD Applications under gitops/argocd. Apply the root once and git becomes the source of truth.
End-to-end pytest suite
A real Helm render is fed into pytest fixtures that assert pod-selector match, service target-port wiring, TLS wiring, gating of optional objects, and version parity between the installer and the GitOps Applications.
Version-pinned everything
Every upstream chart version is declared in scripts/install.sh and mirrored in the Argo Applications. The same command produces the same platform every time.
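As a rough illustration of the kind of wiring assertion the end-to-end suite is described as making (pod-selector match and service target-port wiring), the sketch below checks a Deployment against its Service. It is not the repository's test code: the function names and the hand-written manifests are hypothetical, and a real suite would feed in the output of `helm template` rather than literal dictionaries.

```python
def selects(selector: dict, labels: dict) -> bool:
    """True when every selector key/value pair is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def check_wiring(deployment: dict, service: dict) -> list:
    """Return a list of wiring problems between a Deployment and its Service."""
    problems = []
    template = deployment["spec"]["template"]
    pod_labels = template["metadata"]["labels"]

    if not selects(deployment["spec"]["selector"]["matchLabels"], pod_labels):
        problems.append("Deployment selector does not match pod template labels")
    if not selects(service["spec"]["selector"], pod_labels):
        problems.append("Service selector does not match pod template labels")

    # Collect container ports by number and by name; targetPort may use either.
    exposed = set()
    for container in template["spec"]["containers"]:
        for port in container.get("ports", []):
            exposed.add(port["containerPort"])
            if "name" in port:
                exposed.add(port["name"])

    for port in service["spec"]["ports"]:
        # Kubernetes defaults targetPort to the Service port when it is omitted.
        target = port.get("targetPort", port["port"])
        if target not in exposed:
            problems.append(f"Service targetPort {target!r} is not exposed by any container")
    return problems


# Hypothetical manifests standing in for a rendered chart.
LABELS = {"app.kubernetes.io/name": "nextjs-app", "app.kubernetes.io/instance": "my-app"}
DEPLOYMENT = {
    "spec": {
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {"containers": [{"name": "app", "ports": [{"name": "http", "containerPort": 3000}]}]},
        },
    }
}
SERVICE = {"spec": {"selector": LABELS, "ports": [{"port": 3000, "targetPort": "http"}]}}

if __name__ == "__main__":
    print(check_wiring(DEPLOYMENT, SERVICE))
```

Returning a list of problems instead of raising on the first one lets a single test report every wiring fault in a render at once.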
Cluster topology
Internet traffic enters through ingress-nginx, lands on the Next.js pods, and produces metrics and logs that Prometheus, Loki and OpenCost feed back into Grafana and Alertmanager.
Request path
What a user request actually touches, from the LoadBalancer to the streamed response.
Quick start
Five commands. About eight minutes from an empty cluster to ingress, TLS, metrics, logs, alerts, spend tracking, and your first app running.
git clone https://github.com/sarmakska/k8s-ops-toolkit.git
cd k8s-ops-toolkit
export KUBECONFIG=~/.kube/your-cluster.yaml
./scripts/install.sh \
  --domain example.com \
  --email you@example.com \
  --slack-webhook https://hooks.slack.com/services/...
./scripts/load-dashboards.sh  # Cluster Overview, Next.js app, OpenCost spend installed as sidecar ConfigMaps
helm install my-app ./charts/nextjs-app \
  --set image.repository=ghcr.io/you/my-app \
  --set image.tag=v1.0.0 \
  --set ingress.host=app.example.com \
  --set replicas=3
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80
# user: admin · password:
kubectl -n monitoring get secret kube-prometheus-stack-grafana -o jsonpath='{.data.admin-password}' | base64 -d
A real values.yaml
The actual default values that ship with charts/nextjs-app. Sensible defaults, then override only what your service needs.
replicas: 2
image:
  repository: ghcr.io/your-org/your-app
  tag: latest
  pullPolicy: IfNotPresent
  pullSecrets: []
rollingUpdate:
  maxSurge: 1
  maxUnavailable: 0
service:
  port: 3000
ingress:
  enabled: true
  className: nginx
  host: app.example.com
  annotations: {}
  tls:
    enabled: true
    issuer: letsencrypt-prod
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 1000m, memory: 1Gi }
autoscaling:
  enabled: true
  min: 2
  max: 10
  targetCPU: 70
pdb:
  enabled: true
  minAvailable: 1
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
  seccompProfile: { type: RuntimeDefault }
containerSecurityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ALL]
probes:
  liveness:
    path: /api/health
    initialDelaySeconds: 30
    periodSeconds: 10
  readiness:
    path: /api/health
    initialDelaySeconds: 5
    periodSeconds: 5
monitoring:
  enabled: true
  prometheusServiceMonitor: true
  metricsPath: /api/metrics
  metricsPort: 3000
  interval: 30s
  serviceMonitorLabels:
    release: monitoring
Full reference: Helm-Chart wiki page
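To make the autoscaling defaults in these values concrete (min 2, max 10, targetCPU 70): the HorizontalPodAutoscaler computes its target as ceil(currentReplicas × currentMetric / targetMetric), leaves the count alone when that ratio is within a tolerance of 1 (0.1 by default), and clamps the result to the configured bounds. The sketch below applies that documented formula. The function name is illustrative, and it ignores stabilization windows, unready pods and missing metrics, which the real controller also takes into account.

```python
import math


def desired_replicas(current: int, cpu_percent: float, target_percent: float = 70,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single CPU utilization metric."""
    ratio = cpu_percent / target_percent
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: the controller leaves the replica count alone.
        proposed = current
    else:
        proposed = math.ceil(current * ratio)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, proposed))


if __name__ == "__main__":
    for cpu in (35, 72, 140, 400):
        print(f"3 replicas at {cpu}% CPU -> {desired_replicas(3, cpu)}")
```

With three replicas running, 140 percent average utilization doubles the count to six, while 72 percent falls inside the tolerance band and changes nothing.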
Platform components
Every upstream chart is pinned in scripts/install.sh and mirrored in gitops/argocd, so both paths install the same versions.
| Component | Purpose |
|---|---|
| ingress-nginx | Layer-7 ingress controller exposed as a LoadBalancer service. The cluster's only public endpoint. |
| cert-manager | Issues and renews TLS certificates via Let's Encrypt. ClusterIssuer is created by the installer. |
| kube-prometheus-stack | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics. The metrics backbone. |
| Loki 3.x | Log aggregation. Cheap to run because it indexes labels, not log content. |
| Promtail | Sidecar-less log shipper. Reads container stdout from the node, ships to Loki. |
| OpenCost | Spend attribution. Queries Prometheus for utilisation, emits cost-per-namespace and cost-per-workload. |
| Alertmanager | Routes alerts to Slack via the installer-supplied webhook. PagerDuty or Opsgenie one variable away. |
Bundled alert rules
PrometheusRule under manifests/prometheus-rules/app-rules.yaml. Loaded automatically when the kube-prometheus-stack release label matches.
| Alert | Severity | Fires when |
|---|---|---|
| KubePodCrashLooping | critical | A container restarts more than five times in ten minutes |
| KubePersistentVolumeFillingUp | warning | A PV is predicted to run out of space within six hours |
| IngressNginxHigh5xxRate | critical | Ingress 5xx ratio above five percent for five minutes |
| IngressNginxHighLatency | warning | Ingress p99 latency above two seconds for five minutes |
| CertManagerCertificateExpirySoon | warning | A certificate is within fourteen days of expiry and has not been renewed |
Slack webhook wired by the installer through manifests/values-alertmanager.yaml.
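A condition such as "predicted to run out of space within six hours" is typically written in PromQL with `predict_linear`, which fits a least-squares line through recent samples and extrapolates it. The sketch below reproduces that idea in plain Python for intuition only. The function names and sample data are made up, and the bundled rule's actual expression is not shown on this page.

```python
def linear_fit(samples):
    """Least-squares slope and intercept for (seconds, free_bytes) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    return slope, mean_v - slope * mean_t


def predicted_free_bytes(samples, horizon_seconds):
    """Extrapolate free space horizon_seconds past the newest sample."""
    slope, intercept = linear_fit(samples)
    newest = max(t for t, _ in samples)
    return slope * (newest + horizon_seconds) + intercept


def will_fill_up(samples, horizon_seconds=6 * 3600):
    """True when the fitted line reaches zero free bytes within the horizon."""
    return predicted_free_bytes(samples, horizon_seconds) <= 0


if __name__ == "__main__":
    # Made-up volume: 10 GiB free, losing 1 GiB every 30 minutes.
    gib = 1024 ** 3
    shrinking = [(i * 1800, (10 - i) * gib) for i in range(5)]
    print(will_fill_up(shrinking))
```

Alerting on the extrapolated trend rather than on a fixed free-space threshold is what lets a warning fire while there is still time to react.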
The install script
The platform install is a single bash script. Every chart version is pinned. Idempotent.
#!/usr/bin/env bash
set -euo pipefail

# Pinned upstream chart versions, mirrored in gitops/argocd.
INGRESS_VERSION=4.11.3
CERT_MANAGER_VERSION=v1.15.3
KPS_VERSION=65.1.0
LOKI_VERSION=6.16.0
PROMTAIL_VERSION=6.16.5
OPENCOST_VERSION=2.4.6

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add jetstack https://charts.jetstack.io
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --version "$INGRESS_VERSION" -n ingress-nginx --create-namespace

helm upgrade --install cert-manager jetstack/cert-manager \
  --version "$CERT_MANAGER_VERSION" -n cert-manager --create-namespace \
  --set installCRDs=true

# EMAIL is expected to be set from --email; argument parsing is not shown here.
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata: { name: letsencrypt-prod }
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${EMAIL}
    privateKeySecretRef: { name: letsencrypt-prod }
    solvers: [ { http01: { ingress: { class: nginx } } } ]
EOF

helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --version "$KPS_VERSION" -n monitoring --create-namespace \
  -f manifests/values-alertmanager.yaml

helm upgrade --install loki grafana/loki \
  --version "$LOKI_VERSION" -n monitoring -f manifests/values-loki.yaml

helm upgrade --install promtail grafana/promtail \
  --version "$PROMTAIL_VERSION" -n monitoring

helm upgrade --install opencost opencost/opencost \
  --version "$OPENCOST_VERSION" -n monitoring
Truncated for the page. The real script also wires the Slack webhook, waits for cert-manager webhooks, and prints a readiness summary.
GitOps with ArgoCD
Prefer the platform reconciled from git instead of installed by hand? Apply the app-of-apps root once.
Full GitOps walkthrough: GitOps wiki page
Use cases
What teams actually run this for.
First production cluster
Greenfield team going from "we deploy to Vercel" to "we run our own k8s." Skip the week of yak-shaving on ingress, TLS, autoscaling and metrics.
Adding observability later
Apps already running but no metrics or logs. The installer drops in Prometheus, Loki and Grafana in an afternoon without touching workloads.
Standardising deploys
Pin every Next.js service in your org to the same chart. Consistent probes, consistent autoscaling, consistent alerts, consistent labels.
Cost-controlled SaaS infrastructure
One DigitalOcean cluster hosting an arbitrary number of services. OpenCost surfaces where the spend lives. Predictable bill.
Platform team with multiple Next.js services
Each service installs the chart with its own values file. Helm release name is the unit of isolation. ArgoCD reconciles the platform from git.
Staging environments that look like prod
Same install script, smaller node pool. Real TLS, real metrics, real alerts, a fraction of the spend.
k8s-ops-toolkit vs alternatives
How the toolkit compares to other ways to put a Next.js app on Kubernetes, scope by scope.
| Feature | k8s-ops-toolkit | Stock Helm + bash | Backstage | Pure ArgoCD | Vercel |
|---|---|---|---|---|---|
| Helm chart for Next.js | Yes, opinionated | Build yourself | Via plugin | Bring your own | N/A |
| TLS via cert-manager | Pinned + wired | Manual install | Out of scope | Manual install | Managed |
| Prometheus + Grafana | Pinned + dashboards | Manual install | Out of scope | Manual install | Managed |
| Loki for logs | Pinned | Manual install | Out of scope | Manual install | Managed |
| OpenCost spend | Pinned + dashboard | Manual install | Out of scope | Manual install | Limited |
| GitOps reconcile | ArgoCD app-of-apps | Bring your own | Out of scope | Yes, native | N/A |
| E2E test suite | pytest renders chart | No | No | No | N/A |
| Licence | MIT | Varies by component | Apache 2.0 | Apache 2.0 | Commercial |
| Time to go live | About 8 minutes | Days | Days | Hours | Minutes (managed) |
Tech stack
Every piece pinned. No surprise minor-version drift.
Documentation & guides
The wiki is the deep reference. Architecture, dashboards, alert rules, and the GitOps path are written down.
Frequently asked
The questions that come up most often before adoption.
Why not just use Vercel?
For some teams Vercel is the right answer forever. For others, three or four services on Vercel cost more than a single $70-a-month DigitalOcean cluster that hosts an arbitrary number of apps. This toolkit is for the day you cross that line.
Does it lock me into ingress-nginx?
No. ingress-nginx is the default because it is the controller most teams already run and the one the bundled rules and dashboards target. Swap to Traefik or Contour and the chart still works; you would re-author the ingress-specific alerts and dashboards.
How is this different from Argo, Crossplane, Backstage?
Those solve much larger problems and bring much heavier machinery. This toolkit is the small platform layer most teams need. The ArgoCD app-of-apps is an opt-in path, not a replacement for the imperative installer.
Can I run the installer twice?
Yes. Every chart is installed with helm upgrade --install and the ClusterIssuer is applied with kubectl apply, so the script is idempotent: re-running it converges on the same pinned versions and the same values.
How do I add a custom Grafana dashboard?
Drop a JSON file into manifests/grafana-dashboards/ and re-run scripts/load-dashboards.sh. The sidecar discovers ConfigMaps with the grafana_dashboard label and loads them on the next reconcile.
What about secrets management?
The chart supports inline env, individual secret-backed env, and whole-Secret envFrom mounting. Sealed Secrets or External Secrets Operator are documented patterns; neither is pinned by default because the right choice is team-specific.
Does autoscaling on CPU cover real-world Next.js?
For most workloads, yes. For request-bound services with long-tail latency, the wiki includes a pattern for HPA on requests-per-second sourced from the ServiceMonitor via Prometheus Adapter.
How are upgrades managed?
Upstream chart versions are pinned in scripts/install.sh and gitops/argocd. Bumping a version takes a single edit, a re-run of the installer (or an Argo sync), and a run of the e2e pytest suite to verify the chart still renders cleanly.
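Keeping the two sets of pins in step can be checked mechanically. The sketch below parses `NAME_VERSION=value` assignments out of installer text and compares them with versions taken from the GitOps side. The helper names are hypothetical and the repository's own parity test may be structured differently. The installer pins are the ones shown on this page; the GitOps values are invented, with one deliberate mismatch to show the report.

```python
import re

# Matches lines such as INGRESS_VERSION=4.11.3 (optionally quoted values).
PIN_PATTERN = re.compile(r'^([A-Z][A-Z0-9_]*)_VERSION="?([^"\s]+)"?\s*$', re.MULTILINE)


def pins_from_installer(script_text):
    """Return {NAME: version} for every NAME_VERSION=... assignment."""
    return {name: version for name, version in PIN_PATTERN.findall(script_text)}


def parity_mismatches(installer, gitops):
    """List human-readable differences between the two sets of pins."""
    problems = []
    for name in sorted(installer.keys() | gitops.keys()):
        left, right = installer.get(name), gitops.get(name)
        if left != right:
            problems.append(f"{name}: installer={left} gitops={right}")
    return problems


# Pins copied from the install script shown on this page.
INSTALLER_TEXT = """\
INGRESS_VERSION=4.11.3
CERT_MANAGER_VERSION=v1.15.3
KPS_VERSION=65.1.0
"""

# Hypothetical values standing in for each Application's spec.source.targetRevision.
# KPS is deliberately out of step to demonstrate a reported mismatch.
GITOPS_PINS = {"INGRESS": "4.11.3", "CERT_MANAGER": "v1.15.3", "KPS": "65.0.0"}

if __name__ == "__main__":
    for line in parity_mismatches(pins_from_installer(INSTALLER_TEXT), GITOPS_PINS):
        print(line)
```

Comparing the union of keys also catches a component that is pinned on one path and missing from the other.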
Stop yak-shaving the platform
Clone the repo, run the installer, deploy the chart. The same five commands every time, on every cluster.