Moving a Node.js app from a laptop into production is the single highest-leverage improvement most teams never finish — and the single biggest source of 3 a.m. pages when they skip it. This post walks through the patterns that actually matter, with code you can lift into your own service today.
If you’d rather jump around, the headings are anchored — bookmark /blogs/nodejs-in-production plus the anchor of whichever section you’re in.
1. Process supervision
Node runs as a single process. That process will crash — the only question is whether something restarts it, and how fast.
Three options, from lightest to heaviest:
- `pm2` — a Node-native process manager. Great for single-host deployments, zero-downtime reloads, and clustering out of the box.
- `systemd` — the right call on bare-metal or EC2 boxes where you don’t want yet another daemon.
- Container orchestration (Kubernetes / ECS / Nomad) — the default choice once you have more than two boxes.
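For the systemd route, a unit file along these lines is enough — the paths, user, and service name are illustrative, not from this post:

```ini
[Unit]
Description=api
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/bin/node /srv/api/dist/server.js
Restart=always
RestartSec=1
Environment=NODE_ENV=production
User=app
# systemd sends SIGTERM on stop, which pairs with a graceful-shutdown handler
KillSignal=SIGTERM
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target
```

`Restart=always` plus `RestartSec=1` is the whole supervision story; everything else is hygiene.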
A minimal pm2 config (`ecosystem.config.js`):
```js
module.exports = {
  apps: [
    {
      name: "api",
      script: "./dist/server.js",
      instances: "max",     // one per CPU core
      exec_mode: "cluster", // round-robin across workers
      max_memory_restart: "512M",
      env: {
        NODE_ENV: "production",
      },
    },
  ],
};
```

Run it with `pm2 start ecosystem.config.js` and it stays up across reboots once you wire `pm2 startup` into init.
Rule of thumb — whatever you pick, measure restart latency. A 4-second cold boot on a hot path is an SLO violation under any load you care about.
2. Scaling across cores with cluster
Node doesn’t use other cores unless you tell it to. The built-in `cluster` module forks workers and load-balances TCP connections between them:
```js
import cluster from "node:cluster";
import os from "node:os";
import { startServer } from "./server.js";

if (cluster.isPrimary) {
  const workers = os.availableParallelism();
  for (let i = 0; i < workers; i++) cluster.fork();

  cluster.on("exit", (worker) => {
    console.warn(`worker ${worker.process.pid} died, respawning`);
    cluster.fork();
  });
} else {
  startServer();
}
```

Two traps to know:
- Sticky sessions — if you’re using in-memory sessions, requests from the same user may land on different workers. Either move session state to Redis or use a sticky load balancer.
- Shared state — anything in module-level variables is per-worker. Caches, counters, rate limiters — all of them duplicate.
3. Error boundaries that actually work
The Node runtime has two last-chance hooks before it hard-exits. Register them both, log aggressively, and do not pretend to recover — the process is in an undefined state by definition.
```js
process.on("uncaughtException", (err) => {
  logger.fatal({ err }, "uncaughtException");
  setTimeout(() => process.exit(1), 200); // let the log flush
});

process.on("unhandledRejection", (reason) => {
  logger.fatal({ reason }, "unhandledRejection");
  setTimeout(() => process.exit(1), 200);
});
```

The supervisor (pm2 / k8s / systemd) will start a fresh process, and that’s fine. `process.exit(1)` after a short flush window is the only safe recovery strategy.
What about `domain`?
The `domain` module has been deprecated for years. Don’t use it. If you’re doing request-scoped error isolation, use `AsyncLocalStorage` instead.
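As a sketch of that pattern — the helper names and the request id are mine, not from any particular framework:

```js
import { AsyncLocalStorage } from "node:async_hooks";

// One store per process; each request runs inside its own context.
const requestContext = new AsyncLocalStorage();

function withRequestId(id, fn) {
  // Everything fn awaits, however deep, sees the same store.
  return requestContext.run({ requestId: id }, fn);
}

async function deepInTheStack() {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return requestContext.getStore()?.requestId;
}

withRequestId("req-42", async () => {
  console.log(await deepInTheStack()); // logs "req-42", even across the await
});
```

Because the store survives `await` boundaries, error handlers and log lines can stamp the request id without threading it through every function signature.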
4. Structured logging — not `console.log`
`console.log` writes strings. Your log aggregator wants JSON. Pick a real logger, pin a format, and stop splattering raw strings across stdout:
| Logger | Throughput (logs/s) | Structured | Good for |
|---|---|---|---|
| `pino` | ~1,000,000 | ✅ | High-throughput services |
| `winston` | ~80,000 | ✅ | Legacy, plugin-rich ecosystem |
| `bunyan` | ~300,000 | ✅ | Teams already on bunyan |
| `debug` | n/a | ❌ | Local dev only |
A production pino setup looks like:
```js
import pino from "pino";

export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  base: { service: "api", region: process.env.AWS_REGION },
  redact: {
    paths: ["*.authorization", "*.password", "*.token"],
    censor: "[REDACTED]",
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});
```

Key things this buys you:

- Every log line already has `service` and `region` set
- Sensitive fields auto-redacted
- ISO timestamps your aggregator understands without parsing
5. Graceful shutdown
When Kubernetes sends SIGTERM, you have ~30 seconds to finish in-flight requests and close everything cleanly. Most services shed traffic the instant they receive SIGTERM and then hard-exit — silently dropping in-flight work. That’s a bug.
```js
import { createServer } from "node:http";
import { once } from "node:events";

const server = createServer(handler);
server.listen(PORT);

async function shutdown(signal) {
  logger.info({ signal }, "shutdown begin");
  server.closeIdleConnections();
  server.close((e) => e && logger.error({ err: e }));
  await once(server, "close"); // resolves once the last in-flight request drains
  await db.end();
  await queue.disconnect();
  logger.info("shutdown complete");
  process.exit(0);
}

process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));
```

Order matters:

1. Stop accepting new connections (`server.close()`)
2. Drain in-flight requests (waiting for the `close` event)
3. Release external resources (DB pool, queue, Redis)
4. Exit
If step 2 is still blocked after ~25 s, exit anyway — Kubernetes will SIGKILL you at 30 s regardless, and a clean-ish shutdown beats a crash-looking one every time.
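That escape hatch can be a one-liner armed at the top of `shutdown()`. A sketch, with the 25 s budget as an assumption to tune against your orchestrator’s grace period:

```js
// Force-exit if draining stalls past the budget. The timer is unref'd so it
// never keeps an otherwise-finished process alive on its own.
function armShutdownBudget(ms = 25_000) {
  const timer = setTimeout(() => {
    console.error("shutdown budget exceeded, exiting anyway");
    process.exit(1);
  }, ms);
  timer.unref();
  return timer;
}
```

Call `armShutdownBudget()` first thing inside `shutdown()`; if draining completes in time, the process exits cleanly before the timer ever fires.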
6. Observability, not just logs
Logs alone are search-only. You want four signals:

- Metrics — request rate, latency percentiles, error rate, event-loop lag
- Traces — per-request spans across services (OpenTelemetry)
- Logs — structured, already covered
- Profiles — continuous CPU/heap profiling via `0x` or Datadog Continuous Profiler
Event-loop lag is the one metric most teams don’t collect and then get surprised by:
```js
import { monitorEventLoopDelay } from "node:perf_hooks";

const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();

setInterval(() => {
  metrics.gauge("eventloop.lag.p99.ms", h.percentile(99) / 1e6); // ns → ms
  h.reset();
}, 5_000);
```

If the p99 lag creeps above ~50 ms, your Node process is CPU-bound — either move that work to a worker thread, offload it to a queue, or scale out.
Sample lag chart — flat is good, spikes are not.
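“Move that work to a worker thread” can be as small as this sketch — the inline Fibonacci worker is a stand-in for your real CPU-bound job:

```js
import { Worker } from "node:worker_threads";

// Run a CPU-bound job off the main thread so the event loop stays responsive.
// The worker source is inlined via `eval: true` purely to keep the sketch
// self-contained; in a real service it would live in its own file.
function fibOnWorker(n) {
  const src = `
    const { parentPort, workerData } = require("node:worker_threads");
    const fib = (k) => (k < 2 ? k : fib(k - 1) + fib(k - 2));
    parentPort.postMessage(fib(workerData));
  `;
  return new Promise((resolve, reject) => {
    const worker = new Worker(src, { eval: true, workerData: n });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}

fibOnWorker(30).then((v) => console.log(v)); // event loop stays free meanwhile
```

For steady streams of jobs, prefer a pooled approach over spawning a worker per call — thread startup isn’t free.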
7. The production-readiness checklist
Before routing real traffic at a service, run this list. Every line is something I’ve been bitten by at least once:
- Process supervisor restarts on crash
- `uncaughtException` and `unhandledRejection` hooks wired
- Structured logging with redaction
- Graceful `SIGTERM` handling with a budget
- Health endpoints: `/healthz` (liveness) and `/readyz` (readiness)
- Metrics exported (Prometheus format or vendor SDK)
- Request-level tracing
- Event-loop lag emitted as a metric
- Outbound HTTP calls timeboxed with `AbortController`
- Dependency pinning via lockfile (`package-lock.json` / `pnpm-lock.yaml`)
- CI runs `npm audit` and a type-check
- Chaos drill: `kill -9` a pod under load, watch the SLO
The last one is the one everyone skips. Don’t skip it — the drill is where you find the actual bugs.
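For the health-endpoint line item, the useful distinction is liveness (“the process is up”) versus readiness (“its dependencies are up”). A minimal sketch using the route names from the checklist, with the routing pulled into a pure function — my structure, chosen for testability:

```js
import { createServer } from "node:http";

let ready = false; // flip to true once DB / queue connections are established

// Pure routing logic: easy to unit-test without binding a port.
function probeStatus(url, isReady) {
  if (url === "/healthz") return 200;                // liveness
  if (url === "/readyz") return isReady ? 200 : 503; // readiness
  return 404;
}

const probes = createServer((req, res) => {
  res.writeHead(probeStatus(req.url, ready)).end();
});

probes.listen(0); // 0 = any free port; in production you'd pin a dedicated probe port
```

Set `ready = true` after startup finishes, and flip it back to `false` at the top of shutdown so the load balancer stops routing to you before you start draining.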
8. Quick-reference cheat sheet
Some things worth committing to muscle memory:
| Want to… | Tool / pattern |
|---|---|
| Restart on crash | pm2 / systemd / container orchestrator |
| Use all CPU cores | `cluster` or `pm2 -i max` |
| Ship structured logs | pino + log aggregator |
| Trace requests across services | OpenTelemetry SDK |
| Measure event-loop lag | `perf_hooks.monitorEventLoopDelay` |
| Hot-reload during dev | `node --watch` (Node ≥ 18.11) |
| Debug a running process | `kill -SIGUSR1 <pid>` + Chrome devtools |
To attach Chrome devtools to a running process: send `SIGUSR1` to it, then open chrome://inspect. This works on every Node version since 8.
Further reading
These are the three documents I actually re-read every few months:
If you’ve got a pattern from your own production stack that belongs on the checklist, I’d love to hear it — drop me a note via the links on the home page.