5 min read · nodejs · production · devops

Running Node.js in production

A hands-on guide to process management, clustering, error boundaries, logging, graceful shutdowns, and the observability that separates a toy service from one you'd put your pager number on.

Rahul Gupta
Senior Software Engineer

Moving a Node.js app from a laptop into production is the single highest-leverage improvement most teams never finish — and the single biggest source of 3 a.m. pages when they skip it. This post walks through the patterns that actually matter, with code you can lift into your own service today.

If you’d rather jump around, the headings are anchored — bookmark the /blogs/nodejs-in-production URL with whichever section you’re in.

1. Process supervision

Node runs as a single process. That process will crash — the only question is whether something restarts it, and how fast.

Three options, from lightest to heaviest:

  1. pm2 — a Node-native process manager. Great for single-host deployments, zero-downtime reloads, and clustering out of the box.
  2. systemd — the right call on bare-metal or EC2 boxes where you don’t want yet another daemon.
  3. Container orchestration (Kubernetes / ECS / Nomad) — the default choice once you have more than two boxes.

A minimal pm2 config (ecosystem.config.js):

JavaScript
module.exports = {
  apps: [
    {
      name: "api",
      script: "./dist/server.js",
      instances: "max",       // one per CPU core
      exec_mode: "cluster",   // round-robin across workers
      max_memory_restart: "512M",
      env: {
        NODE_ENV: "production",
      },
    },
  ],
};

Run it with pm2 start ecosystem.config.js. To survive reboots, run pm2 startup once to generate an init script, then pm2 save to persist the current process list.
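For the systemd route (option 2 above), a minimal unit file covers the same ground. This is a sketch — the service name, user, and paths are placeholders for your own layout:

```ini
# /etc/systemd/system/api.service — adjust User, WorkingDirectory, ExecStart
[Unit]
Description=api
After=network.target

[Service]
Type=simple
User=app
WorkingDirectory=/srv/api
ExecStart=/usr/bin/node dist/server.js
Restart=always
RestartSec=2
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now api; Restart=always gives you the same crash-restart behaviour pm2 provides.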

Rule of thumb — whatever you pick, measure restart latency. A 4-second cold boot on a hot path is an SLO violation under any load you care about.

2. Scaling across cores with cluster

Node doesn’t use other cores unless you tell it to. The built-in cluster module forks workers and load-balances TCP connections between them:

TypeScript
import cluster from "node:cluster";
import os from "node:os";
import { startServer } from "./server.js";
 
if (cluster.isPrimary) {
  const workers = os.availableParallelism();
  for (let i = 0; i < workers; i++) cluster.fork();
  cluster.on("exit", (worker) => {
    console.warn(`worker ${worker.process.pid} died, respawning`);
    cluster.fork();
  });
} else {
  startServer();
}

Two traps to know:

  • Sticky sessions — if you’re using in-memory sessions, requests from the same user may land on different workers. Either move session state to Redis or use a sticky load balancer.
  • Shared state — anything in module-level variables is per-worker. Caches, counters, rate limiters — all of them duplicate.
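To make the second trap concrete: here's a sketch of a token-bucket rate limiter held in a module-level variable. Under cluster, every worker constructs its own copy, so an intended limit of 5 req/s quietly becomes 5 × workerCount across the fleet.

```javascript
// Token-bucket rate limiter held in module scope — duplicated per worker under cluster.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSec = refillPerSec;
    this.lastRefill = Date.now();
  }

  // Returns true if a request may proceed, false if it should be rejected.
  tryRemove() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Module-level: with 8 workers, the real fleet-wide limit is 8 × 5 = 40 req/s.
const limiter = new TokenBucket(5, 5);
```

Moving the counter into Redis (e.g. an INCR with a TTL per window) gives every worker one shared budget instead.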

3. Error boundaries that actually work

The Node runtime has two last-chance hooks before it hard-exits. Register them both, log aggressively, and do not pretend to recover — the process is in an undefined state by definition.

TypeScript
process.on("uncaughtException", (err) => {
  logger.fatal({ err }, "uncaughtException");
  setTimeout(() => process.exit(1), 200); // let the log flush
});
 
process.on("unhandledRejection", (reason) => {
  logger.fatal({ reason }, "unhandledRejection");
  setTimeout(() => process.exit(1), 200);
});

The supervisor (pm2 / k8s / systemd) will start a fresh process, and that’s fine. process.exit(1) after a short flush window is the only safe recovery strategy.

What about domain?

The domain module has been deprecated for years. Don’t use it. If you’re doing request-scoped error isolation, use AsyncLocalStorage instead.
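A minimal sketch of request-scoped context with AsyncLocalStorage — the store survives across await points, which is what makes it usable as a per-request boundary for logging and error handling. withRequestId and currentRequestId are names invented for this illustration:

```javascript
import { AsyncLocalStorage } from "node:async_hooks";

// One store for the whole process; each run() call gets an isolated context.
const requestContext = new AsyncLocalStorage();

// Hypothetical helper: run a handler with a request id attached to its context.
function withRequestId(id, fn) {
  return requestContext.run({ id }, fn);
}

// Anywhere inside the handler's async call tree, read the id back out.
function currentRequestId() {
  return requestContext.getStore()?.id;
}
```

Inside an error handler or a logger mixin you call currentRequestId() directly instead of threading the id through every function signature.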

4. Structured logging — not console.log

console.log writes strings. Your log aggregator wants JSON. Pick a real logger, pin a format, and stop splattering raw strings across stdout:

Logger    Throughput (logs/s)   Structured   Good for
pino      ~1,000,000            yes          High-throughput services
winston   ~80,000               yes          Legacy, plugin-rich ecosystem
bunyan    ~300,000              yes          Teams already on bunyan
debug     n/a                   no           Local dev only

A production pino setup looks like:

TypeScript
import pino from "pino";
 
export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  base: { service: "api", region: process.env.AWS_REGION },
  redact: {
    paths: ["*.authorization", "*.password", "*.token"],
    censor: "[REDACTED]",
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

Key things this buys you:

  • Every log line already has service and region set
  • Sensitive fields auto-redacted
  • ISO timestamps your aggregator understands without parsing
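If you're curious what the aggregator actually ingests, here's a dependency-free sketch that produces lines of roughly the shape pino emits. Illustration only — pino's real output also carries numeric levels, pid, and hostname:

```javascript
// Keys whose values must never reach the log stream.
const REDACTED_KEYS = new Set(["authorization", "password", "token"]);

// Recursively copy an object, censoring sensitive keys at any depth.
function redact(value) {
  if (value === null || typeof value !== "object" || Array.isArray(value)) return value;
  const out = {};
  for (const [k, v] of Object.entries(value)) {
    out[k] = REDACTED_KEYS.has(k) ? "[REDACTED]" : redact(v);
  }
  return out;
}

// One JSON line per event: level, ISO timestamp, static base fields, then payload.
function logLine(level, fields, msg) {
  return JSON.stringify({
    level,
    time: new Date().toISOString(),
    service: "api",
    ...redact(fields),
    msg,
  });
}
```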

5. Graceful shutdown

When Kubernetes sends SIGTERM, you have ~30 seconds to finish in-flight requests and close everything cleanly. Most services shed traffic the instant they receive SIGTERM and then hard-exit — silently dropping in-flight work. That’s a bug.

TypeScript
import { createServer } from "node:http";
import { once } from "node:events";
 
const server = createServer(handler);
server.listen(PORT);
 
async function shutdown(signal: NodeJS.Signals) {
  logger.info({ signal }, "shutdown begin");
  server.closeIdleConnections(); // kick idle keep-alive sockets (Node ≥ 18.2)
  server.close();                // stop accepting new connections
  await once(server, "close");   // resolves once in-flight requests drain
  await db.end();
  await queue.disconnect();
  logger.info("shutdown complete");
  process.exit(0);
}
 
process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));

Order matters:

  1. Stop accepting new connections (server.close())
  2. Drain in-flight requests (waiting for the close event)
  3. Release external resources (DB pool, queue, Redis)
  4. Exit

If step 2 is still blocked after ~25 s, exit anyway — Kubernetes will SIGKILL you at 30 s regardless, and a clean-ish shutdown beats a crash-looking one every time.
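One way to express that step-2 budget is a generic race-against-a-deadline helper — a sketch, with the 25 s figure taken from Kubernetes' default 30 s grace period:

```javascript
// Resolve with the wrapped promise, or reject once the deadline passes —
// whichever happens first. Caps how long a drain step may block shutdown.
async function withDeadline(promise, ms, label = "operation") {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, deadline]);
  } finally {
    clearTimeout(timer); // don't keep the process alive for a dead timer
  }
}

// Sketch of use inside shutdown():
//   await withDeadline(once(server, "close"), 25_000, "drain").catch(() => {});
```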

6. Observability, not just logs

Logs alone are search-only. You want four signals:

  • Metrics — request rate, latency percentiles, error rate, event-loop lag
  • Traces — per-request spans across services (OpenTelemetry)
  • Logs — structured, already covered
  • Profiles — continuous CPU/heap profiling via 0x or Datadog Continuous Profiler

Event-loop lag is the one metric most teams don’t collect and then get surprised by:

TypeScript
import { monitorEventLoopDelay } from "node:perf_hooks";
 
const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();
 
setInterval(() => {
  metrics.gauge("eventloop.lag.p99.ms", h.percentile(99) / 1e6);
  h.reset();
}, 5_000);

If the p99 lag creeps above ~50 ms, your Node process is CPU-bound — either move that work to a worker thread, offload it to a queue, or scale out.
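Moving that work onto a worker thread is less ceremony than it sounds. A sketch with node:worker_threads, using an inline eval worker purely so the example is self-contained — in a real service the worker would be its own file, likely behind a pool:

```javascript
import { Worker } from "node:worker_threads";
import { once } from "node:events";

// Run a deliberately CPU-bound job (sum 1..n) off the main thread so the
// event loop stays free to serve requests while it grinds.
function sumInWorker(n) {
  const worker = new Worker(
    `
    const { parentPort, workerData } = require("node:worker_threads");
    let total = 0;
    for (let i = 1; i <= workerData; i++) total += i;
    parentPort.postMessage(total);
    `,
    { eval: true, workerData: n }
  );
  // once() resolves with the event's arguments as an array.
  return once(worker, "message").then(([total]) => total);
}
```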

[Figure: event-loop lag over a typical deploy — flat is good, spikes are not.]

7. The production-readiness checklist

Before routing real traffic at a service, run this list. Every line is something I’ve been bitten by at least once:

  • Process supervisor restarts on crash
  • uncaughtException and unhandledRejection hooks wired
  • Structured logging with redaction
  • Graceful SIGTERM handling with a budget
  • Health endpoints: /healthz (liveness) and /readyz (readiness)
  • Metrics exported (Prometheus format or vendor SDK)
  • Request-level tracing
  • Event-loop lag emitted as a metric
  • Outbound HTTP calls timeboxed with AbortController
  • Dependency pinning via lockfile (package-lock.json / pnpm-lock.yaml)
  • CI runs npm audit and a type-check
  • Chaos drill: kill -9 a pod under load, watch the SLO

The last one is the one everyone skips. Don’t skip it — the drill is where you find the actual bugs.
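The AbortController item on the list is one line in modern Node: AbortSignal.timeout(ms) (Node ≥ 17.3) produces a signal that fetch, http requests, and your own code can all honour. A sketch against a home-made abortable operation, since hitting the network in an example helps nobody:

```javascript
// A delay that honours an AbortSignal — stands in for any abortable operation.
function abortableDelay(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(signal.reason);
    const timer = setTimeout(resolve, ms);
    signal?.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(signal.reason); // a TimeoutError when the signal came from AbortSignal.timeout
      },
      { once: true }
    );
  });
}

// The same signal timeboxes a real outbound call:
//   await fetch(url, { signal: AbortSignal.timeout(2_000) });
```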

8. Quick-reference cheat sheet

Some things worth committing to muscle memory:

Want to…                         Tool / pattern
Restart on crash                 pm2 / systemd / container orchestrator
Use all CPU cores                cluster or pm2 -i max
Ship structured logs             pino + log aggregator
Trace requests across services   OpenTelemetry SDK
Measure event-loop lag           perf_hooks.monitorEventLoopDelay
Hot-reload during dev            node --watch (Node ≥ 18.11)
Debug a running process          kill -SIGUSR1 <pid> + Chrome devtools

To attach Chrome devtools to a running process: send SIGUSR1 to it, then open chrome://inspect. This has worked on every Node version since 8.


Further reading

These are the three documents I actually re-read every few months:

  1. Node.js docs — Async context tracking 
  2. Google SRE Book — SLOs 
  3. OpenTelemetry for Node.js 

If you’ve got a pattern from your own production stack that belongs on the checklist, I’d love to hear it — drop me a note via the links on the home page.
