5 min read · nodejs · production · devops

Running Node.js in production

A hands-on guide to process management, clustering, error boundaries, logging, graceful shutdowns, and the observability that separates a toy service from one you'd put your pager number on.

Rahul Gupta
Senior Software Engineer

Moving a Node.js app from a laptop into production is the single highest-leverage improvement most teams never finish — and the single biggest source of 3 a.m. pages when they skip it. This post walks through the patterns that actually matter, with code you can lift into your own service today.

If you’d rather jump around, the headings are anchored — bookmark the /blogs/nodejs-in-production URL with whichever section you’re in.

1. Process supervision

Node runs as a single process. That process will crash — the only question is whether something restarts it, and how fast.

Three options, from lightest to heaviest:

  1. pm2 — a Node-native process manager. Great for single-host deployments, zero-downtime reloads, and clustering out of the box.
  2. systemd — the right call on bare-metal or EC2 boxes where you don’t want yet another daemon.
  3. Container orchestration (Kubernetes / ECS / Nomad) — the default choice once you have more than two boxes.

A minimal pm2 config (ecosystem.config.js):

JavaScript
module.exports = {
  apps: [
    {
      name: "api",
      script: "./dist/server.js",
      instances: "max",       // one per CPU core
      exec_mode: "cluster",   // round-robin across workers
      max_memory_restart: "512M",
      env: {
        NODE_ENV: "production",
      },
    },
  ],
};

Run it with pm2 start ecosystem.config.js. To survive reboots, run pm2 startup once to generate an init script, then pm2 save to persist the current process list.
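For the systemd route (option 2 above), a minimal unit file covers the same ground. This is a sketch — the service name, user, and paths are placeholders for your own layout:

```ini
# /etc/systemd/system/api.service — adjust User, WorkingDirectory, ExecStart
[Unit]
Description=api
After=network.target

[Service]
Type=simple
User=app
WorkingDirectory=/srv/api
ExecStart=/usr/bin/node dist/server.js
Restart=always
RestartSec=2
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now api; Restart=always gives you the same crash-restart behaviour pm2 provides.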

Rule of thumb — whatever you pick, measure restart latency. A 4-second cold boot on a hot path is an SLO violation under any load you care about.

2. Scaling across cores with cluster

Node doesn’t use other cores unless you tell it to. The built-in cluster module forks workers and load-balances TCP connections between them:

TypeScript
import cluster from "node:cluster";
import os from "node:os";
import { startServer } from "./server.js";
 
if (cluster.isPrimary) {
  const workers = os.availableParallelism();
  for (let i = 0; i < workers; i++) cluster.fork();
  cluster.on("exit", (worker) => {
    console.warn(`worker ${worker.process.pid} died, respawning`);
    cluster.fork();
  });
} else {
  startServer();
}

Two traps to know:

  • Sticky sessions — if you’re using in-memory sessions, requests from the same user may land on different workers. Either move session state to Redis or use a sticky load balancer.
  • Shared state — anything in module-level variables is per-worker. Caches, counters, rate limiters — all of them duplicate.
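To make the second trap concrete: here's a sketch of a token-bucket rate limiter held in a module-level variable. Under cluster, every worker constructs its own copy, so an intended limit of 5 req/s quietly becomes 5 × workerCount across the fleet.

```javascript
// Token-bucket rate limiter held in module scope — duplicated per worker under cluster.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSec = refillPerSec;
    this.lastRefill = Date.now();
  }

  // Returns true if a request may proceed, false if it should be rejected.
  tryRemove() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Module-level: with 8 workers, the real fleet-wide limit is 8 × 5 = 40 req/s.
const limiter = new TokenBucket(5, 5);
```

Moving the counter into Redis (e.g. an INCR with a TTL per window) gives every worker one shared budget instead.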

3. Error boundaries that actually work

The Node runtime has two last-chance hooks before it hard-exits. Register them both, log aggressively, and do not pretend to recover — the process is in an undefined state by definition.

TypeScript
process.on("uncaughtException", (err) => {
  logger.fatal({ err }, "uncaughtException");
  setTimeout(() => process.exit(1), 200); // let the log flush
});
 
process.on("unhandledRejection", (reason) => {
  logger.fatal({ reason }, "unhandledRejection");
  setTimeout(() => process.exit(1), 200);
});

The supervisor (pm2 / k8s / systemd) will start a fresh process, and that’s fine. process.exit(1) after a short flush window is the only safe recovery strategy.

What about domain?

The domain module has been deprecated for years. Don’t use it. If you’re doing request-scoped error isolation, use AsyncLocalStorage instead.
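A minimal sketch of request-scoped context with AsyncLocalStorage — the store survives across await points, which is what makes it usable as a per-request boundary for logging and error handling. withRequestId and currentRequestId are names invented for this illustration:

```javascript
import { AsyncLocalStorage } from "node:async_hooks";

// One store for the whole process; each run() call gets an isolated context.
const requestContext = new AsyncLocalStorage();

// Hypothetical helper: run a handler with a request id attached to its context.
function withRequestId(id, fn) {
  return requestContext.run({ id }, fn);
}

// Anywhere inside the handler's async call tree, read the id back out.
function currentRequestId() {
  return requestContext.getStore()?.id;
}
```

Inside an error handler or a logger mixin you call currentRequestId() directly instead of threading the id through every function signature.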

4. Structured logging — not console.log

console.log writes strings. Your log aggregator wants JSON. Pick a real logger, pin a format, and stop splattering raw strings across stdout:

Logger    Throughput (logs/s)   Structured   Good for
pino      ~1,000,000            yes          High-throughput services
winston   ~80,000               yes          Legacy, plugin-rich ecosystem
bunyan    ~300,000              yes          Teams already on bunyan
debug     n/a                   no           Local dev only

A production pino setup looks like:

TypeScript
import pino from "pino";
 
export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  base: { service: "api", region: process.env.AWS_REGION },
  redact: {
    paths: ["*.authorization", "*.password", "*.token"],
    censor: "[REDACTED]",
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

Key things this buys you:

  • Every log line already has service and region set
  • Sensitive fields auto-redacted
  • ISO timestamps your aggregator understands without parsing
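If you're curious what the aggregator actually ingests, here's a dependency-free sketch that produces lines of roughly the shape pino emits. Illustration only — pino's real output also carries numeric levels, pid, and hostname:

```javascript
// Keys whose values must never reach the log stream.
const REDACTED_KEYS = new Set(["authorization", "password", "token"]);

// Recursively copy an object, censoring sensitive keys at any depth.
function redact(value) {
  if (value === null || typeof value !== "object" || Array.isArray(value)) return value;
  const out = {};
  for (const [k, v] of Object.entries(value)) {
    out[k] = REDACTED_KEYS.has(k) ? "[REDACTED]" : redact(v);
  }
  return out;
}

// One JSON line per event: level, ISO timestamp, static base fields, then payload.
function logLine(level, fields, msg) {
  return JSON.stringify({
    level,
    time: new Date().toISOString(),
    service: "api",
    ...redact(fields),
    msg,
  });
}
```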

5. Graceful shutdown

When Kubernetes sends SIGTERM, you have ~30 seconds to finish in-flight requests and close everything cleanly. Most services shed traffic the instant they receive SIGTERM and then hard-exit — silently dropping in-flight work. That’s a bug.

TypeScript
import { createServer } from "node:http";
import { once } from "node:events";
 
const server = createServer(handler);
server.listen(PORT);
 
async function shutdown(signal: NodeJS.Signals) {
  logger.info({ signal }, "shutdown begin");
  server.closeIdleConnections(); // kick idle keep-alive sockets (Node ≥ 18.2)
  server.close();                // stop accepting new connections
  await once(server, "close");   // resolves once in-flight requests drain
  await db.end();
  await queue.disconnect();
  logger.info("shutdown complete");
  process.exit(0);
}
 
process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));

Order matters:

  1. Stop accepting new connections (server.close())
  2. Drain in-flight requests (waiting for the close event)
  3. Release external resources (DB pool, queue, Redis)
  4. Exit

If step 2 is still blocked after ~25 s, exit anyway — Kubernetes will SIGKILL you at 30 s regardless, and a clean-ish shutdown beats a crash-looking one every time.
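One way to express that step-2 budget is a generic race-against-a-deadline helper — a sketch, with the 25 s figure taken from Kubernetes' default 30 s grace period:

```javascript
// Resolve with the wrapped promise, or reject once the deadline passes —
// whichever happens first. Caps how long a drain step may block shutdown.
async function withDeadline(promise, ms, label = "operation") {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, deadline]);
  } finally {
    clearTimeout(timer); // don't keep the process alive for a dead timer
  }
}

// Sketch of use inside shutdown():
//   await withDeadline(once(server, "close"), 25_000, "drain").catch(() => {});
```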

6. Observability, not just logs

Logs alone are search-only. You want four signals:

  • Metrics — request rate, latency percentiles, error rate, event-loop lag
  • Traces — per-request spans across services (OpenTelemetry)
  • Logs — structured, already covered
  • Profiles — continuous CPU/heap profiling via 0x or Datadog Continuous Profiler

Event-loop lag is the one metric most teams don’t collect and then get surprised by:

TypeScript
import { monitorEventLoopDelay } from "node:perf_hooks";
 
const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();
 
setInterval(() => {
  metrics.gauge("eventloop.lag.p99.ms", h.percentile(99) / 1e6);
  h.reset();
}, 5_000);

If the p99 lag creeps above ~50 ms, your Node process is CPU-bound — either move that work to a worker thread, offload it to a queue, or scale out.
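Moving that work onto a worker thread is less ceremony than it sounds. A sketch with node:worker_threads, using an inline eval worker purely so the example is self-contained — in a real service the worker would be its own file, likely behind a pool:

```javascript
import { Worker } from "node:worker_threads";
import { once } from "node:events";

// Run a deliberately CPU-bound job (sum 1..n) off the main thread so the
// event loop stays free to serve requests while it grinds.
function sumInWorker(n) {
  const worker = new Worker(
    `
    const { parentPort, workerData } = require("node:worker_threads");
    let total = 0;
    for (let i = 1; i <= workerData; i++) total += i;
    parentPort.postMessage(total);
    `,
    { eval: true, workerData: n }
  );
  // once() resolves with the event's arguments as an array.
  return once(worker, "message").then(([total]) => total);
}
```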

[Figure: event-loop lag over a typical deploy — flat is good, spikes are not.]

7. The production-readiness checklist

Before routing real traffic at a service, run this list. Every line is something I’ve been bitten by at least once:

  • Process supervisor restarts on crash
  • uncaughtException and unhandledRejection hooks wired
  • Structured logging with redaction
  • Graceful SIGTERM handling with a budget
  • Health endpoints: /healthz (liveness) and /readyz (readiness)
  • Metrics exported (Prometheus format or vendor SDK)
  • Request-level tracing
  • Event-loop lag emitted as a metric
  • Outbound HTTP calls timeboxed with AbortController
  • Dependency pinning via lockfile (package-lock.json / pnpm-lock.yaml)
  • CI runs npm audit and a type-check
  • Chaos drill: kill -9 a pod under load, watch the SLO

The last one is the one everyone skips. Don’t skip it — the drill is where you find the actual bugs.
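The AbortController item on the list is one line in modern Node: AbortSignal.timeout(ms) (Node ≥ 17.3) produces a signal that fetch, http requests, and your own code can all honour. A sketch against a home-made abortable operation, since hitting the network in an example helps nobody:

```javascript
// A delay that honours an AbortSignal — stands in for any abortable operation.
function abortableDelay(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(signal.reason);
    const timer = setTimeout(resolve, ms);
    signal?.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(signal.reason); // a TimeoutError when the signal came from AbortSignal.timeout
      },
      { once: true }
    );
  });
}

// The same signal timeboxes a real outbound call:
//   await fetch(url, { signal: AbortSignal.timeout(2_000) });
```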

8. Quick-reference cheat sheet

Some things worth committing to muscle memory:

Want to…                         Tool / pattern
Restart on crash                 pm2 / systemd / container orchestrator
Use all CPU cores                cluster or pm2 -i max
Ship structured logs             pino + log aggregator
Trace requests across services   OpenTelemetry SDK
Measure event-loop lag           perf_hooks.monitorEventLoopDelay
Hot-reload during dev            node --watch (Node ≥ 18.11)
Debug a running process          kill -SIGUSR1 <pid> + Chrome devtools

To attach Chrome devtools to a running process: send SIGUSR1 to it, then open chrome://inspect. This has worked on every Node version since 8.


Further reading

These are the three documents I actually re-read every few months:

  1. Node.js docs — Async context tracking 
  2. Google SRE Book — SLOs 
  3. OpenTelemetry for Node.js 

If you’ve got a pattern from your own production stack that belongs on the checklist, I’d love to hear it — drop me a note via the links on the home page.
