7 min read · nodejs · performance · scalability

How to Handle 1M+ Requests/Second in Node.js

A practical guide to pushing Node.js beyond the usual comfort zone with load shedding, cluster-aware architecture, kernel tuning, proxy offload, and the operational discipline required to survive seven-figure request rates.

Rahul Gupta
Senior Software Engineer

Hitting 1M+ requests per second in Node.js is absolutely possible. Hitting it with predictable latency, graceful degradation, and an architecture your team can still operate six months later is the part that matters.

The first thing to get clear: a single Node.js process is not doing 1M RPS on its own. A serious setup reaches that number through horizontal concurrency, ruthless request-path simplification, aggressive offload to the edge, and disciplined backpressure. Node can own the control plane and still sit on the hot path for the data plane, but only if you stop treating it like a monolith.

1. Start with the real unit of scale

When teams say “our Node service should do 1M RPS”, they usually mean one of three different things:

  1. A single process should do 1M RPS. That is unrealistic for almost every non-trivial workload.
  2. A single host should do 1M RPS. That is sometimes possible for extremely small responses on tuned hardware.
  3. A fleet should sustain 1M RPS. That is the target most teams actually need.

For planning purposes, think in this model:

Text
global_rps = edge_cache_hits
           + load_balancer_terminated_fast_paths
           + node_fleet_handled_requests

If you need a million requests every second, you should expect a large chunk of those requests to die early:

  • at the CDN
  • at the L4/L7 proxy
  • in a local memory cache
  • in a rate limiter before application logic runs

If every request reaches your database or even your full Express/Fastify middleware chain, the architecture is already wrong.
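A rate limiter in front of application logic does not need to be fancy. As a minimal sketch, an in-process token bucket (with illustrative capacity numbers, not recommendations) is enough to refuse excess work before any routing or parsing happens:

```typescript
// Minimal token-bucket limiter: refuse work before application logic runs.
// Capacity and refill rate below are illustrative values.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryTake(now = Date.now()): boolean {
    // refill proportionally to elapsed time, capped at capacity
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

const bucket = new TokenBucket(1000, 500); // burst of 1000, 500 req/s sustained
```

An in-process bucket only protects a single worker; at the edge you would back the same idea with a shared store keyed per client.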

2. Pick the right Node stack

At this scale, framework overhead matters. You do not want a deep middleware stack with dynamic object creation on every request.

My rule of thumb:

  • Use Fastify over Express for hot-path APIs
  • Prefer bare node:http or uWebSockets.js for ultra-thin edge-style handlers
  • Keep JSON schemas precompiled instead of validating ad hoc per request
  • Avoid runtime-heavy abstractions in the request path

A thin Fastify route is often enough:

TypeScript
import Fastify from "fastify";
 
const app = Fastify({
  logger: false,
  disableRequestLogging: true,
});
 
app.get("/healthz", async () => ({ ok: true }));
 
// `cache` is assumed to be an already-initialized client (an in-process map
// or a Redis wrapper) exposing an async get(id) -> value | undefined.
app.get("/lookup/:id", {
  schema: {
    params: {
      type: "object",
      properties: { id: { type: "string" } },
      required: ["id"],
    },
  },
}, async (request, reply) => {
  const data = await cache.get(request.params.id);
  if (!data) return reply.code(404).send({ error: "not_found" });
  return data;
});
 
await app.listen({ port: 3000, host: "0.0.0.0" });

That is already materially better than a generic stack with ten middlewares, string-based logging, and ORM work on every request.

3. Use every core, but isolate responsibilities

Node runs a single event loop per process. That means one HTTP worker per core for network handling, and separate execution domains for CPU-heavy work.

Use cluster or a process manager that gives you equivalent behavior:

TypeScript
import cluster from "node:cluster";
import os from "node:os";
 
if (cluster.isPrimary) {
  const workers = os.availableParallelism();
 
  for (let i = 0; i < workers; i++) {
    cluster.fork();
  }
 
  cluster.on("exit", (worker) => {
    console.error(`worker ${worker.process.pid} exited`);
    cluster.fork();
  });
} else {
  // startHttpServer() is your app bootstrap, e.g. the Fastify setup above
  await startHttpServer();
}

But don’t stop there. Separate workloads by type:

  • HTTP workers handle network I/O only
  • Worker threads handle CPU-bound transforms, compression, or crypto if unavoidable
  • Queues handle async work that does not belong in the request lifecycle

The easiest way to kill throughput is to mix low-latency request serving with expensive JSON reshaping, PDF generation, image work, or synchronous crypto in the same event loop.
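As a sketch of that separation, CPU-bound work can be pushed to a worker thread while the HTTP worker keeps serving I/O. The inline worker source and the hashing loop below are stand-ins for real transform code; in production you would point `Worker` at a separate file and run a small pool:

```typescript
import { Worker } from "node:worker_threads";

// Offload a CPU-bound transform to a worker thread so the HTTP event loop
// stays free for I/O. For simplicity this handles one request at a time.
const workerSource = `
  const { parentPort } = require("node:worker_threads");
  parentPort.on("message", (input) => {
    // stand-in for real CPU work (hashing, compression, transforms)
    let h = 0;
    for (let i = 0; i < input.length; i++) h = (h * 31 + input.charCodeAt(i)) | 0;
    parentPort.postMessage(h);
  });
`;

const cpuWorker = new Worker(workerSource, { eval: true });

function runCpuHeavy(input: string): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    cpuWorker.once("message", resolve);
    cpuWorker.once("error", reject);
    cpuWorker.postMessage(input);
  });
}
```

A real service would dispatch across a pool of such workers rather than a single shared one, but the shape is the same: post a message, await the result, never block the request loop.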

4. Make the hot path boring

The request path that sustains very high RPS is usually offensively simple:

  1. Accept request
  2. Parse minimal headers
  3. Authenticate cheaply
  4. Read from local or remote cache
  5. Serialize fixed-shape response
  6. Return

Everything else should be pushed away from the hot path.

Good signs:

  • Small payloads
  • Stable response shapes
  • Cacheable reads
  • Precomputed aggregates
  • No synchronous filesystem access
  • No per-request regex-heavy parsing

Bad signs:

  • ORMs doing lazy-loading under load
  • request-scoped dependency injection containers
  • dynamic feature flag evaluation from remote systems
  • constructing giant objects and deleting fields before response

At 1M+ RPS, “just one extra millisecond” is not a rounding error: it is a full second of aggregate compute for every thousand requests, which at a million requests per second means a thousand CPU-seconds of extra work every second.

5. Cache like you mean it

You do not scale to seven-figure RPS by making your origin faster. You scale there by ensuring the origin sees far fewer requests than users generate.

Think in layers:

  • CDN cache for public and semi-public content
  • reverse proxy cache for short-lived internal API responses
  • local in-process cache for tiny, immutable, frequently-read reference data
  • Redis or KeyDB for shared cacheable reads and counters

One useful pattern is stale-while-revalidate at the proxy or application layer:

TypeScript
type CacheEntry<T> = {
  value: T;
  expiresAt: number;  // fresh until this timestamp (ms)
  staleUntil: number; // past expiresAt, still servable while revalidating
};
 
function readWithStale<T>(entry: CacheEntry<T> | null, now = Date.now()) {
  if (!entry) return { hit: false, stale: false };
  if (now <= entry.expiresAt) return { hit: true, stale: false, value: entry.value };
  if (now <= entry.staleUntil) return { hit: true, stale: true, value: entry.value };
  return { hit: false, stale: false };
}

That pattern matters because a cache miss storm is one of the fastest ways to melt your entire fleet.
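One defense against miss storms is request coalescing (“single flight”): when many concurrent requests miss the same key, only the first one goes to the origin and the rest await the same promise. A minimal sketch, with `loadFromOrigin` standing in for your real backend read:

```typescript
// One in-flight origin read per key; concurrent misses share the promise.
const inFlight = new Map<string, Promise<unknown>>();

async function coalescedGet<T>(
  key: string,
  loadFromOrigin: (key: string) => Promise<T>,
): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  // register the load before anyone else can miss on the same key,
  // and always clean up so later misses can reload
  const p = loadFromOrigin(key).finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

Combined with stale-while-revalidate, this turns a thousand simultaneous misses on a hot key into one origin request instead of a thousand.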

6. Treat backpressure as a feature

A system capable of 1M RPS does not try to serve every request equally. It decides what to drop, what to delay, and what to protect.

That means:

  • rate limits at the edge
  • bounded queues
  • connection caps
  • fast 429 and 503 responses
  • circuit breakers around dependencies

If Redis or your database begins timing out, the Node layer should degrade immediately instead of waiting for a full pile-up.

A simple adaptive shedder can be enough:

TypeScript
// getEventLoopP99Ms and getCpuUtilization are placeholders for your own
// metrics sources (e.g. monitorEventLoopDelay and process.cpuUsage sampling).
let rejectNewRequests = false;
 
setInterval(() => {
  const lag = getEventLoopP99Ms();
  const cpu = getCpuUtilization();
  rejectNewRequests = lag > 40 || cpu > 0.85;
}, 1000);
 
app.addHook("onRequest", async (_request, reply) => {
  if (!rejectNewRequests) return;
  // replying from an onRequest hook and returning reply short-circuits
  // the rest of the request lifecycle
  return reply.code(503).send({ error: "overloaded_try_again" });
});

That looks blunt, and it is. Under real overload, blunt is often better than elegant.
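The same philosophy applies to the circuit breakers mentioned above: when a dependency keeps failing, fail fast instead of queueing behind timeouts. A minimal consecutive-failure breaker, with illustrative thresholds:

```typescript
// After maxFailures consecutive failures, the breaker opens and calls fail
// fast for cooldownMs instead of piling up on a sick dependency.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures: number, private cooldownMs: number) {}

  async call<T>(fn: () => Promise<T>, now = Date.now()): Promise<T> {
    if (this.failures >= this.maxFailures && now - this.openedAt < this.cooldownMs) {
      throw new Error("circuit_open"); // fail fast, no dependency call
    }
    try {
      const result = await fn();
      this.failures = 0; // any success closes the breaker
      return result;
    } catch (err) {
      this.failures++;
      this.openedAt = now;
      throw err;
    }
  }
}
```

After the cooldown elapses, the next call acts as a half-open probe: one real request goes through, and its outcome decides whether the breaker closes or re-opens.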

7. Offload compression, TLS, and connection churn

Do not make Node handle work your proxy is better at.

The high-throughput layout usually looks like:

Text
client
  -> CDN
  -> edge/load balancer
  -> NGINX / Envoy / HAProxy
  -> Node workers
  -> cache / queue / database

Let the fronting proxy absorb:

  • TLS termination
  • HTTP/2 or HTTP/3 session handling
  • gzip/brotli compression
  • request buffering where appropriate
  • connection pooling to the application tier
  • coarse rate limiting

Node should spend its budget on application semantics, not certificate handshakes and compression bookkeeping.

8. Fix your data layer before tuning Node

Most high-scale Node bottlenecks are not actually in Node.

They are:

  • a database doing point reads without the right index
  • a cache with low hit ratio
  • cross-region latency
  • noisy-neighbor effects in Kubernetes
  • oversized payloads
  • a chatty service-to-service graph

If every request triggers a database roundtrip, your maximum RPS is the database’s problem, not the runtime’s.

The core strategy is:

  1. Move repeated reads into cache
  2. Precompute expensive joins and aggregates
  3. Partition traffic by tenant, key, or geography
  4. Keep data close to where the request is served

If you have to choose between shaving 5% off Node runtime overhead and eliminating one backend hop, eliminate the hop every time.
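Step 1 usually reduces to a read-through helper: check the cache, hit the database only on a miss, and populate the cache on the way out. A sketch, assuming a hypothetical cache client with async get/set and an illustrative TTL:

```typescript
// Read-through: cache hit skips the database hop entirely; a miss loads
// from the database and writes back with a TTL.
async function readThrough<T>(
  key: string,
  loadFromDb: () => Promise<T>,
  cache: {
    get(k: string): Promise<T | null>;
    set(k: string, v: T, ttlMs: number): Promise<void>;
  },
  ttlMs = 30_000,
): Promise<T> {
  const cached = await cache.get(key);
  if (cached !== null) return cached; // hot path: no database roundtrip
  const fresh = await loadFromDb();
  await cache.set(key, fresh, ttlMs);
  return fresh;
}
```

Pair this with the coalescing pattern from earlier so that a cold key triggers one database load rather than one per concurrent request.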

9. Watch the event loop, not just latency

Latency dashboards alone hide the reason a Node service is degrading. Event loop lag tells you when the process is losing control of its scheduling budget.

TypeScript
import { monitorEventLoopDelay } from "node:perf_hooks";
 
// `metrics` is a placeholder for your stats client (StatsD, Prometheus, ...)
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
 
setInterval(() => {
  // percentile() reports nanoseconds; convert to milliseconds
  metrics.gauge("node.event_loop.p99_ms", histogram.percentile(99) / 1e6);
  histogram.reset();
}, 5000);

Track at least:

  • requests per second
  • p50, p95, p99 latency
  • event loop lag
  • open connections
  • active handles
  • GC pause time
  • cache hit ratio
  • upstream dependency latency
  • rejection rate from load shedding

If you are not graphing rejected requests, you are blind to whether your overload controls are helping or merely hiding failure.

10. Tune the host, not just the app

At very high throughput, OS and network defaults matter.

Typical areas worth reviewing:

  • file descriptor limits
  • socket backlog
  • ephemeral port exhaustion
  • NIC offload settings
  • IRQ balancing
  • somaxconn
  • TCP TIME_WAIT behavior
  • container CPU throttling

For example, your server backlog and open file limits must align with real traffic:

Bash
# raise the per-process fd limit for this shell (persist via limits.conf)
ulimit -n 1048576
# deepen the accept queue; the app's listen() backlog must match to benefit
sysctl -w net.core.somaxconn=65535
# widen the ephemeral port range for outbound connections
sysctl -w net.ipv4.ip_local_port_range="2000 65000"

Do not cargo-cult kernel flags from random blog posts. Benchmark each change on hardware and traffic patterns that resemble production.

11. Benchmark with production realism

Most “Node handled 1M RPS” demos measure the easiest possible thing:

  • tiny static payload
  • no auth
  • no cache miss
  • no database
  • one route
  • localhost networking

Those benchmarks are useful only for upper-bound mechanics. They are not proof your service design is ready.

When load testing, vary:

  • payload sizes
  • hot keys versus cold keys
  • keep-alive versus connection churn
  • success versus failure paths
  • cache-hit and cache-miss ratios
  • multi-tenant noisy-neighbor scenarios

And test failure explicitly:

  1. Kill a cache node
  2. Add 50 ms latency to a dependency
  3. Reduce one availability zone
  4. Force a deploy during peak load

If the system only survives the happy path, it does not really survive 1M RPS.

12. A reference architecture that works

For a practical high-throughput Node deployment, this is the shape I would start from:

Text
CDN
  -> Envoy / NGINX layer with TLS termination and edge rate limiting
  -> Kubernetes service or L4 balancer
  -> Node Fastify pods, one worker per core
  -> Redis/KeyDB for hot reads and counters
  -> Kafka/NATS/SQS for non-blocking async work
  -> partitioned database tier for true source-of-truth reads

Operational rules:

  • stateless app pods
  • no in-memory sessions
  • no blocking work in handlers
  • all expensive side effects through queues
  • all hot reads cache-first
  • autoscaling driven by CPU, event-loop lag, and saturation signals together

This is not glamorous architecture. That is exactly why it scales.

13. The uncomfortable truth

The trick to handling 1M+ requests per second in Node.js is not a secret V8 flag or a different promise library.

It is mostly this:

  • simplify the request path
  • drop work early
  • push work to caches and proxies
  • scale horizontally
  • instrument overload before users find it
  • keep Node away from CPU-heavy and blocking tasks

Node is very good at high-concurrency I/O. It is unforgiving when teams ask it to behave like an all-purpose compute engine under extreme load.

Respect that boundary and it will scale much further than most teams expect.


If you’re designing for traffic in this range, start by measuring where requests die today: CDN, proxy, cache, origin, database. The path to 1M RPS is usually not “make Node faster.” It is “stop spending Node on work the rest of the stack should have absorbed already.”
