11 min read · caching · performance · distributed-systems

Cache Invalidation at Scale: The Four Strategies That Survive Production

Four cache-invalidation strategies that actually hold up at scale — TTL, write-through, event-driven bust, and tag-based invalidation — plus request coalescing and stale-while-revalidate.

Rahul Gupta
Senior Software Engineer

The “two hard things in computer science” joke is a meme because it is true. Most cache bugs I have debugged were invalidation bugs. The cache itself almost never fails. What fails is the contract between the thing that wrote the data and the thing that now has to forget it.

Teams usually adopt caching as a performance fix and treat invalidation as an afterthought. Six months later, someone is refunding a user who saw a stale balance, an admin is paging the on-call because the feature flag did not flip, or a marketing page is serving the pre-launch copy at 9:01 a.m. on launch day.

The good news: you do not need a dozen strategies. Four actually survive production. Pick the weakest one that still meets your consistency bar, then harden it with a couple of well-known adjuncts.

1. The only question worth asking first

Before you pick a strategy, answer one question out loud.

How wrong is the cache allowed to be, and for how long?

That one sentence usually writes the rest of the design.

  • Session data stale by 30 seconds: fine.
  • A catalog price stale by 30 seconds: probably fine.
  • A feature flag stale by 30 seconds: sometimes catastrophic.
  • An authorization check stale by 30 seconds: almost always catastrophic.

Everything that follows is a way of bounding that wrongness. If you cannot answer that question, no caching strategy will save you, because you do not yet know what you are trying to guarantee.

2. Strategy one: TTL-only

The simplest thing that works. Every key gets a time-to-live, and the cache forgets it when the timer expires.

REDIS
SET user:42:profile "{\"name\":\"Ada\",\"plan\":\"pro\"}" EX 60

That is the whole design. No invalidation code path. The cache becomes correct again on its own, eventually.

When it is the right default

  • Data where staleness is measured in seconds and the product tolerates it.
  • Computed aggregates that are expensive to generate and change slowly.
  • External API responses where you control the cache but not the source.
  • Anything where “wrong for up to N seconds” is literally fine.

How it breaks

  • Invalidate-on-write is impossible. The cache has no idea a write happened.
  • TTL tuning is a dark art. Too short and the cache is useless. Too long and stale data bites you at the worst time.
  • Every TTL expiry is a potential stampede. If 10k keys expire at the same second and 10k requests race to rebuild them, your origin goes down.
  • Clock skew across nodes means “60 seconds” is a range, not a point.

Ops cost

Near zero to build. Modest to operate, because you will almost certainly need request coalescing and jittered TTLs once traffic grows. More on those below.
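A jittered TTL is nearly a one-liner. This is a minimal sketch; the ±10% spread (and the 60-second base in the usage comment) are illustrative numbers, not recommendations:

```typescript
// Jittered TTL sketch: spread expiries so keys written in the same burst
// do not all expire in the same second.
function jitteredTtl(baseTtlSec: number, spread = 0.1): number {
  const delta = baseTtlSec * spread;
  // uniform draw in [base - delta, base + delta)
  return Math.round(baseTtlSec - delta + Math.random() * 2 * delta);
}

// usage with a hypothetical redis client:
// await redis.set(key, value, "EX", jitteredTtl(60));
```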

If your answer to “how wrong is it allowed to be” is a number of seconds, start here and only move up when you have a concrete reason.

3. Strategy two: write-through and write-behind

Here the application writes to the cache as part of the write path, not just the read path.

  • Write-through: write the DB and the cache synchronously. The cache is always consistent with the last committed write from this service.
  • Write-behind: write the cache, acknowledge, then flush to the DB asynchronously. Fast, but now the cache is the source of truth for a window, and you inherit all the consistency problems that implies.

I almost never use write-behind in general-purpose systems. It is a specialized pattern for write-heavy workloads where durability is handled somewhere else. Write-through, on the other hand, is a very sensible default when one service owns both the writes and the reads.

When it is the right default

  • A single service owns the record and all its mutations.
  • Reads vastly outnumber writes, and you want the cache hot immediately after a write.
  • The record is simple enough that the cache entry is a straightforward projection of the row.

How it breaks

  • If any other service or job writes the DB directly, the cache silently goes stale. This is the most common failure mode.
  • Multi-key updates are not atomic across the cache and the DB. A DB transaction can commit while the cache write fails, or vice versa. You end up reinventing two-phase commit badly.
  • Cache writes on the write path add latency and a new failure mode to every mutation.

Ops cost

Moderate. You need a strategy for partial failure (DB committed, cache write failed). The usual answer is “prefer to fail the cache write loud and let a short TTL backstop it.” Do not silently swallow cache errors in write paths.
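A minimal sketch of that policy, with hypothetical `commitToDb` and `writeCache` callbacks standing in for a real DB client and Redis client:

```typescript
// Write-through with loud partial-failure handling. The DB commits first
// because it is the source of truth; a failed cache write is surfaced,
// never swallowed, and a short TTL on any stale entry bounds the damage.
async function writeThrough(
  commitToDb: () => Promise<void>,
  writeCache: () => Promise<void>,
): Promise<void> {
  await commitToDb(); // DB first: source of truth
  try {
    await writeCache(); // e.g. redis.set(key, value, "EX", 60)
  } catch (err) {
    // DB committed but the cache write failed: alert, rethrow, and let
    // the short TTL backstop correctness.
    console.error("write-through cache update failed after DB commit", err);
    throw err;
  }
}
```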

4. Strategy three: event-driven bust

The write path does not touch the cache. It emits an event, and a consumer invalidates cache keys in response.

Text
service A commits order
  -> publishes order.updated { orderId, tenantId, version }
  -> cache-invalidator consumes
  -> DEL order:<orderId>, order:list:<tenantId>, ...

This is the pattern I reach for when more than one service can mutate the data, or when a single write fans out across many cached views.

When it is the right default

  • Multiple producers write the underlying data.
  • One write affects many cached views (list pages, aggregates, denormalized reads).
  • You already have a broker in the stack and can afford another consumer.

How it breaks

  • Events arrive out of order. If you do not attach a monotonic version, a late update event can clobber a fresher state. Compare versions before busting.
  • Events get lost, duplicated, or delayed. The cache can be stale for the length of your consumer lag. On a bad day, that is minutes.
  • Two writes race: one bust lands before the other write’s commit is visible to readers. You bust, a reader repopulates from a replica that has not caught up, and now the cache is stale again with a fresh TTL.
  • No broker is free. Choosing between Kafka, Pulsar, and JetStream is a separate conversation, and I wrote about that elsewhere. For this pattern, any durable ordered stream works.

Ops cost

Real. You inherit everything that comes with event pipelines: DLQs, consumer lag alerts, idempotent handlers, replay tooling. The payoff is decoupling the write path from every cached view that derives from it.

The usual hardening: carry a version or commit timestamp on the event and check it against what is in the cache before deleting, so a late event cannot stomp on a newer value.
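That check can be sketched in a few lines. A plain Map stands in for Redis here, and the event shape follows the order.updated example above:

```typescript
type CacheEntry = { value: string; version: number };
const cache = new Map<string, CacheEntry>();

type OrderUpdated = { orderId: string; version: number };

// Bust only if the event is newer than the version the cache entry was
// built from; a late or duplicate event is dropped instead of deleting
// a fresher entry and re-exposing old state on the next rebuild.
function handleOrderUpdated(ev: OrderUpdated): boolean {
  const key = `order:${ev.orderId}`;
  const entry = cache.get(key);
  if (entry && entry.version >= ev.version) return false; // stale event
  cache.delete(key);
  return true;
}
```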

5. Strategy four: tag-based and surrogate-key invalidation

This is the one most teams do not know about until they need it. Instead of tracking which exact keys to invalidate when an entity changes, you tag cache entries with the entities they depend on. When an entity changes, you bust every entry that carries that tag.

Fastly and Varnish have had surrogate keys for years. Redis does not have first-class tags, but you can build them cheaply.

REDIS
SET product:17 "<json>" EX 300
SET catalog:page:3 "<json>" EX 300
SET search:shoes:men "<json>" EX 300

SADD tag:product:17 product:17 catalog:page:3 search:shoes:men
SADD tag:tenant:42 product:17 catalog:page:3 search:shoes:men

When product 17 changes, a single operation invalidates every dependent entry:

REDIS
EVAL "local keys = redis.call('SMEMBERS', KEYS[1])
      if #keys > 0 then redis.call('DEL', unpack(keys)) end
      redis.call('DEL', KEYS[1])
      return #keys" 1 tag:product:17

The same works in Postgres if you are using it as a cache table:

SQL
-- one row per cache entry, plus a mapping table
CREATE TABLE cache_entry (
  key        text primary key,
  value      jsonb not null,
  expires_at timestamptz not null
);
 
CREATE TABLE cache_tag (
  tag text not null,
  key text not null,
  primary key (tag, key)
);
 
-- invalidate every entry tagged with product:17
DELETE FROM cache_entry
 WHERE key IN (SELECT key FROM cache_tag WHERE tag = 'product:17');
 
DELETE FROM cache_tag WHERE tag = 'product:17';

When it is the right default

  • One logical change fans out across many cached projections.
  • You have CDN or edge caches that already support surrogate keys.
  • You need multi-tenant invalidation (bust everything for tenant X).
  • Feature flags, pricing changes, or catalog edits must propagate broadly.

How it breaks

  • Tag sets grow unbounded if you never expire them. Always attach a TTL to the tag set itself, slightly longer than the entries it points at.
  • A missed tag at write time is invisible. There is no warning that an entry should have carried a tag and did not. Tag application has to be part of the cache-write helper, never hand-written at call sites.
  • Bulk invalidations can stampede the origin. If one tag points at 50k entries, a single bust can take 50k keys cold at once. Pair with coalescing and SWR.

Ops cost

Higher than TTL-only, lower than a full event pipeline. The engineering investment is mostly up front: make the cache helper enforce tags and give it a single invalidateTag(tag) API.
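One possible shape for that helper, sketched with in-memory Maps in place of Redis. `setWithTags` is a hypothetical name; the point is that writes must declare their tags and `invalidateTag(tag)` is the only bulk-invalidation entry point:

```typescript
const entries = new Map<string, string>();
const tagIndex = new Map<string, Set<string>>(); // tag -> dependent keys

// The helper refuses untagged writes, so a missed tag fails at write
// time instead of silently producing an un-invalidatable entry.
function setWithTags(key: string, value: string, tags: string[]): void {
  if (tags.length === 0) throw new Error(`cache write for ${key} must carry tags`);
  entries.set(key, value);
  for (const tag of tags) {
    let keys = tagIndex.get(tag);
    if (!keys) {
      keys = new Set();
      tagIndex.set(tag, keys);
    }
    keys.add(key);
  }
}

// Bust every entry that depends on the tag; in real Redis the tag set
// itself would also carry a TTL so it cannot grow unbounded.
function invalidateTag(tag: string): number {
  const keys = tagIndex.get(tag);
  if (!keys) return 0;
  for (const key of keys) entries.delete(key);
  tagIndex.delete(tag);
  return keys.size;
}
```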

6. Request coalescing: the thundering herd defense

Every caching strategy above has the same failure mode at the moment of a miss: thousands of readers all miss the same key and all race to the origin. The fix is simple and almost always worth it.

Only let the first miss rebuild. Make every other concurrent request wait for that result.

TypeScript
type Loader<T> = () => Promise<T>;
 
const inflight = new Map<string, Promise<unknown>>();
 
export async function coalesce<T>(key: string, load: Loader<T>): Promise<T> {
  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>;
 
  const p = (async () => {
    try {
      return await load();
    } finally {
      // remove only after the promise settles so every concurrent miss shares it
      inflight.delete(key);
    }
  })();
 
  inflight.set(key, p);
  return p;
}
 
// usage
async function getProduct(id: string) {
  const cached = await redis.get(`product:${id}`);
  if (cached) return JSON.parse(cached);
 
  return coalesce(`product:${id}`, async () => {
    const row = await db.product.findById(id);
    await redis.set(`product:${id}`, JSON.stringify(row), "EX", 300);
    return row;
  });
}

That is the in-process version. For cross-process coalescing, you need a short-lived lock in Redis (SET lock:product:17 1 NX PX 500) and a retry-with-backoff path for the losers. Libraries like singleflight in Go and dataloader in Node give you the in-process variant for free.

Coalescing will not fix a cache that is wrong. It will fix a cache that falls over on a miss.

7. Stale-while-revalidate

The other universal adjunct. Keep serving the stale value while you asynchronously refresh it.

Pseudocode:

  1. Read from cache. If fresh, return.
  2. If expired but within a grace window, return the stale value and kick off a background refresh.
  3. If beyond the grace window, block and refresh.

HTTP has this baked in with Cache-Control: max-age=60, stale-while-revalidate=600. CDNs honor it out of the box. You can implement the same in Redis by storing { value, freshUntil, staleUntil } triples and deciding which branch to take on read.
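The Redis variant's read path can be sketched like this, with a Map standing in for Redis and the clock passed in explicitly for clarity:

```typescript
// SWR read over { value, freshUntil, staleUntil } entries. "stale" tells
// the caller to serve the value and kick off a background refresh;
// "miss" means block and rebuild.
type SwrEntry = { value: string; freshUntil: number; staleUntil: number };
const store = new Map<string, SwrEntry>();

type ReadResult =
  | { state: "fresh" | "stale"; value: string }
  | { state: "miss" };

function swrRead(key: string, now: number): ReadResult {
  const e = store.get(key);
  if (!e || now > e.staleUntil) return { state: "miss" };
  if (now <= e.freshUntil) return { state: "fresh", value: e.value };
  return { state: "stale", value: e.value }; // serve stale, refresh async
}
```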

This turns a TTL expiry from a latency cliff into a small background cost. Users almost never see a rebuild. Your p99 stops looking like a sawtooth.

The tradeoff is honest: the cache can serve data that is slightly older than max-age for up to the grace window. If that is unacceptable for the endpoint, do not use it there.

8. Stampede prevention with probabilistic early expiration

TTLs expire at a single instant. If a hot key expires at T, every reader at T+1ms misses together.

The fix is to treat expiry as a probability that ramps up as you approach the TTL. The XFetch algorithm is the textbook version: a reader occasionally chooses to refresh a key slightly before it expires, weighted by how close to expiry it is. The probability is near zero at T - 10s and close to one at T.

The effect: one lucky reader rebuilds the key a few hundred milliseconds early, the rest continue to hit. No thundering herd at the expiry boundary.

You do not need a library. Store the compute time of the last rebuild alongside the value, and let readers roll the dice. Even a coarse implementation measurably flattens origin traffic.
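A coarse sketch of the XFetch check, following the published formula: refresh early when now − Δ·β·ln(rand) crosses the expiry, where Δ is the measured cost of the last rebuild and β ≥ 1 tunes eagerness. Since ln(rand) is negative, the subtraction adds a positive, exponentially distributed head start:

```typescript
// Probabilistic early expiration. Far from expiry the jitter term almost
// never bridges the gap; close to expiry one lucky reader wins the roll
// and rebuilds before the cliff.
function shouldRefreshEarly(
  nowMs: number,
  expiryMs: number,
  deltaMs: number, // duration of the last rebuild, stored with the value
  beta = 1.0,
  rand: () => number = Math.random,
): boolean {
  return nowMs - deltaMs * beta * Math.log(rand()) >= expiryMs;
}
```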

9. Negative caching

Successful lookups are not the only thing worth caching. Cache the failures too.

If a product lookup returns “not found” or a downstream dependency throws, caching that result for a short window prevents a repeated hammer on a broken path. The important constraints:

  • Negative TTLs should be short. Seconds, not minutes. A user fixing the underlying issue should not wait out a 10-minute negative cache.
  • Distinguish “not found” from “backend error.” A 404 can cache longer than a 503.
  • Never negative-cache auth failures. Revocations must propagate fast.

A good negative cache turns an incident where the origin is down into a brownout where most pages still render and only the genuinely affected endpoints show errors.
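Those constraints can be sketched in code. The TTL numbers are illustrative, and a Map stands in for Redis:

```typescript
// Negative cache that distinguishes "not found" from "backend error"
// and gives each its own short TTL.
type Negative = { kind: "not_found" | "backend_error"; expiresAt: number };
const negatives = new Map<string, Negative>();

const NOT_FOUND_TTL_MS = 30_000; // a 404 can cache a bit longer
const ERROR_TTL_MS = 2_000;      // a 503 should clear fast

function recordNegative(key: string, kind: Negative["kind"], now: number): void {
  const ttl = kind === "not_found" ? NOT_FOUND_TTL_MS : ERROR_TTL_MS;
  negatives.set(key, { kind, expiresAt: now + ttl });
}

function isNegativelyCached(key: string, now: number): boolean {
  const n = negatives.get(key);
  if (!n) return false;
  if (now > n.expiresAt) {
    negatives.delete(key); // lazily expire
    return false;
  }
  return true;
}
```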

10. Versioned keys for safe deploys

The single best trick for deploy-time cache correctness: put a version in the key.

TypeScript
const CACHE_VERSION = "v7";
const key = `${CACHE_VERSION}:product:${id}`;

When the shape of the cached value changes, bump CACHE_VERSION. The old cache ages out on its own TTL. The new code only ever reads and writes the new namespace. No coordinated flush, no “please clear Redis before we deploy,” no mixed-shape reads during the rollout window.

Same trick works for user-scoped invalidation. Store user:42:cacheVersion and embed it in every key for that user. When you need to invalidate everything for that user, increment the version. Every existing key becomes unreachable atomically.

The cost is a small extra lookup (or a local cache of the version). The payoff is that you stop writing imperative invalidation code for the “something broad just changed” cases.
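The user-scoped variant can be sketched like this, with Maps standing in for Redis and a local version lookup (`userKey` and `invalidateUser` are illustrative names):

```typescript
const userVersions = new Map<string, number>();
const cache = new Map<string, string>();

// Every key for a user embeds that user's current cache version.
function userKey(userId: string, suffix: string): string {
  const v = userVersions.get(userId) ?? 1;
  return `u${v}:user:${userId}:${suffix}`;
}

// Bumping the version makes every existing key for the user unreachable
// atomically; the orphaned entries age out on their own TTLs.
function invalidateUser(userId: string): void {
  userVersions.set(userId, (userVersions.get(userId) ?? 1) + 1);
}
```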

11. How these four combine in a real system

A non-trivial service almost always uses more than one of these. A realistic pattern:

  • Reference data (countries, currencies, feature flags with eventually-consistent semantics): TTL-only with a long grace, backed by SWR. Cheap, boring, correct enough.
  • User-owned records (profile, settings, session-scoped data): write-through from the service that owns them, versioned keys per user.
  • Multi-producer domain data (orders, inventory, catalog): event-driven bust with version checks, backed by a TTL as a last-resort correctness guarantee.
  • Derived views (list pages, search results, aggregated dashboards): tag-based invalidation on top of TTL, with coalescing on misses.

The TTL is almost always present as the backstop. Every other strategy can miss an invalidation under some failure mode. TTL guarantees that whatever went wrong becomes right again within a bounded window. That backstop is the reason you can sleep.

12. What I would not do

A few patterns that look appealing and almost always cause more pain than they save.

  • Cache-aside with no TTL. If you assume every invalidation path is perfect, one of them is quietly broken right now and you do not know which.
  • Manual per-endpoint invalidation. If your codebase has hand-written redis.del("cache:foo") calls scattered across handlers, you will miss one. Centralize invalidation behind a helper and make the helper the only way.
  • Write-behind as a default. The latency win is real. The data-loss surface is larger than most teams admit. Reach for it only when you have already solved durability elsewhere.
  • Unbounded tag sets. Tags without TTLs are memory leaks with an ops incident attached.
  • Busting the cache from inside a DB transaction. If the transaction rolls back after the bust, you have invalidated something that did not change. Bust after commit, not inside it.

13. The decision rule

The rule I keep coming back to:

Pick the weakest strategy that meets your consistency bar, then harden it with coalescing and SWR.

In order of strength (and cost):

  1. TTL-only.
  2. TTL + write-through for single-owner data.
  3. TTL + event-driven bust for multi-producer data, versioned.
  4. TTL + tag-based invalidation for fan-out views.

Coalescing and SWR sit under all four. They do not change what the cache contains. They change how the cache behaves at the hardest moments: miss storms and expiry cliffs.

If you start at (1) and move up only when reality forces you, you end up with a cache layer that is cheap to reason about, cheap to operate, and cheap to debug at 3 a.m.


Caching is not a performance problem with a clever library solution. It is a consistency problem with a latency budget. The four strategies above are the ones I have seen survive real traffic without becoming the reason an incident exists. Everything else is either a variation of these or a well-disguised way to lose data.
