11 min read · reliability · sre · distributed-systems

Graceful Degradation: What Circuit Breakers Miss

Circuit breakers are the entry-level answer. Real graceful degradation is load shedding, priority queues, bulkheads, adaptive concurrency, and shipping partial responses before the system falls over.

Rahul Gupta
Senior Software Engineer

Every team hits the same reliability wall eventually. A downstream service gets slow, timeouts stack up, threads or goroutines start piling up, and the whole request path turns to glass. Someone mentions circuit breakers. A library gets added. A few diagrams get drawn.

A quarter later, the same outage happens with a different failure shape.

Circuit breakers are the 101 tool. They trip when a dependency is already broken and stop you from pouring good work into a broken pipe. They are useful. They are also the part of graceful degradation that teams finish first and then stop thinking about.

The real work starts when the system itself is the bottleneck, not the dependency. CPU is saturated. Event loop latency is climbing. A queue is growing faster than the consumer can drain it. A breaker tripping somewhere downstream does nothing for any of that.

Graceful degradation is not “fail fast when something else breaks.” It is “get slower, then simpler, then narrower, and always return something useful before you die.”

1. Circuit breakers solve one narrow problem

A circuit breaker is a state machine around a specific dependency call. Closed, open, half-open. It watches errors or latency for a single downstream and cuts the call when that downstream looks unhealthy.

That is a real problem. That is also all it does.

What breakers do not address:

  • the caller’s own CPU and memory pressure
  • request queue growth inside your service
  • slow consumers on your side of the wire
  • overload from legitimate traffic with no downstream failures at all
  • work that is technically succeeding but costing more than it is worth

If your service is dying because 40k requests per second are arriving and you can only handle 18k, no breaker will save you. Every request is succeeding on paper. The machine is just on fire.

Breakers are a shield against downstream rot. Graceful degradation is a posture for the whole system.

2. The modes model: green, yellow, red

The cleanest way I have found to design for degradation is to define operating modes up front, as part of the service contract, not as a post-incident scramble.

Text
green   -> full feature set, full latency budget, full fidelity
yellow  -> partial responses, cached fallbacks, non-critical work deferred
red     -> read-only or cached-only, synthetic responses, no writes

Each mode defines:

  • which endpoints are available
  • which downstream calls are allowed
  • which background work is suspended
  • what the response shape looks like
  • what signals force the transition

A well-designed service ships the yellow and red modes with the green one. They are not emergency patches. They are product states.

Operators do not promote to red by editing config on a live box. A signal crosses a threshold, the mode flips, and the behavior changes in a way that was reviewed in code review.

This turns “the site is down” into “the site is in red mode for 11 minutes.” Different conversation.
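The mode machinery itself is small. Here is a minimal Go sketch of signal-driven transitions; the signal names and thresholds are illustrative assumptions, not a prescription, and a real service would tune them per endpoint and add hysteresis so the mode does not flap:

```go
package main

import "fmt"

type Mode int

const (
	Green Mode = iota
	Yellow
	Red
)

// Signals is a hypothetical snapshot of the health signals that
// drive transitions. Queue age and shed rate are one reasonable
// choice; CPU or event-loop lag would work the same way.
type Signals struct {
	QueueAgeMs float64
	ShedRate   float64 // fraction of requests shed
}

// Evaluate maps signals to a mode with fixed thresholds. The point
// is that the mapping is code, reviewed up front, not a live edit.
func Evaluate(s Signals) Mode {
	switch {
	case s.ShedRate > 0.25 || s.QueueAgeMs > 2000:
		return Red
	case s.ShedRate > 0.05 || s.QueueAgeMs > 500:
		return Yellow
	default:
		return Green
	}
}

func main() {
	fmt.Println(Evaluate(Signals{QueueAgeMs: 800, ShedRate: 0.01})) // prints 1 (Yellow)
}
```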

3. Load shedding beats queue-and-pray

The default behavior of most servers is bad. A new request arrives, a thread or goroutine picks it up, and it waits in line for a database connection or a downstream call. That line is almost always unbounded.

When traffic spikes, the queue grows. Latency grows with it. Clients time out, retry, and add even more work. The system runs hotter and hotter while succeeding less and less. This is the classic metastable failure. No dependency is broken. The system just cannot climb out.

Load shedding is the act of rejecting work you know you cannot serve in time, quickly, and cheaply, so the work you accept finishes well.

The key ideas:

  • reject at the edge, not after you have already paid the cost
  • reject with a clear signal (429, 503 with Retry-After)
  • reject the lowest-value work first
  • measure queue depth and age, not just CPU

The hard part is not the rejection. It is admitting that serving 70 percent of traffic well is better than serving 100 percent badly.

4. Adaptive concurrency, not fixed thread pools

Fixed concurrency limits are how most teams start. Pick a number, set the pool size, move on. That number is wrong within a quarter. The database got faster. A new endpoint got slower. A neighbor service changed latency. Your constant stopped matching reality.

Adaptive concurrency treats the limit like a TCP window. You probe, observe latency, and adjust. Netflix open-sourced this pattern in their concurrency-limits library and the core idea is simple: use observed latency and in-flight count to infer the current safe concurrency, and shed anything above it.

A minimal sketch in Go, using a Vegas-style additive increase, multiplicative decrease:

Go
package main

import (
	"sync"
	"time"
)

type AdaptiveLimiter struct {
	mu       sync.Mutex
	limit    int
	inFlight int
	minRTT   time.Duration
}

// NewAdaptiveLimiter starts from a conservative initial limit and
// lets the feedback loop probe up. Starting at zero would admit nothing.
func NewAdaptiveLimiter(initial int) *AdaptiveLimiter {
	return &AdaptiveLimiter{limit: initial}
}

func (l *AdaptiveLimiter) Acquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight >= l.limit {
		return false // shed
	}
	l.inFlight++
	return true
}

func (l *AdaptiveLimiter) Release(rtt time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.inFlight--

	if l.minRTT == 0 || rtt < l.minRTT {
		l.minRTT = rtt
	}

	// Vegas-style queue estimate: the share of the observed RTT that
	// is queueing rather than service time, scaled by in-flight work.
	queueSize := float64(l.inFlight) *
		(1 - float64(l.minRTT)/float64(rtt))

	switch {
	case queueSize < 0.5:
		l.limit++ // headroom, probe up
	case queueSize > 2.0:
		l.limit = max(1, l.limit-1) // back off
	}
}

The details vary. The principle does not. The server finds the concurrency at which latency stays near its observed minimum, and rejects anything past that.

Two things this gets you that fixed pools do not:

  1. it tracks real capacity across deploys, dependency changes, and noisy neighbors
  2. it sheds early, when queue size is small, instead of late, when latency is already ruined

If your service still uses a hand-tuned MaxConcurrent = 200, that is the first thing I would replace.

5. Priority queues: not all requests are worth the same

Once you accept that you will sometimes shed, the next question is which work to shed first.

The answer is almost never uniform. A healthcheck from a load balancer and a background analytics poll and a user-facing checkout request are not equivalent. Treating them as interchangeable requests-per-second is how critical flows lose to cron jobs during a spike.

Classify traffic into priority bands and shed from the bottom:

  • P0 user-facing critical path (auth, checkout, payments)
  • P1 user-facing non-critical (recommendations, history)
  • P2 internal async (workers, projections, indexing)
  • P3 best-effort (analytics, prefetch, warmup)

A simple YAML contract at the ingress layer:

YAML
priority_bands:
  p0:
    matches:
      - header: "x-request-class"
        equals: "critical"
      - path_prefix: ["/auth", "/checkout", "/pay"]
    min_concurrency: 80     # reserved even under pressure
    shed_order: last
  p1:
    matches:
      - path_prefix: ["/home", "/search"]
    min_concurrency: 0
    shed_order: 3
  p2:
    matches:
      - header: "x-request-class"
        equals: "worker"
    shed_order: 2
  p3:
    matches:
      - header: "x-request-class"
        equals: "background"
    shed_order: 1           # shed first
 
global:
  max_concurrency: adaptive
  shed_response:
    status: 503
    headers:
      retry-after: "2"

Two non-obvious rules I hold to:

  • reserved concurrency for P0 is not a suggestion, it is a hard floor
  • clients must tag their own requests; the server cannot guess priority reliably from path alone

The client tag is the part teams resist. They want the platform to magically know. The platform cannot. Make the contract explicit.
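The reserved-floor rule from the YAML above can be enforced in a few lines. In this sketch, lower bands admit against `total - reserved` while P0 admits against the full capacity, so background work can never occupy the critical path's slots; the capacity numbers are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// priorityAdmitter enforces a hard floor for P0: lower bands can
// never occupy the slots reserved for the critical path.
type priorityAdmitter struct {
	mu       sync.Mutex
	total    int // total concurrency
	reserved int // slots only P0 may use
	inFlight int
}

func (a *priorityAdmitter) Admit(p0 bool) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	limit := a.total - a.reserved
	if p0 {
		limit = a.total
	}
	if a.inFlight >= limit {
		return false // shed from the bottom
	}
	a.inFlight++
	return true
}

func (a *priorityAdmitter) Done() {
	a.mu.Lock()
	a.inFlight--
	a.mu.Unlock()
}

func main() {
	a := &priorityAdmitter{total: 3, reserved: 1}
	fmt.Println(a.Admit(false), a.Admit(false), a.Admit(false)) // true true false
	fmt.Println(a.Admit(true))                                  // true: the reserved slot
}
```

A full implementation would track in-flight per band so shedding follows the `shed_order` exactly, but the floor is the part that matters under pressure.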

6. Bulkheads: one bad consumer does not sink the ship

Bulkheads come from shipbuilding. Seal the compartments. One flooded compartment does not drown the whole hull.

In software, a bulkhead is a dedicated resource pool for a class of work. Separate thread pools, separate connection pools, separate queues. When one class goes bad, it consumes its own pool and nothing else.

Common bulkhead boundaries:

  • per downstream dependency (one slow dependency cannot exhaust the main pool)
  • per tenant or customer (one noisy tenant cannot starve others)
  • per request class (background jobs cannot consume the connection pool the user-facing path needs)
  • per endpoint shape (expensive analytics endpoints get a smaller pool than cheap reads)

A classic failure without bulkheads: the recommendations service starts returning 30-second responses. Every request to your API tries to enrich the response, waits 30 seconds, and holds a database connection the entire time. The pool drains. Login starts failing because it cannot get a connection. A slow feature took down auth.

With bulkheads, the recommendations calls queue against their own pool, hit their own timeout, and get served a fallback. Login has its own pool and never notices.

The cost is complexity. More pools, more sizing decisions, more metrics. It is worth it. Bulkheads are what let one thing fail without everything failing.

7. Timeouts at every hop, with jitter

This one is boring and people still skip it.

  • every network call has an explicit timeout
  • timeouts shrink as you go deeper (inner calls time out before outer calls)
  • retries use exponential backoff with jitter
  • retries have a budget (retry ratio, not just retry count)

If your outer request budget is 800 ms, the database call inside it is not allowed to spend 2 seconds. If the token service has a 150 ms p99, the call to it does not get a 5 second timeout. The timeout is a promise about when you will stop waiting, not a cap on how bad things can get.

Retries are where people cause their own outages. A retry storm during a partial failure can multiply load 3x just when the system can least absorb it. Tie retries to a budget: “at most 10 percent of traffic is retries.” When the budget is exhausted, failures propagate.

Jitter on backoff is not optional. Synchronized retries from thousands of clients recreate the spike they were supposed to smooth.

8. Partial responses beat full failures

A checkout page needs user info, cart, shipping estimate, recommendations, and a trust badge. If recommendations time out, the page does not have to 500. It has to render without recommendations.

Partial responses are a design decision. They require:

  • per-field or per-section fallbacks declared explicitly
  • a response shape where missing sections are allowed, not an error
  • a client that can render a degraded view without flickering
  • a clear signal to observability that a partial response was served

The backend shape:

TypeScript
type CheckoutResponse = {
  user: UserInfo;             // required
  cart: Cart;                 // required
  shipping: Shipping | null;  // may be null in degraded mode
  recommendations: Rec[];     // always safe to be empty
  trustBadges: Badge[] | null;
  degradedSections?: string[]; // ["recommendations", "trustBadges"]
};

The rule is simple. If the section is part of the core promise (user, cart), it is required and a failure means 5xx. If the section is supplementary (recs, badges, shipping estimate), it is nullable and its failure is a yellow-mode signal, not an outage.

Every time a section degrades, the response includes degradedSections so the client can log it, the frontend can decide what to show, and dashboards can count it. Partial responses that are invisible to operators are a different kind of debt.

9. Degraded modes are features, not hacks

The yellow and red modes only work if they are real code paths that get exercised.

  • Read-only mode. Writes return 503 with a meaningful message, reads continue. The write path can be flipped off at the gateway, or writes can be enqueued for later if your domain tolerates it.
  • Cached-only mode. Skip the origin. Serve from the nearest cache or CDN even if slightly stale. The freshness SLA gets relaxed and the system stops hammering the backend while it recovers.
  • Synthetic responses. Return a default that is safe. Empty arrays for list endpoints. A cached snapshot from an hour ago for a dashboard. A “service temporarily limited” flag the UI can render explicitly.
  • Feature flags wired to mode. Heavy features like personalization, full-text search rerankers, or ML scoring turn off automatically in yellow. They are nice-to-haves in green. They are off in yellow by default.

The test I apply is blunt. If a mode has never served real production traffic, it does not exist. The first outage is not the place to discover that your read-only path has a bug because no write has been rejected in staging for a year.

10. Testing degradation: chaos plus synthetic load

You cannot guess how your service behaves at 3x load. You have to induce it.

The practices that actually work:

  • Load tests that go past capacity, not up to it. The interesting part of the curve starts after p99 latency diverges. Test until the shedding kicks in, verify that P0 traffic survives, and measure how fast the system recovers when load drops.
  • Dependency failure injection. Kill a database replica. Inject 500 ms of latency into a downstream. Drop 5 percent of packets on a service mesh hop. See if the bulkheads hold.
  • Mode drills. Flip the service into yellow in a staging environment with production-shaped traffic. Watch what breaks. Flip to red. Confirm it can exit cleanly.
  • Client retry pathology tests. Simulate a buggy client that retries aggressively. Confirm the retry budget holds and that one client cannot bury the service.

The signal that a test is worth running is that you are nervous about it. If everyone is sure it will pass, it is not finding anything new.

11. The metrics that matter for degradation

Classic metrics miss the important signals.

What I watch:

  • queue depth and queue age per bulkhead
  • shed rate per priority band (not just total shed rate)
  • partial response rate per endpoint
  • current mode (green/yellow/red) as a first-class time series
  • adaptive limit value over time
  • retry budget utilization
  • timeout budget consumed by each hop of a representative request

Latency percentiles alone will not tell you the system is about to tip. Queue depth and shed rate will.

12. Design the failure mode before you design the happy path

The reason graceful degradation feels hard is that most services are designed for the green path first and retrofit failure afterward. That almost always produces a system where the happy path is elegant and the failure path is a collection of patched-in defensive code that has never been tested together.

Flip the order.

Before I write a new service, I answer:

  • what does this service do in yellow mode
  • what does it do in red mode
  • which dependencies, if slow, are tolerable and which are fatal
  • what the partial response looks like for every significant endpoint
  • which request classes are P0 and who tags them
  • what the shed signal looks like to clients
  • how the service exits each degraded mode

That list drives the real design. Happy-path code falls out of it almost for free, because once you know how the thing survives, the non-failure case is the easy part.

Circuit breakers are fine. Keep them. Use them for the specific problem they solve, which is cutting off a dead dependency. Do not confuse them with graceful degradation.

Graceful degradation is an architectural commitment to get slower, then simpler, then narrower, and always return something useful before the system falls over. It is load shedding with adaptive concurrency. It is priority queues with explicit contracts. It is bulkheads, honest timeouts, partial responses, and degraded modes that have actually served production traffic.

The rule I keep coming back to: design the failure mode before you design the happy path. Every system that has survived a real incident without waking me up was built that way. Every one that did not, was not.
