5 min read · distributed-systems · cockroachdb · database-internals

Why Distributed SQL Consistency is a Statistical Illusion

Strong consistency in CockroachDB and YugabyteDB relies on an uncertainty window tuned for a clock drift your cloud provider frequently exceeds.

Rahul Gupta
Senior Software Engineer

Most teams that adopt distributed SQL for their core ledger discover that "strict serializability" degrades into a statistical probability the moment their cloud provider's NTP daemon stutters. You read the Spanner paper, you deploy CockroachDB or YugabyteDB, and you assume the database handles the physics of time. It doesn't.

Google Spanner achieves external consistency because it relies on TrueTime—a combination of GPS and atomic clocks that physically guarantees a maximum time uncertainty. If you are running CockroachDB or YugabyteDB on AWS or GCP, you do not have TrueTime. You have Hybrid Logical Clocks (HLCs) backed by a software network time protocol (NTP).

When your database's consistency relies on software NTP, you are one noisy hypervisor away from serving stale reads to a financial ledger.

The 500ms safety net

Distributed SQL databases using HLCs combine your server's physical wall-time with a logical counter. To make this work safely, the database forces you to configure a maximum clock offset.
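The core of an HLC fits in a few lines. Here is a minimal sketch of the idea — the names and structure are illustrative, not CockroachDB's or YugabyteDB's actual implementation:

```python
import time

class HLC:
    """Simplified hybrid logical clock: a (wall_time, logical_counter) pair."""

    def __init__(self, now=time.time):
        self.now = now      # injectable physical clock (seconds)
        self.wall = 0.0     # highest wall time observed so far
        self.logical = 0    # tie-breaker when wall time stalls

    def tick(self):
        """Generate a timestamp for a local event or outgoing message."""
        physical = self.now()
        if physical > self.wall:
            self.wall, self.logical = physical, 0
        else:
            self.logical += 1   # physical clock hasn't advanced; count events
        return (self.wall, self.logical)

    def update(self, remote):
        """Merge a timestamp received from another node, preserving causality."""
        physical = self.now()
        r_wall, r_logical = remote
        if physical > max(self.wall, r_wall):
            self.wall, self.logical = physical, 0
        elif r_wall > self.wall:
            self.wall, self.logical = r_wall, r_logical + 1
        elif r_wall == self.wall:
            self.logical = max(self.logical, r_logical) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)
```

Note what the `update` path implies: when a remote timestamp arrives from a node whose wall clock runs ahead, the local HLC is dragged forward with it. The logical counter preserves ordering, but the wall component is only trustworthy within the configured maximum offset — which is exactly where the next section picks up.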

In CockroachDB, this is --max-offset. The default is 500 milliseconds. In YugabyteDB, this is --max_clock_skew_usec. The default is also 500 milliseconds (500,000 microseconds).

Think about what that default implies. The database assumes no two nodes in your cluster will ever disagree on the time by more than half a second. If a node detects its clock has drifted beyond this limit compared to its peers, it deliberately crashes itself to protect data integrity.

But the real danger isn't when the node crashes. The danger is the behaviour inside the "uncertainty window" right before it crashes, or when the underlying OS clock jumps without the database immediately realising it.

How the uncertainty window breaks ledgers

When a transaction reads a row in CockroachDB, it gets a read timestamp. If it encounters a value written by another transaction whose timestamp falls within its uncertainty window (the interval from its read timestamp up to the read timestamp plus the max offset), the database cannot mathematically prove which transaction happened first.

The database handles this by performing an uncertainty restart: it retries the read at a timestamp above the uncertain value. This introduces latency, but it preserves serializability.
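The read-time decision reduces to a three-way classification. This is a simplified model — real MVCC reads also deal with write intents, transaction IDs, and observed timestamps:

```python
def classify_value(value_ts, read_ts, max_offset):
    """How an MVCC read treats a committed value under HLC rules:
       - at or below read_ts: definitely in the past -> visible
       - in (read_ts, read_ts + max_offset]: ambiguous -> restart
       - above read_ts + max_offset: definitely in the future -> skip"""
    if value_ts <= read_ts:
        return "visible"
    if value_ts <= read_ts + max_offset:
        return "uncertainty-restart"
    return "skip"

assert classify_value(9.9, 10.0, 0.5) == "visible"
assert classify_value(10.3, 10.0, 0.5) == "uncertainty-restart"
assert classify_value(10.6, 10.0, 0.5) == "skip"
```

The entire safety argument rests on the middle branch being wide enough: every value that *might* causally precede the read must land in the restart band, never in the skip band.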

Here is the failure mode: If your actual clock drift stealthily exceeds the configured maximum offset, the uncertainty window is no longer wide enough to catch the overlap.

  1. Node A (clock is 600ms ahead) writes a debit to an account. The write's HLC timestamp lands 600ms in Node B's future.
  2. Node B (clock is accurate) reads the account balance.
  3. Node B's uncertainty window extends only 500ms above its read timestamp. Node A's write sits 600ms above it, outside the window, so Node B treats the causally earlier debit as a genuinely later write it is allowed to skip.
  4. Node B serves a stale balance. The customer withdraws funds they do not have.
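Plugging the numbers from this scenario into the read rule makes the hole explicit — again a toy model, not either database's actual code:

```python
MAX_OFFSET = 0.500   # what the database believes about clock skew
ACTUAL_SKEW = 0.600  # what the hypervisor actually did to Node A

def visible_or_uncertain(value_ts, read_ts, max_offset):
    """A safe read must either see a value or flag it as uncertain;
    'skip' means the reader believes the value is from the future."""
    if value_ts <= read_ts:
        return "visible"
    if value_ts <= read_ts + max_offset:
        return "uncertain"
    return "skip"

read_ts = 100.000                  # Node B, accurate clock
debit_ts = read_ts + ACTUAL_SKEW   # Node A's write, stamped 600ms "ahead"

# The causally earlier debit lands outside the 500ms window, so the
# reader silently skips it and serves the pre-debit balance.
assert visible_or_uncertain(debit_ts, read_ts, MAX_OFFSET) == "skip"
```

No error is raised, no restart happens, no node crashes. The read simply returns the wrong answer.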

Cloud NTP is lying to you

You probably run chronyd pointing to the AWS Time Sync Service (169.254.169.123). You run chronyc tracking on your EC2 instances, see an RMS offset of 0.5ms, and assume your 500ms database offset is incredibly conservative.

Shell
$ chronyc tracking
Reference ID    : A9FEA97B (169.254.169.123)
Stratum         : 3
Ref time (UTC)  : Thu Oct 24 14:32:01 2024
System time     : 0.000000041 seconds slow of NTP time
Last offset     : -0.000012345 seconds
RMS offset      : 0.000451230 seconds

This output is a trailing average. It hides the spikes.
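A toy exponentially weighted estimator (not chrony's actual algorithm) shows how quickly even a 700ms spike washes out of a trailing RMS figure:

```python
def rms_trace(offsets, alpha=0.125):
    """Exponentially weighted RMS over a stream of offset samples (seconds)."""
    mean_sq, trace = 0.0, []
    for off in offsets:
        mean_sq = (1 - alpha) * mean_sq + alpha * off * off
        trace.append(mean_sq ** 0.5)
    return trace

baseline = [0.0005] * 300                 # 0.5ms jitter: looks great
samples = baseline + [0.700] + baseline   # one 700ms hypervisor pause
trace = rms_trace(samples)

print(f"RMS at the spike:      {trace[300] * 1000:.1f} ms")
print(f"RMS 200 samples later: {trace[500] * 1000:.3f} ms")
```

Within a couple of hundred samples the reported RMS is back under a millisecond, so unless your monitoring catches the exact polling interval of the spike, the dashboard says everything is fine.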

Software NTP over a virtualised network stack is vulnerable to hypervisor pauses, vCPU starvation, and noisy neighbours. A heavy garbage collection pause on the hypervisor, a live migration of your VM, or an ARP storm on the VPC can freeze the guest OS clock.

When the VM wakes up, its clock is suddenly 700ms behind. chronyd will eventually notice and step or slew the clock, but during that reconciliation period, your database is serving traffic with an invalid HLC. The strict serializability guarantee is gone.

The PTP reality check

AWS recently started pushing Precision Time Protocol (PTP) hardware clocks into their EC2 instances via the Elastic Network Adapter (ENA). They didn't build this just to pad their feature list. They built it because financial and ad-tech customers running distributed databases were silently breaking consistency under heavy load.

PTP bypasses the hypervisor's software networking stack entirely. It provides hardware timestamping directly from the Nitro card to the guest OS.

If you are running a Spanner-clone on AWS for a tier-1 financial workload and you are still using standard NTP, your consistency guarantees are built on sand.

The contenders for time sync

Mechanism              | Accuracy                   | Vulnerability                          | Database Fit
TrueTime (Spanner)     | < 1ms (guaranteed bounds)  | GPS spoofing, atomic clock degradation | Perfect. The database waits out the exact bound.
Hardware PTP (AWS ENA) | < 1ms (typical)            | Hardware failure, driver bugs          | Excellent. Safe for aggressive HLC offsets (e.g., 50ms).
Software NTP (chronyd) | 1ms - 1000ms+ (variable)   | Hypervisor pauses, network congestion  | Risky. Requires massive uncertainty windows (500ms+).

How to configure this in production

If you are moving to PTP, you need to reconfigure both the OS and the database.

First, you switch from chronyd to the linuxptp tools: ptp4l disciplines the NIC's hardware clock (PHC) over PTP, and phc2sys syncs the OS system clock to that hardware clock.

INI
# /etc/ptp4l.conf
[global]
# Never act as a time source, only receive
slaveOnly 1
# End-to-end delay measurement
delay_mechanism E2E
# Raw Ethernet frames rather than UDP
network_transport L2
# Use NIC hardware timestamps, bypassing the software stack
time_stamping hardware
# Step (rather than slew) corrections larger than 20 microseconds
step_threshold 0.00002

Once PTP is delivering verified microsecond accuracy, leaving your database's max offset at 500ms is actively harmful. A 500ms offset means uncertainty restarts can stall your transactions for up to half a second during contention.

You need to tighten the window.

For CockroachDB (v23.2+), drop the offset when starting the node:

Shell
cockroach start --max-offset=50ms ...

For YugabyteDB (v2.20+), set the tserver flag:

Shell
yb-tserver --max_clock_skew_usec=50000 ...

What I actually do

If I am building a ledger on CockroachDB or YugabyteDB today, I do not accept the default 500ms offset.

I mandate AWS instances that support ENA hardware timestamping. I deploy ptp4l. I drop the database --max-offset to 50ms.

By dropping the offset to 50ms, I am making a deliberate trade-off: I am choosing availability risk over consistency risk. If the clock drifts by 51ms, the database node will crash itself. I want it to crash. I would rather lose a node and take a brief latency hit during leader election than explain a silent double-spend to an auditor because a hypervisor paused for a second.
