Most teams that adopt Change Data Capture discover the real risk the moment their Kafka Connect cluster drops offline at 2 AM. You wire up Debezium to Postgres, stream your updates to Snowflake, and assume you have solved real-time analytics. Then a bad deserialisation kills your connector, the logical replication slot stops acknowledging LSNs, and Postgres obediently hoards Write-Ahead Logs until the disk hits 100% and your primary database shuts down.
A logical replication slot is not a pub/sub topic. It is a distributed lock on your primary database's storage.
When you create a slot using pgoutput or wal2json, you are entering a strict contract. You are telling Postgres: Do not delete any WAL files until my consumer explicitly says it has processed them. If your consumer dies, pauses, or falls behind, Postgres will hold those files forever.
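A minimal sketch of that contract at the SQL level, with an illustrative slot name:

```sql
-- Creating the slot is the moment you sign the contract.
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- From here on, restart_lsn pins WAL on disk. If the consumer never
-- confirms progress, the gap to the current LSN only grows:
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';
```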
## The Postgres 17 False Sense of Security
For years, the biggest operational headache with Postgres CDC was failovers. If your primary node died, the logical replication slot did not fail over to the replica. When the standby was promoted, Debezium would wake up, find no slot, and force you to run a massive, database-crushing initial snapshot to resync.
Postgres 17 finally fixed this. The release introduced logical slot synchronisation (sync_replication_slots), which continuously mirrors your failover-enabled logical slots to your physical standbys.
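A sketch of what enabling it looks like on the standby. The parameter names are real PG17 settings; the values are illustrative, wal_level must already be logical on both nodes, and only slots created with the failover option are synchronised:

```conf
# standby postgresql.conf
sync_replication_slots = on
hot_standby_feedback = on
primary_slot_name = 'standby1_physical'
# the slot-sync worker needs a real database connection, so dbname is required:
primary_conninfo = 'host=primary.internal user=replicator dbname=postgres'
```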
This solves the failover problem. It also creates a terrifying false sense of security.
High availability for slots just means your suicide pact is now highly available. If a downstream consumer breaks and stops advancing the LSN, the primary hoards WAL. Because the slot is synchronised, the standby also hoards WAL. When your primary runs out of disk space and crashes, your failover automation will dutifully promote a standby that is exactly three minutes away from also running out of disk space.
## The Three Ways You Will Crash
In BFSI and fintech environments, database downtime is measured in regulatory incident reports. You cannot let an analytics pipeline take down the ledger.
If you do not explicitly manage slot retention, you will hit one of these three failure modes:
| Failure Mode | Root Cause | Symptom | Time to Outage |
|---|---|---|---|
| The Silent Death | Kafka Connect worker is up, but the specific connector task fails and pauses. | Slot lag grows linearly. Disk fills at the rate of your write throughput. | Hours to Days |
| The Filter Trap | Debezium is filtering out high-volume tables. LSN only advances on matching writes. | No matching writes occur for hours. Postgres retains all WAL generated in the interim. | Days |
| The Transaction Spike | A batch job updates 50 million rows. Network bandwidth to Kafka cannot keep up. | Slot lag spikes instantly. Disk fills before the network buffer drains. | Minutes |
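The Silent Death, at least, is cheap to detect early: a slot with no attached consumer. A sketch of the check:

```sql
-- A slot with no attached consumer is the Silent Death in progress.
SELECT
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;
```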
## The 64GB Kill Switch
Postgres 13 introduced the fix, but shockingly few teams configure it.
You must set max_slot_wal_keep_size. This defines the maximum amount of WAL Postgres will retain for a lagging replication slot before it decides the consumer is dead and invalidates the slot.
```conf
# postgresql.conf
# Set this to roughly 20-30% of your total disk space
max_slot_wal_keep_size = 64GB
```

This configuration forces a hard trade-off. If your slot falls 64GB behind, Postgres will invalidate it. When Debezium finally comes back online, it will crash with `ERROR: requested WAL segment has already been removed`. You will have to trigger a full snapshot to recover the CDC pipeline.
You are trading a 12-hour CDC backfill for your primary database staying online. If you are running a production system, this is the only acceptable trade.
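When the kill switch fires, Postgres marks the slot rather than deleting it outright. A quick sketch to confirm you are in re-snapshot territory:

```sql
-- After invalidation the slot still exists, but wal_status reports 'lost'
-- and no consumer can resume from it.
SELECT slot_name, wal_status
FROM pg_replication_slots
WHERE wal_status = 'lost';
```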
You need to monitor exactly how close you are to this cliff. Alert on this query:
```sql
SELECT
    slot_name,
    plugin,
    active,      -- is a consumer currently attached?
    wal_status,  -- 'lost' means the slot has been invalidated
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 / 1024 AS lag_gb,
    safe_wal_size  -- bytes left before max_slot_wal_keep_size invalidates the slot
FROM pg_replication_slots
WHERE database = current_database();
```

## The Debezium Heartbeat Requirement
Even with the kill switch in place, it can still fire on a perfectly healthy pipeline if you don't configure your consumer correctly.
Debezium only acknowledges a new LSN when it processes an event. If you configure Debezium to only capture a low-traffic users table, and no one signs up for three hours, Debezium processes nothing. Meanwhile, your high-traffic transactions table is generating 50GB of WAL an hour. Postgres will hold all 150GB of that WAL because the slot's LSN hasn't moved.
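For illustration, the trap can be this small a fragment of connector config (table name illustrative); with a filtered publication, nothing outside the include list ever reaches Debezium:

```json
{
  "table.include.list": "public.users",
  "publication.autocreate.mode": "filtered"
}
```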
To fix this, you must force Debezium to generate artificial traffic. Configure a heartbeat action query that writes to a dummy table in the source database. This forces a continuous stream of commits, ensuring the LSN advances even when your target tables are idle.
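The dummy table has to exist before the connector starts. A minimal sketch; the name cdc_heartbeat matches the action query below, the schema is otherwise assumed:

```sql
CREATE TABLE cdc_heartbeat (
    id int PRIMARY KEY,        -- single row, upserted by every heartbeat
    ts timestamptz NOT NULL    -- timestamp of the last heartbeat
);
```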
```json
{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "heartbeat.interval.ms": "10000",
    "heartbeat.action.query": "INSERT INTO cdc_heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts"
  }
}
```

## What I actually deploy
If you are building a tier-zero system where Postgres is the source of truth, the database's survival overrides downstream data freshness.
I run Postgres 16 or 17. I set max_slot_wal_keep_size to exactly 25% of the provisioned disk. I configure Debezium with a 10-second heartbeat query. I route PagerDuty alerts to the data engineering team when slot lag hits 10% of that disk, and to the database infrastructure team when it hits 20%.
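A sketch of one way to drive those alerts, reporting lag as a percentage of the max_slot_wal_keep_size budget; 100% is the kill switch, so with the budget at 25% of disk, the 10%- and 20%-of-disk pages map to 40% and 80% here:

```sql
-- current_setting() returns the text value (e.g. '64GB'); pg_size_bytes()
-- converts it. Meaningless if the setting is -1 (disabled).
SELECT
    slot_name,
    round(
        100.0 * pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
              / pg_size_bytes(current_setting('max_slot_wal_keep_size')),
        1
    ) AS pct_of_kill_switch
FROM pg_replication_slots
WHERE database = current_database();
```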
If the lag hits 25%, Postgres kills the slot, the database survives, and the data engineers spend their Monday running a Debezium snapshot. Never let a downstream analytics pipeline hold your operational datastore hostage.