4 min read · apache-iceberg · data-engineering · flink

Why 10-Second Commits Will Break Your Apache Iceberg Tables

Streaming micro-batches into Iceberg creates a metadata swamp that destroys query performance. Here is how to decouple ingestion from compaction.

Rahul Gupta
Senior Software Engineer

Databricks' acquisition of Tabular and the GA of Snowflake's Polaris catalog have cemented a new default architecture: stream everything directly into Apache Iceberg. You wire up Flink to Kafka, set your checkpoint interval to 10 seconds, and watch the data land. Three weeks later, a simple SELECT COUNT(*) on a single partition takes 45 seconds just to plan the query.

Iceberg looks great in the docs. Then you put it behind real load, and the metadata kills you.

If you are building a real-time pipeline for a fintech ledger or a risk dashboard, treating Iceberg like a high-throughput message broker is a trap. High-frequency commits create a metadata swamp that will degrade your read performance faster than the data volume itself.

The anatomy of a metadata swamp

Iceberg tracks table state as a tree: a root metadata file points to manifest lists (one per snapshot), which point to manifest files, which in turn point to the data files. Every commit creates a new snapshot with its own manifest list.

When you configure Flink 1.18 to checkpoint and commit to Iceberg every 10 seconds, you are making 6 commits a minute. That is 360 an hour. 8,640 a day. For one table. If your Flink job has parallel subtasks, each commit writes multiple tiny Parquet files.

Query engines like Trino or StarRocks have to parse this tree to plan the query. They need to read the manifest lists to find the manifests, and read the manifests to find the data files, applying partition pruning along the way. When a table has 250,000 tiny Parquet files and 40,000 manifest files, the planner chokes. I have seen Trino coordinators OOM (Out of Memory) entirely just trying to figure out which files to read for an end-of-day regulatory report.
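You can watch the swamp grow through Iceberg's metadata tables. A minimal sketch in Spark SQL, using this post's example table (any engine that exposes the files and snapshots metadata tables works):

SQL
-- Hundreds of thousands of rows with a tiny average size = swamp
SELECT count(*) AS data_files, avg(file_size_in_bytes) AS avg_bytes
FROM fintech_prod.ledger.transactions.files;

-- Every 10-second commit adds one of these
SELECT count(*) AS snapshots
FROM fintech_prod.ledger.transactions.snapshots;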

You are trading write latency for read latency, but the exchange rate is terrible.

The naive fix: Just run compaction

The moment teams notice the query degradation, they set up an Airflow DAG to run Spark's rewrite_data_files procedure every hour.

SQL
CALL catalog.system.rewrite_data_files(
  table => 'fintech_prod.ledger.transactions',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '134217728', -- 128 MB target files
    'min-file-size-bytes', '20971520'      -- rewrite anything smaller than 20 MB
  )
);

If you are running batch jobs, this works. If you are streaming at 10-second intervals, this script will fail constantly.

Iceberg handles concurrent writes via Optimistic Concurrency Control (OCC). When your Spark compaction job starts, it reads the current snapshot. It spends 15 minutes rewriting thousands of small files into optimal 128MB chunks. When it tries to commit, it discovers Flink has already committed 90 new snapshots.

Spark retries. Flink keeps writing. The compaction job eventually throws a CommitFailedException because it cannot resolve the conflicts fast enough. Your small files keep piling up.
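Raising Iceberg's commit retry budget buys the rewrite more attempts, but under a 10-second cadence it mostly delays the failure. A sketch using standard Iceberg table properties (the values here are assumptions, not recommendations):

SQL
ALTER TABLE fintech_prod.ledger.transactions SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',   -- default is 4
  'commit.retry.min-wait-ms' = '1000'  -- back off longer between re-validation attempts
);

Each retry has to re-validate against every snapshot Flink committed in the meantime, so this knob treats the symptom, not the architecture.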

Decoupling ingestion from the read path

The root cause is architectural. You are running your compaction plane on the exact same timeline as your ingestion plane, and exposing the intermediate mess to your read plane.

You need to isolate the high-frequency writes. As of Iceberg 1.2 (and highly stable in 1.5+), the cleanest way to do this is using the Branching and Tagging features.

Instead of writing your Flink stream directly to the main branch, you write it to a dedicated raw_stream branch. Your query engines continue to serve ad-hoc queries from main. You run your compaction jobs against raw_stream, and then fast-forward main only when the files are right-sized.
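The branch has to exist before anything targets it. With the Iceberg Spark SQL extensions enabled, a minimal sketch (the 7-day retention window is an assumption; size it to your rollback needs):

SQL
ALTER TABLE fintech_prod.ledger.transactions
CREATE BRANCH raw_stream
RETAIN 7 DAYS;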

1. Point Flink at the branch

You tell Flink to target a specific branch in the table properties.

SQL
CREATE TABLE iceberg_transactions (
  transaction_id STRING,
  account_id STRING,
  amount DECIMAL(10, 2),
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'polaris',
  'catalog-type' = 'rest',
  'uri' = 'https://...',
  'branch' = 'raw_stream' -- Isolate the high-frequency commits
);
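If you would rather not hard-code the branch into the table definition, the Flink connector also accepts it as a per-statement hint. A sketch, assuming a kafka_transactions source table already exists in your Flink catalog:

SQL
INSERT INTO iceberg_transactions /*+ OPTIONS('branch' = 'raw_stream') */
SELECT transaction_id, account_id, amount, event_time
FROM kafka_transactions;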

2. Compact and fast-forward

Your compaction job now operates on raw_stream. Because main isn't moving, you don't break downstream readers. Once the compaction completes, you update the main branch reference to point to the newly compacted snapshot.

SQL
-- Run compaction on the stream branch
CALL catalog.system.rewrite_data_files(
  table => 'fintech_prod.ledger.transactions@raw_stream'
);

-- Fast-forward main to expose the clean data
CALL catalog.system.fast_forward(
  table => 'fintech_prod.ledger.transactions',
  branch => 'main',
  to => 'raw_stream'
);
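One caveat: rewrite_data_files writes new, right-sized files, but the thousands of tiny originals stay in object storage until the snapshots referencing them expire. After fast-forwarding, expire old snapshots; a sketch with assumed values for the cutoff and retention:

SQL
CALL catalog.system.expire_snapshots(
  table => 'fintech_prod.ledger.transactions',
  older_than => TIMESTAMP '2024-06-01 00:00:00', -- drop snapshots before the cutoff
  retain_last => 100                             -- keep a rollback window
);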

This is Write-Audit-Publish (WAP) applied to streaming infrastructure. It hides the operational debt of micro-batching from your users.

Comparing the architectures

| Strategy | Query Performance | Compaction Conflict Rate | Operational Overhead |
| --- | --- | --- | --- |
| Direct Stream (10s) | Terrible. Metadata swamps planners. | N/A (no compaction) | Low initially, fatal later. |
| Stream + Async Rewrite | Unpredictable. | High. Constant OCC failures. | High. Endless DAG retries. |
| Branch-and-Merge | Excellent. Readers only see optimized files. | Low. Compaction runs isolated. | Medium. Requires branch management. |

What I actually build for BFSI workloads

If a trading desk or a fraud detection service tells me they need sub-second or 5-second latency, I do not use Iceberg. I route that data to Apache Kafka for stream processing, or to Apache Pinot if they need OLAP queries on the live feed. Iceberg is a table format on object storage, not a message bus.

If a risk team needs 1-minute to 5-minute latency for dashboards, Iceberg is viable, but I strictly enforce the branch-and-merge pattern.

I set Flink checkpoints to 60 seconds (not 10). I write to a raw_stream branch. I run a dedicated Spark Structured Streaming job or a continuous compaction service that compacts raw_stream and fast-forwards main every 5 minutes.
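The checkpoint cadence is a one-line change in the Flink SQL client; a minimal sketch:

SQL
-- 60-second checkpoints cap the table at one commit per minute per job
SET 'execution.checkpointing.interval' = '60s';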

Never commit to your main branch faster than your query engines can parse the metadata. Never let ingestion dictate your storage layout. Decouple them, or you will spend your next quarter debugging query planner timeouts.
