The cluster we almost provisioned
A nightly job was getting slow. It read about 35 gigabytes of event Parquet from object storage, joined it against a customer table, and rolled everything up by account and day before writing the result back for the dashboards. On a single Python worker with pandas it took 50 minutes, and it fell over roughly once a week when a big day pushed it past the box's memory.
So someone put "move analytics to Spark" on the roadmap. We sized an EMR cluster. We talked about who would own it, how we would test it, what the Airflow operator looked like. Two engineers were about to spend a quarter building a distributed data platform for a job that processes less data than my laptop's SSD holds.
I asked for a week to try something dumber first. I pointed DuckDB at the same Parquet files. The job finished in 90 seconds on the box we already had.
You probably don't have big data
The phrase "big data" got fixed in everyone's head around 2012, when a single server had maybe 64 GB of RAM and spinning disks. A 35 GB join genuinely was a cluster problem then. You needed many machines because no single machine could hold the working set.
That constraint quietly died. A current cloud box will give you 256 GB of RAM and several gigabytes per second of NVMe bandwidth without anyone signing off on a capital expense. Datasets grew too. For most companies, though, the analytical working set never kept pace with what one machine can now hold. Your fact table is tens of gigabytes. Your dimension tables are smaller. The query that feels heavy is heavy because the tool is wasteful, not because the data is large.
Run the numbers on your own warehouse some time. Pull the row counts and average widths of the tables your dashboards actually touch. A surprising amount of big-data infrastructure exists to shuffle a few tens of gigabytes around a network that a single CPU could have scanned in the time it took to schedule the first task.
Why one process is suddenly enough
DuckDB is an analytical database that runs inside your process. No server, no cluster, no JVM warming up. You import it the way you import a JSON parser, and it executes SQL against files on disk or in object storage.
Two design choices make it fast. It stores and processes data by column instead of by row, so a query that touches three of forty columns reads only those three. Vectorized execution does the rest, pushing thousands of values through tight loops that keep the CPU cache and SIMD units busy instead of interpreting one row at a time. Those are the same tricks the expensive warehouses use. DuckDB does them on the machine you are already paying for.
It reads Parquet, CSV, and JSON directly, and it can query a live Postgres table over the wire. There is no load step. You point a SELECT at the file and the file is the table.
-- Query 35 GB of Parquet in object storage, no load, no cluster
SELECT account_id,
date_trunc('day', event_ts) AS day,
count(*) AS events,
sum(amount) AS revenue
FROM read_parquet('s3://events/2026/*/*.parquet')
GROUP BY 1, 2;The pipeline it replaced
The old job was a pandas script. It loaded every Parquet file into one giant in-memory frame, merged that against a customer export, and grouped. Most of its runtime went on holding the whole dataset in memory at once and on Python's per-row overhead. When a day's data spiked, the load blew past the memory limit and the pod got killed.
The replacement is a SQL file. DuckDB streams the Parquet from object storage, joins it against the customer table I pull from the Postgres replica, and writes an aggregated Parquet result back. It spills to disk on its own when a step does not fit in memory, so the job that used to die on big days now just runs a little slower instead.
-- customers from the Postgres replica, events from object storage,
-- joined in one query on one node
ATTACH 'postgres://reader@replica/app' AS pg (TYPE postgres, READ_ONLY);
COPY (
SELECT e.account_id, c.plan,
date_trunc('day', e.event_ts) AS day,
count(*) AS events, sum(e.amount) AS revenue
FROM read_parquet('s3://events/2026/*/*.parquet') e
JOIN pg.public.customers c USING (account_id)
GROUP BY 1, 2, 3
) TO 's3://rollups/daily.parquet' (FORMAT parquet);The numbers that ended the debate
I ran the old script, the DuckDB version, and a small Spark cluster against the same 35 GB so the comparison was honest. The pandas job took 50 minutes when it survived. A three-node Spark cluster managed about four minutes once the executors spun up, and keeping it warm cost real money. DuckDB on the existing 16-core box finished in 90 seconds.
The Spark number is not an insult to Spark. It is a tax. A good part of those four minutes was scheduling, JVM startup, and shuffling data across a network that a single machine never had to touch. For 35 GB, the coordination cost more than the work did.
Where the single node loses
This stops being true at some point, and pretending otherwise would be the same mistake pointed the other way. When the data you must scan in one query runs genuinely into the terabytes, you want many disks and many CPUs reading in parallel, and a cluster earns its keep. The crossover sits higher than most people guess, often well past where they reach for Spark, but it is real.
A single box also gives you a single failure domain. If the machine dies mid-job, the job dies with it. For a nightly batch that just reruns, who cares. For something that must make continuous progress across hours, the retry story a distributed engine gives you starts to matter.
DuckDB is not a warehouse
The trap is to fall in love with the 90 seconds and try to make DuckDB the shared serving layer for the whole company. It does not want that job. One process writes to a DuckDB file at a time. It works best as an embedded engine sitting close to a single workload. A multi-tenant server fielding a hundred concurrent analysts is a different tool.
Keep it where it shines: batch transforms, a local copy of production analytics on a laptop, an embedded query engine inside an app, the heavy lifting behind one dashboard refresh. When you need many writers and many concurrent readers against shared state, that is what an actual warehouse is for. MotherDuck exists if you want the DuckDB engine with a managed server in front of it.
When you should still reach for the cluster
Provision the distributed thing when the facts call for it. Terabyte-scale scans per query qualify. So does a real need for many machines writing concurrently, or a streaming workload that can never pause to rerun. If three separate teams hammer the same tables all day with unpredictable queries, a shared warehouse with its own resource governor will save you grief.
None of those describe a 35 GB nightly rollup. Most pipelines I see are that rollup, dressed up in cluster clothes because the reference architecture diagram showed a cluster.
What I actually run now
My rule is a size test. Under a few hundred gigabytes per query, I start with DuckDB on the biggest single box that makes sense, and I do not apologize for it. The pipeline is a SQL file in version control, it runs in a plain container on a schedule, and there is no cluster to patch, no executors to size, no Airflow operator that only one person understands.
The Spark conversation is still sitting on a roadmap somewhere. We never had it. That 35 GB job has run every night for months in under two minutes, on a box that was already there, and the quarter we would have spent building a data platform went into the product instead.
Comments (0)