When you design a table for high-throughput data, you face a question that has defined database architecture for decades: optimize for writes or optimize for reads. If you optimize for writes, your ingestion pipeline stays fast but your analytical queries slow to a crawl on large datasets. If you optimize for reads, your aggregations run in milliseconds but every insert becomes expensive.
CrateDB addresses that trade-off through a hybrid storage architecture. Data arriving in CrateDB is simultaneously stored in both row-based and columnar formats on the same table: no ETL pipeline, no separate system, no schema redesign required. Ingestion is moderately more expensive than a pure row store. The payoff is that point lookups and column-selective aggregations are both served from the optimal storage format immediately, without a batch step in between.
What Is a Columnar Database?
A columnar database stores each column of a table separately on disk rather than storing complete rows together. That sounds like an implementation detail. The performance consequence is significant.
Consider a sensor data table with 20 columns: device_id, timestamp, temperature, humidity, pressure, voltage, and 14 more. In a row-based system, every row is written and stored as a complete unit. When you run a query that reads only one column (temperature), the database still loads the full row for every record it scans. At 1 billion rows, that means reading the full width of 1 billion records to extract one column of data.
In a columnar database, the temperature column is stored as a contiguous block on disk. That same query reads 1 column out of 20. The I/O is reduced by roughly 95%.
-- Average temperature across 1 billion sensor readings — last 30 days SELECT AVG(temperature) FROM sensor_data WHERE event_time > NOW() - INTERVAL '30 days';
On a row-based system, this query touches every column in every row scanned. On a columnar system, it reads only the temperature and event_time columns. That difference in read volume is the foundation of why columnar storage makes analytics faster.
Why Columnar Storage Also Compresses Better
The performance advantage extends beyond I/O reduction. When similar values are stored together, all temperature readings in sequence, all device_id values in sequence, standard compression algorithms work far more efficiently.
A column of floating-point sensor readings that vary between 18.0 and 24.5 compresses much tighter than a row that mixes a 64-bit timestamp, a UUID string, three floats, an integer, and a boolean in sequence. Columnar storage lets databases apply type-aware compression to each column independently.
Smaller data on disk means fewer reads per query, which means faster results at lower infrastructure cost as data volume grows.
Where Row-Based Storage Still Wins
Columnar storage is not the right answer for every workload. For point writes, inserting one complete record at a time and reading one complete record by its primary key, row storage is faster. The complete record is written and read as a unit, which is exactly the access pattern row storage is built for.
This is why traditional relational databases use row storage: they were built for OLTP workloads where an application creates a user record, reads it, updates it. The row is the unit of work.
The problem emerges when those same databases handle analytical queries: aggregate sensor readings across millions of devices, compute rolling averages over a month of events, group records by a high-cardinality dimension. Those queries access a few columns across many rows. This is the opposite of what row storage optimizes for.
The Trade-Off That Used to Force a Decision
For most of the history of analytical database systems, you had to choose. Use a row-based database and accept slow analytical queries. Or use a columnar database and accept slower writes.
Many teams solved this by running two separate systems: a row-based operational database for writes and a columnar data warehouse for analytics, with a batch ETL pipeline moving data between them. Freshness lag ran to hours or days. Operational overhead meant two ingestion pipelines, two sets of indexes, and data that was never quite in sync across systems.
This architecture is documented and common. It is also unnecessary for operational analytics workloads where query freshness is measured in seconds, not hours. For a detailed breakdown of where traditional columnar systems specifically fail under continuous ingestion, see why traditional columnar databases struggle with real-time analytics.
How CrateDB Combines Both Storage Models
CrateDB uses a hybrid storage architecture that applies both approaches to the same dataset. When new data arrives, CrateDB writes it simultaneously to both a row-based store and a columnar store, and indexes it, all on the same table.
For an aggregation query like SELECT AVG(temperature) FROM sensor_data, the query planner uses columnar storage: it reads only the relevant column and applies vectorized batch processing across the stored column values. For a point lookup by primary key, it uses the row-oriented index.
-- Point lookup — uses row-oriented access path SELECT * FROM sensor_data WHERE device_id = 'sensor-0042' ORDER BY event_time DESC LIMIT 1;
The application sends SQL and gets results. The storage decision is internal to CrateDB.
The full architecture is documented on the columnar and row-based storage page. The columnar database guide covers the definition and use-case context for teams in earlier stages of evaluation.
The Practical Outcome: One Table for Both Workloads
The architectural implication is direct. You do not need to pre-aggregate data before you can query it at sub-second latency. You do not need to choose a schema that prioritizes write speed at the expense of read speed. You do not need a batch ETL step to move data from your operational store to your analytical store before a dashboard can display fresh results.
Data lands in CrateDB and is queryable immediately. An aggregation query uses columnar access paths. A device-lookup query uses row-based access paths. Both run against the same table, using the same SQL interface.
At scale, this reduces operational overhead directly. One system instead of two means one ingestion pipeline to monitor, one schema to maintain, one set of query patterns to optimize. The distributed execution layer, described in the distributed database architecture overview, handles query distribution across shards in parallel. This is how CrateDB achieves sub-second aggregation latency on datasets spanning billions of rows. For query engine mechanics, see the distributed query engine documentation.
What Columnar Storage Does Not Change
One clarification worth stating directly: columnar storage is a storage and I/O optimization, not a universal performance multiplier. A query that requests every column from every row in a large table still has to read all the data. The gains appear when queries are selective about columns accessed, which is the typical shape of analytical aggregations.
CrateDB is also not a data warehouse. It is not designed for cold storage tiering, long-term archival cost optimization, or the batch BI query patterns that run for minutes against months of historical data. It handles the operational analytics layer: data that is live, applications querying it now, and latency measured in milliseconds. For teams that need both, CrateDB sits between the ingestion pipeline and the warehouse, serving live workloads before data reaches long-term storage.
Run the Query on Your Own Data
The fastest way to evaluate how columnar storage performs for your workload is to run it. Start with cratedb.com/explore: bring your schema, run the SQL, and see the query latency on data arriving in real time. No pre-aggregation required upfront.
If you are in deployment mode, start free on CrateDB Cloud or run CrateDB with Docker and connect from any PostgreSQL-compatible client.
Frequently Asked Questions
A columnar database stores each column separately on disk rather than storing complete rows together. When a query reads one or two columns, it skips the rest entirely. This reduces I/O proportionally to the number of columns skipped: a 20-column table scanned for one column reads roughly 5% of what row storage would read.
Row storage writes and reads complete records as a unit, efficient for workloads that insert or retrieve individual rows by primary key. Columnar storage groups all values for a single column together, efficient for analytical aggregations that scan one or a few columns across large numbers of rows.
Both, in a hybrid architecture. New data arrives row-wise for fast ingestion throughput, then data blocks are organized and stored columnar on disk for analytical reads. The query planner selects the optimal access path automatically: aggregation queries use columnar paths, point lookups use row-oriented indexes.
Use columnar storage for analytical workloads: aggregations, GROUP BY operations, and scans across millions of rows accessing a few columns. Use row storage for point-lookup workloads: single-row inserts, primary-key lookups, and frequent record updates.
CrateDB's hybrid model applies each access pattern to the same dataset automatically without requiring a separate analytical store.
No. CrateDB handles the operational analytics layer: live data, millisecond latency, queries from running applications and dashboards. A data warehouse handles batch-oriented historical analysis and long-term archival. CrateDB sits between the ingestion pipeline and the warehouse, serving live workloads before data reaches long-term storage.