Introducing “Distributed Hash Join” for Faster Queries
CrateDB is revolutionizing the way businesses handle complex queries with its Distributed Hash Join technology—delivering unmatched speed and scalability for large-scale analytics! Already successfully using distributed inner joins but where other joins were slower to execute, the v5.10 release further enhances the technology for outer joins.
Why Does This Matter? In today’s data-driven world, businesses need to analyze massive datasets in real time without bottlenecks. Whether you’re working with time-series, tabular, JSON, geospatial, full-text or vector data, CrateDB offers a real-time unified data layer for analytics, search, and AI, where speed of complex joins at scale is essential to ensure that queries remain fast, even as data volume grows:
- Distributed Hash Inner Joins: These are used when only matching records from both tables are needed. Since CrateDB distributes the workload across nodes, inner joins can be executed in parallel, ensuring fast analytics even for massive datasets.
Example: A smart grid analytics platform joining live sensor readings with device metadata to detect anomalies in power consumption. - Distributed Hash Outer Joins: These help to preserve unmatched records, which is crucial in analytics when missing or incomplete data needs to be retained for further insights.
Example: A telecommunications provider joining network performance logs with customer complaints, ensuring that even network zones with no reported issues are still analyzed.
By handling both types of joins at scale and with parallel execution, CrateDB ensures that real-time analytics queries remain fast, even as data volume grows.
Reducing storage size by 50% for more aggressive cost reduction
Simultaneously, CrateDB has made substantial improvements on reducing the volume size. Instead of storing data twice for indexing, in the table itself and as an index used for querying, the original record has been rebuilt from the indexes which eliminates the need to store the original source record at all! This new table storage format is automatically applied to new tables and partitions and our initial benchmarks show up to 50% size reduction in the data files on disk! A separate blog post covers in depth how CrateDB made it possible. This new format is automatically adopted by all tables and partitions created by clusters running CrateDB v5.10 where existing users can just sit back and relax as their partitions will rotate into the new format once old ones are deleted after their retention period runs out!
Prior releases digest
The current release v5.10 builds upon the foundations laid by prior releases. Here are some highlighted relevant features:
- Nested Join performance optimization (v5.6) - Optimizing nested Join performance in SQL is crucial for improving query execution speed, especially when dealing with large datasets. There are some overall key strategies doing this, e.g., proper indexing, minimizing the number of Joins, use of Hash Joins instead of Nested Loops. Most of these strategies are already implemented in CrateDB but in this release we added more optimizer rules pushing down filters and limit definition also below nested Left/Right Outer Join operation. Other rules will then pick them up and push such filters or limits to the most inner plan possible, improving the performance.
- Memory consumption reduction (v5.7) – Unwanted memory (HEAP) consumption when issuing the Analyze statement was eliminated by improving the implementation of the statistics collection (leading to up to 70% memory consumption reduction and ~12% performance increase).
- Two Join performance improvements (v5.8) – The first one relates to constant join conditions containing a constant value. The optimization here is about filtering out as many rows as possible directly at the collect operation resulting in fewer rows needed to be processed in the join operations. This minimizes the exponential row growth and excessive computation as most filters can utilize an Index to improve performance. The second optimization relates to the Lookup-Join implementation that was in release v5.7. This optimization will now also work with more complex queries e.g., when sub-queries or Inner-Equi-Joins are used.
- topk() and Lucene query optimization (v5.9) – By adding the topk() function CrateDB can quickly retrieve the top K elements based on frequency, relevance, or other ranking criteria. This optimization reduces computation by fetching only the top K results instead of sorting the entire dataset, aggregates values efficiently without scanning all rows, and is more memory efficient as it uses probabilistic algorithms to estimate top results quickly. The ApacheLucene is a full-text search engine library and even though Lucene does not natively support Is Distinct From, it allows CrateDB using Lucene indexes (e.g., PostgreSQL with
pg_trgm
) to optimize queries. It does that by filtering on nullable columns ensuringNulls
are handled efficiently and by avoiding full table scans when Nulls exist in joined tables. It also enables smarter query execution using the Lucene-backed indexes.
Conclusion
CrateDB v5.10 is a game-changer for businesses that demand high-speed, scalable analytics on massive datasets. With the introduction of Distributed Hash Joins, complex queries execute faster than ever, ensuring real-time insights even as data volumes grow. The new storage format drastically reduces disk usage—cutting storage requirements by up to 50%, leading to lower infrastructure costs without compromising performance. These release features together with the prior work done by CrateDB to add ease of use and to suppress bottlenecks makes CrateDB the ultimate unified data layer for analytics, search, and AI. Whether optimizing performance, reducing costs, or streamlining workflows, CrateDB v5.10 delivers the speed, efficiency, and scalability businesses need to stay ahead.