
How CrateDB Compares to Rockset (and Elasticsearch/OpenSearch) for Streaming Ingest

CrateDB is an ideal replacement for Rockset because the two overlap in their features for real-time analytics and search on streaming data, which is mainly achieved by indexing all attributes (known as the converged index in Rockset). For more information, please look at the detailed feature comparison.

After working on a couple of migrations, we found that customers frequently ask how ingest performance compares between Rockset and CrateDB. As the customer-specific tests cannot be shared publicly, we decided to implement the Rockbench streaming ingest benchmark for CrateDB. Rockbench was originally designed to compare Rockset with Elasticsearch.

The benchmark evaluates ingestion performance on throughput and data latency. Throughput measures the rate at which data is processed, impacting the database's ability to efficiently support high-velocity data streams. Data latency, on the other hand, refers to the amount of time it takes to ingest and index the data and make it available for querying, affecting the ability of a database to provide up-to-date results. We examine latency at the 95th and 99th percentile, given that both databases are used for production applications and require predictable performance. 
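To make the two metrics concrete, here is a minimal Python sketch of how throughput and data latency percentiles could be derived from per-document timestamps. It is not Rockbench's code: the function names, the nearest-rank percentile method, and the sample timestamps are all assumptions chosen for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def data_latencies_ms(events):
    """events: (generated_at_ms, queryable_at_ms) pairs, one per document.
    Data latency is the time between a document being generated and the
    moment it becomes visible to queries."""
    return [queryable - generated for generated, queryable in events]


def throughput_mb_per_sec(doc_count, doc_size_bytes, elapsed_seconds):
    """Ingest rate in MB/s, using 1 MB = 10**6 bytes."""
    return doc_count * doc_size_bytes / elapsed_seconds / 1_000_000


# Hypothetical sample timestamps in milliseconds (not benchmark data).
events = [(0, 200), (1000, 1300), (2000, 2100), (3000, 3900), (4000, 4250)]
latencies = data_latencies_ms(events)

print("p50 latency (ms):", percentile(latencies, 50))
print("p95 latency (ms):", percentile(latencies, 95))
# 12,000 documents of 1 kB each, ingested within one second:
print("throughput (MB/s):", throughput_mb_per_sec(12_000, 1_000, 1.0))
```

Percentiles high in the distribution are dominated by the slowest documents, which is why they say more about predictability than an average would.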

The results are impressive:

  • CrateDB outperforms Rockset on the same hardware while saving between 20% and 60% on costs.
  • CrateDB achieves 6-9x lower latencies than Rockset for streaming ingest. In addition, as volumes increase, latency increases linearly in Rockset while remaining mostly flat in CrateDB.

Rockbench was designed to compare Rockset with Elasticsearch, so the results outlined below can also be read as a comparison between Elasticsearch/OpenSearch and CrateDB. Please see the section "Comparing Results between Rockset, Elasticsearch, and CrateDB".

The benchmarking framework is explained in detail on the Rockset blog, so this post focuses on comparing the results.

System setup

We set up a CrateDB cluster that matches the overall hardware specification outlined by Rockset. As CrateDB is a distributed system that scales linearly when nodes are added, we decided to scale horizontally rather than overprovision hardware that would not be fully utilized. We used a setup based on CR4 instances of CrateDB Cloud, which results in the configuration shown in the table. Please also note that the CrateDB instances have only about 50% of the RAM allocated to the Rockset instances.

System    Configuration   Allocated compute   Allocated RAM
Rockset   2XLarge         64 vCPU             512 GB
CrateDB   4 Nodes CR4     64 vCPU             220 GB
Rockset   4XLarge         128 vCPU            1,024 GB
CrateDB   8 Nodes CR4     128 vCPU            440 GB


The database was seeded with 1 billion documents. Each document is 1 kB in size, which results in a total data set of 1 TB. No measurements were taken while seeding the database. As Rockset also points out, this seeding is a must, because many databases consume much less CPU on small data sets and run into swapping or long-running reindexing processes only with larger ones. As in the Rockset setup, we started the benchmark after the seeding phase and began measuring as soon as the data latency values stabilized. We also recorded the p50 and p95 latencies.

Let us also compare this setup from an investment perspective. Both Rockset and CrateDB price compute and storage separately. All prices are compared on AWS, us-east-1 (N. Virginia).

We compared costs for two scenarios for each instance size: non-HA and HA. Because many customers use Rockset not just as an analytical database but build their whole business on top of it, high availability must be ensured. The typical Rockset setup requires at least two virtual instances with parallel ingest to achieve zero-downtime failover and maintenance. As CrateDB is already set up as a multi-node cluster, we can increase the replication factor to one, which ensures that a single node can fail without any impact on the availability of the overall cluster. Only the storage size needs to be increased to provide sufficient disk space. Availability can be raised further by increasing the replication factor.
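The storage side of this trade-off is simple arithmetic, sketched below in Python. The helper names are hypothetical, and the model is deliberately simplified: it assumes every replica is a full copy placed on a different node and ignores compression and index overhead, so actual disk usage will differ. The parameter name mirrors CrateDB's `number_of_replicas` table setting.

```python
def required_storage_tb(primary_data_tb: float, number_of_replicas: int) -> float:
    """Storage for the primary copy plus all replicas.
    Simplification: ignores compression and index overhead."""
    if number_of_replicas < 0:
        raise ValueError("number_of_replicas must be >= 0")
    return primary_data_tb * (1 + number_of_replicas)


def tolerated_node_failures(number_of_replicas: int, node_count: int) -> int:
    """Nodes that can fail while at least one copy of every shard survives,
    assuming all copies of a shard live on different nodes."""
    if node_count < 1:
        raise ValueError("node_count must be >= 1")
    return min(number_of_replicas, node_count - 1)


SEEDED_DATA_TB = 1.0  # the 1 TB data set used to seed the benchmark

for replicas in (0, 1):
    print(
        f"replicas={replicas}: "
        f"{required_storage_tb(SEEDED_DATA_TB, replicas)} TB storage, "
        f"{tolerated_node_failures(replicas, node_count=4)} node failure(s) tolerated"
    )
```

Under these assumptions, moving from no replica to one replica doubles the storage requirement while the compute footprint stays the same.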

As the scenarios show, CrateDB is about 25% more cost-effective than Rockset in the non-HA scenario and about 60% more cost-effective in a true HA setup. This holds for both the 2XLarge and the 4XLarge virtual instance sizes. Even the HA scenario in CrateDB is more cost-effective than the single-instance (non-HA) scenario in Rockset.

Comparison of the 2XLarge scenario between Rockset and CrateDB, Non-HA and HA

Comparison of the 4XLarge scenario between Rockset and CrateDB, Non-HA and HA

Results Comparison

The results are summarized in the tables and diagrams below for batch sizes of 50 and 500 in the 2XLarge and 4XLarge scenarios. Please note that the ingest performance of CrateDB is measured including the additional replica for high availability.

The key results are: 

  • CrateDB is on average 6-9x faster than Rockset for both batch sizes, while offering true high availability.
  • CrateDB keeps latency stable for streaming ingest, even under heavy throughput above 12 MB/sec for 2XLarge and above 16 MB/sec for 4XLarge, for both batch sizes.
  • With the same number of CPUs and only ~50% of Rockset's RAM, CrateDB handles higher throughput than Rockset at a lower price.

Batch Size 50

Latency results for a batch size of 50

Query latencies for Rockset 2XLarge vs. CrateDB 4 Nodes CR4 (p50/p95), batch size 50 (lower is better)

Query latencies for Rockset 4XLarge vs. CrateDB 8 Nodes CR4 (p50/p95), batch size 50
(lower is better)

Batch Size 500 

Latency results for a batch size of 500

Query latencies for Rockset 2XLarge vs. CrateDB 4 Nodes CR4 (p50/p95), batch size 500 (lower is better)

Query latencies for Rockset 4XLarge vs. CrateDB 8 Nodes CR4 (p50/p95), batch size 500
(lower is better)

Comparing Results between Rockset, Elasticsearch, and CrateDB 

Streaming data is a must in today’s data architectures. Many industries rely on event streaming platforms and require efficient data stores to analyze the influx of data in real time. Rockset and Elasticsearch are eventually consistent, and so is CrateDB for general queries, because newly written data becomes visible only when the IndexReaders are refreshed. For queries by primary key, however, CrateDB always returns the latest data, as these lookups are served via the translog. The prominent update-and-fetch use case is therefore possible.
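The difference between the two read paths can be illustrated with a toy model in Python. This is a conceptual sketch only, not CrateDB's implementation: the class and method names are hypothetical, and the two dictionaries merely stand in for the translog and the refreshed index. In CrateDB itself, a refresh happens periodically and can also be triggered with the `REFRESH TABLE` statement.

```python
class ToyTable:
    """Toy model of near-real-time visibility (illustration only)."""

    def __init__(self):
        self._recent_writes = {}  # stands in for the translog
        self._searchable = {}     # stands in for the refreshed index

    def upsert(self, pk, doc):
        # A write is recorded immediately but is not yet searchable.
        self._recent_writes[pk] = doc

    def get_by_pk(self, pk):
        # Primary-key lookups check the most recent writes first,
        # so they always return the latest version of a document.
        if pk in self._recent_writes:
            return self._recent_writes[pk]
        return self._searchable.get(pk)

    def search(self, predicate):
        # General queries only see what the last refresh made visible.
        return [doc for doc in self._searchable.values() if predicate(doc)]

    def refresh(self):
        # Make all recent writes visible to general queries.
        self._searchable.update(self._recent_writes)
        self._recent_writes.clear()


table = ToyTable()
table.upsert(1, {"id": 1, "status": "new"})

by_pk_before = table.get_by_pk(1)
scan_before = table.search(lambda doc: doc["status"] == "new")

table.refresh()
scan_after = table.search(lambda doc: doc["status"] == "new")

print(by_pk_before)  # the document is returned right after the write
print(scan_before)   # empty: the write has not been refreshed yet
print(scan_after)    # the document is visible after the refresh
```

An application that writes a record and immediately reads it back by primary key therefore does not have to wait for a refresh.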

To show the efficiency of Rockset for this use case, Rockbench was initially designed to compare Rockset with Elasticsearch on throughput and latency of streaming data. It showed that Rockset achieves up to 4x higher throughput and 2.5x lower latency than Elasticsearch.

As many Rockset customers currently revert to Elasticsearch or OpenSearch (or consider them as an alternative), we see these results as good proof of CrateDB's unique distributed SQL layer. We can confidently state that CrateDB is significantly faster than Elasticsearch for streaming ingest, by up to 10x.

Conclusion

We can state that CrateDB delivers not just lower latencies at high throughput rates, but also more constant query latencies, and these results already include replication for high availability. Rockset would require at least two virtual instances to achieve zero-downtime operations, which doubles the costs.

Given a lock-free ingest and query mechanism, CrateDB can easily combine analytical queries with high ingestion rates, although this has not been demonstrated in this benchmark. This eliminates the need for multiple virtual instances to separate ingest from query workloads. 

We highly recommend working together with CrateDB to identify the right cluster sizing for your particular workload. Please note that we offer free consulting services for a limited time to migrate your Rockset environment to CrateDB.

Talk to a solution engineer or Get started for free in CrateDB Cloud