Live Stream: Turbocharge your aggregations, search & AI models & get real-time insights

Register now
Skip to content
Resources > Academy > CrateDB Fundamentals

Architecture Overview

Login or sign up for free CrateDB Academy: CrateDB Fundamentals

It's time to take a deeper look at the architecture of CrateDB. In this video, you'll learn more about how CrateDB clusters work and how data is stored in a clustered environment. We'll also take a look at the different ways that CrateDB can be deployed.

CrateDB is an open source database that allows you to store any kind of structured, semi structured and unstructured data in a single data store. It offers dynamic schemas and provides access to data via standard SQL with support for the PostgreSQL Wire Protocol.

CrateDB offers the best of SQL, NoSQL and search engines, enabling full-text and vector similarity search to power traditional and AI enabled applications. A distributed shared-nothing architecture enables high availability, vertical and horizontal scaling, and high volume concurrent reads and writes with the distributed query engine. The combination of advanced indexing and columnar storage enables queries in single digit milliseconds across billions of rows of data.

CrateDB enables faster time to market with very low operational overhead. Flexible deployment options address a variety of use cases.

CrateDB is a distributed database that uses a shared-nothing architecture. Each node is autonomous. It operates independently without relying on shared resources. This approach enables high availability, vertical and horizontal scaling, and high volume concurrent reads and writes.

As your data set grows or your processing needs increase, you can simply add more nodes to the cluster. CrateDB will automatically balance the data across the nodes, optimizing query efficiency and simplifying maintenance.

Let's dig deeper into the architecture of a node in a CrateDB cluster. Unlike in a primary secondary architecture, all nodes in a CrateDB cluster are equal. They can perform any operation.

The four major components of each node are as follows:

  • The SQL handler is responsible for incoming client requests. The SQL handler pauses and analysis SQL statements, creating an execution plan.
  • The Job execution service manages the execution of plans or jobs. Jobs, which may contain multiple operations, are distributed by the transport node to other nodes involved in their execution.
  • The Cluster State service manages the cluster state. It is responsible for master node election and node discovery.
  • Finally, the Data Storage component. This component handles storage and retrieval of data from disk based on the execution plan.

As CrateDB stored data in sharded tables, it is divided across multiple nodes in a cluster. Each shard is a distinct Lucene index stored on the file system. Each node is accessible via 3 ports. SQL queries are accepted and results returned via the PostgreSQL Wire Protocol port or the HTTP port. The third port is used as a transport port for inter cluster communication. For a deeper dive into node operations, be sure to download a copy of our architecture guide.

As business needs grow, so do data volumes and hardware requirements. When more CPU, RAM or disk storage is needed, additional nodes can be seamlessly added to the CrateDB cluster. Without manual intervention, the cluster will automatically rebalance the data to accommodate the new nodes.

Here, we have a three node cluster that's utilising around 70% of its available storage space. Adding a fourth node results in an unbalanced distribution of data. CrateDB initiates an automatic redistribution. Data redistribution continues until an almost equal level of storage utilization is achieved across the four nodes now making up the enlarged cluster.

One of the key benefits of a distributed database is the ability to provide high availability, crucial for always on applications. CrateDB allows nodes in a cluster to be distributed across multiple availability zones or data centres. CrateDB clusters exhibit self healing characteristics. Nodes rejoining a cluster after a failure automatically synchronize with the latest data.

Let's walk through a typical failure and recovery process. Here, we have a cluster of three nodes. A node leaves the cluster, perhaps due to hardware failure, network partition or a rolling maintenance task. The data is automatically replicated in CrateDB. There is an automatic failover, ensuring no loss of data despite one node having left the cluster. When the node is back up, or a new node joins the cluster to replace it, data is automatically synchronised and rebalanced as necessary. Once the data synchronisation is complete, the node becomes fully operational. The cluster has recovered autonomously without any manual intervention required.

All tables in CrateDB are sharded and optionally partitioned. This means that tables are divided and distributed across the nodes forming a cluster. Each shard in CrateDB corresponds to a Lucene index made of segments stored on the file system. These files physically reside in one of the configured data directories on the nodes. Lucene only appends data to segment files. Data already written to disk is never mutated. This simplifies replication and recovery.

Synchronising a shard becomes a straightforward process of fetching data from a specific marker. CrateDB periodically merges segments as they grow over time. This is an automatic process that runs in the background, but users can also initiate it on demand using the OPTIMIZE TABLE command. See the CrateDB documentation for more details.

Tables in CrateDB can be divided by defining partition columns. This CREATE TABLE statement creates a table to store details of our 311 calls from a Chicago data set. The createddate column contains a timestamp representing the date and time that the call was logged. Here we're defining a generated column 'week' populated by truncating the created date to the timestamp for the start of the week that it falls in. Now we're telling CrateDB to partition this table by that generated field. This partitions the table by week, with CrateDB automatically generating partitions as needed on the fly as data is inserted into the table.

Partitioning tables offers several advantages, including flexible sharding, improved query performance, and the ability to backup and delete old data in an efficient way. For more information on partitioning, refer to the online documentation.

CrateDB offers a range of deployment options, ensuring that there's a solution that aligns seamlessly with your infrastructure and business goals. CrateDB excels in public cloud environments, including AWS, Azure, and Google Cloud. This option is ideal for those in pursuit of scalability and flexibility. Leveraging the public cloud model enables CrateDB to dynamically scale resources based on workload demands. This provides a cost effective solution without requiring upfront hardware investment.

For those seeking cloud-like flexibility while maintaining control of their own environment, CrateDB can be deployed in private clouds. This approach combines the advantages of cloud computing with the privacy and security controls of an on premises deployment. Organisations prioritising security and control can deploy CrateDB on premise. In this model, existing infrastructure can be utilised and comprehensive control over data and database management maintained to ensure adherence to compliance and security protocols.

Hybrid deployments offer a balanced solution, combining the control and security of an on premise system with the scalability of the cloud. CrateDB allows sensitive data to be stored on premise whilst also leveraging cloud resources for scalable compute and storage.

CrateDB is uniquely positioned for edge computing environments. Deploying CrateDB at the edge facilitates immediate data processing, reducing latency and bandwidth constraints associated with data transfer to centralized systems. This ensures faster insights and decision making at the point of data generation.

Finally, CrateDB offers support for containerization and orchestration tools such as Docker and Kubernetes, enhancing operational flexibility and ensuring adaptability to varying infrastructure requirements.

This video provided a high level overview of the architecture of CrateDB. To learn more, including how CrateDB handles replication, atomicity, and consistency, download a copy of the comprehensive Architecture guide from our website.

Take this course for free