Partitioning strategy
CrateDB's table partitioning feature is a powerful tool for efficiently handling large datasets, especially time-series data. By using the PARTITIONED BY clause, data is split into smaller, more manageable partitions. This makes it easier to query and manage relevant data subsets, while also reducing storage costs by removing outdated or irrelevant data without affecting the rest of the dataset.
When new data is ingested into a partitioned table, CrateDB automatically creates new partitions when needed. Similarly, old data can be easily removed by dropping entire partitions without the costly process of index rebuilds. This makes it very easy to manage historical data and ensures that the database remains fast and efficient.
CrateDB's partitioning strategy allows for automated purging of old data, while retaining relevant information. This is useful for businesses with large datasets, as it optimizes storage and query efficiency.
CREATE TABLE t1 ( name STRING, month TIMESTAMP ) CLUSTERED INTO 3 SHARDS PARTITIONED BY (month); INSERT INTO t1 (name, month) VALUES ( ('foo', '2023-01-01’), ('bar', '2023-02-01') ); DELETE FROM t1 WHERE month = '2023-01-01';
Data Tiering
CrateDB is a distributed database that can combine different storage types - hot storage for frequently accessed data and cold storage for less accessed data. You can assign a storage type to each node in the cluster using the node.attr.storage attribute.
To configure data tiering, we first define that new partitions are to be placed on a hot node, using the ("routing.allocation.require.storage" = 'hot') setting.
CREATE TABLE t1 ( name STRING, month TIMESTAMP) PARTITIONED BY (month) WITH ("routing.allocation.require.storage" = 'hot');
To relocate a partition to cold storage, we alter it with ("routing.allocation.require.storage" = 'cold').
ALTER TABLE t1 PARTITION (month = '2023-02-01') SET ("routing.allocation.require.storage" = 'cold');
Cold storage is used for less frequently accessed data, which can be stored on more cost-effective disks. CrateDB’s cluster automatically initiates the relocation of the affected partitions to a node that fulfills the specified requirements.
Retention policy
CrateDB uses partitioned tables to manage data retention efficiently. Once the data has reached the end of its retention period and is no longer needed, the DELETE FROM statement is used to remove data from the table.
DELETE FROM t1 WHERE month = '2023-01-01';
You can create a table called retention_policies to store information about the schema name, table name, partition column, retention period, and strategy for data exceeding retention period.
CREATE TABLE IF NOT EXISTS retention_policies ( "table_schema" TEXT, "table_name" TEXT, "partition_column" TEXT NOT NULL, "retention_period" INTEGER NOT NULL, "strategy" TEXT NOT NULL, PRIMARY KEY ("table_schema", "table_name", "strategy")
In the example below, you insert a policy for the 't1' table in the default 'doc' schema, where data is partitioned by month, and data older than 3 months will be subject to the 'reallocate' strategy. This means that after 3 months, the data will be moved from hot to cold storage within the cluster.
INSERT INTO retention_policies (table_schema, table_name, partition_column, retention_period, strategy) VALUES ('doc', ‘t1', ‘month’, 3, 'reallocate');
Snapshots
Snapshots are a great way to archive old partitions. They capture the state of tables at the exact moment they are taken. You can use them to back up and restore individual partitions, managing historical data that doesn't need to be available in the hot or warm storage.
Snapshots are stored in repositories, which act as storage containers. The repository is set up with a specific storage backend, such as Amazon S3, Microsoft Azure Blob Storage, Google Cloud Storage, or a local file system.
Step 1: Create a repository for storing snapshots
Use the CREATE REPOSITORY statement in CrateDB. In the example below, we create a repository called 'export_cold' and use S3 as the storage backend. The WITH clause includes all the necessary configuration details, such as protocol, endpoint, access credentials, and the name of the bucket.
CREATE REPOSITORY export_cold TYPE s3 WITH ( protocol = 'https’, endpoint = 's3-store.example.org:443’, access_key = '’, secret_key = '’, bucket = 'cratedb-cold-storage' );
Step 2: Create your snapshots
To create a snapshot, use the CREATE SNAPSHOT statement and specify which tables to include. The 'wait_for_completion=true' option indicates that the statement will not return control to the user until the snapshot is completely created.
CREATE SNAPSHOT export_cold.snapshot1 TABLE t1 PARTITION (date='2024-01-01') WITH (wait_for_completion=true);
Step 3: Restoring snapshots
Snapshots can help recover lost or corrupted data by reverting to a previous state. Here's how to restore your data with CrateDB.
Option 1: Restore a single table partition
If you need to restore just a single partition of a table, you can use the RESTORE SNAPSHOT command with the PARTITION clause. Our example shows how to restore the partition for January 2023 from a snapshot named 'snapshot1' in the repository 'export_cold'.
RESTORE SNAPSHOT export_cold.snapshot1 TABLE t1 PARTITION (month='2023-01-01') WITH (wait_for_completion=true);
Option 2: Restore an entire table
To restore an entire table, omit the PARTITION clause.
RESTORE SNAPSHOT export_cold.snapshot1 TABLE t1 WITH (wait_for_completion=true);
Option 3: Restore the entire snapshot
If you want to restore the whole content of the snapshot, use the ALL keyword. This restores all the tables, partitions, and possibly other cluster metadata contained in the snapshot.
RESTORE SNAPSHOT export_cold.snapshot1 ALL WITH (wait_for_completion=true);