Skip to content
Data models > Time series

Time Series Data Lifecycle

Partitioning strategy

CrateDB's table partitioning feature is a powerful tool for efficiently handling large datasets, especially time-series data. By using the PARTITIONED BY clause, data is split into smaller, more manageable partitions. This makes it easier to query and manage relevant data subsets, while also reducing storage costs by removing outdated or irrelevant data without affecting the rest of the dataset.  

When new data is ingested into a partitioned table, CrateDB automatically creates new partitions when needed. Similarly, old data can be easily removed by dropping entire partitions without the costly process of index rebuilds. This makes it very easy to manage historical data and ensures that the database remains fast and efficient.  

CrateDB's partitioning strategy allows for automated purging of old data, while retaining relevant information. This is useful for businesses with large datasets, as it optimizes storage and query efficiency. 

CREATE TABLE t1 ( 
   name STRING, 
   month TIMESTAMP 
)  
CLUSTERED INTO 3 SHARDS 
PARTITIONED BY (month); 

INSERT INTO t1 (name, month) VALUES (  
  ('foo', '2023-01-01’), 
  ('bar', '2023-02-01') 
); 

DELETE FROM t1 WHERE month = '2023-01-01'; 

Data Tiering

CrateDB is a distributed database that can combine different storage types - hot storage for frequently accessed data and cold storage for less accessed data. You can assign a storage type to each node in the cluster using the node.attr.storage attribute. 

Data Tiering in CrateDB: Hot, Warm, and Cold dataTo configure data tiering, we first define that new partitions are to be placed on a hot node, using the ("routing.allocation.require.storage" = 'hot') setting.

CREATE TABLE t1 ( 
   name STRING, 
   month TIMESTAMP)  
PARTITIONED BY (month) 
WITH ("routing.allocation.require.storage" = 'hot'); 

To relocate a partition to cold storage, we alter it with ("routing.allocation.require.storage" = 'cold').

ALTER TABLE t1  
PARTITION (month = '2023-02-01')  
SET ("routing.allocation.require.storage" = 'cold');  

Cold storage is used for less frequently accessed data, which can be stored on more cost-effective disks. CrateDB’s cluster automatically initiates the relocation of the affected partitions to a node that fulfills the specified requirements.

Retention policy

CrateDB uses partitioned tables to manage data retention efficiently. Once the data has reached the end of its retention period and is no longer needed, the DELETE FROM statement is used to remove data from the table.

DELETE FROM t1 WHERE month = '2023-01-01';   

You can create a table called retention_policies to store information about the schema name, table name, partition column, retention period, and strategy for data exceeding retention period.

CREATE TABLE IF NOT EXISTS retention_policies (    
   "table_schema" TEXT, 
   "table_name" TEXT, 
   "partition_column" TEXT NOT NULL, 
   "retention_period" INTEGER NOT NULL, 
   "strategy" TEXT NOT NULL, 
PRIMARY KEY ("table_schema", "table_name", "strategy") 

In the example below, you insert a policy for the 't1' table in the default 'doc' schema, where data is partitioned by month, and data older than 3 months will be subject to the 'reallocate' strategy. This means that after 3 months, the data will be moved from hot to cold storage within the cluster.

INSERT INTO retention_policies  
(table_schema, table_name, partition_column, retention_period, strategy) 
VALUES ('doc', ‘t1', ‘month’, 3, 'reallocate'); 

The combination of partitioned tables, data tiering and retention strategies in CrateDB enables efficient data lifecycle management. Review and adjust your retention policies periodically to align with changing data access patterns and storage capacity.

Snapshots

Snapshots are a great way to archive old partitions. They capture the state of tables at the exact moment they are taken. You can use them to back up and restore individual partitions, managing historical data that doesn't need to be available in the hot or warm storage.

Snapshots are stored in repositories, which act as storage containers. The repository is set up with a specific storage backend, such as Amazon S3, Microsoft Azure Blob Storage, Google Cloud Storage, or a local file system.

Step 1: Create a repository for storing snapshots

Use the CREATE REPOSITORY statement in CrateDB. In the example below, we create a repository called 'export_cold' and use S3 as the storage backend. The WITH clause includes all the necessary configuration details, such as protocol, endpoint, access credentials, and the name of the bucket.

CREATE REPOSITORY export_cold 
TYPE s3 
WITH ( 
  protocol = 'https’, 
  endpoint = 's3-store.example.org:443’, 
  access_key = '’, 
  secret_key = '’, 
  bucket = 'cratedb-cold-storage' 
); 

Step 2: Create your snapshots

To create a snapshot, use the CREATE SNAPSHOT statement and specify which tables to include. The 'wait_for_completion=true' option indicates that the statement will not return control to the user until the snapshot is completely created.

CREATE SNAPSHOT export_cold.snapshot1 TABLE t1 
PARTITION (date='2024-01-01') 
WITH (wait_for_completion=true); 

Step 3: Restoring snapshots

Snapshots can help recover lost or corrupted data by reverting to a previous state. Here's how to restore your data with CrateDB.

Option 1: Restore a single table partition

If you need to restore just a single partition of a table, you can use the RESTORE SNAPSHOT command with the PARTITION clause. Our example shows how to restore the partition for January 2023 from a snapshot named 'snapshot1' in the repository 'export_cold'.

RESTORE SNAPSHOT export_cold.snapshot1 
TABLE t1 PARTITION (month='2023-01-01') 
WITH (wait_for_completion=true); 

Option 2: Restore an entire table

To restore an entire table, omit the PARTITION clause.

RESTORE SNAPSHOT export_cold.snapshot1 
TABLE t1  
WITH (wait_for_completion=true);   

Option 3: Restore the entire snapshot

If you want to restore the whole content of the snapshot, use the ALL keyword. This restores all the tables, partitions, and possibly other cluster metadata contained in the snapshot.

RESTORE SNAPSHOT export_cold.snapshot1 ALL  
WITH (wait_for_completion=true);    

Always maintain recent backups and test restore procedures regularly for disaster recovery planning.

 

Want to read more?

Whitepaper" Guide for Time Series Data Projects