System information¶

CrateDB provides the sys schema which contains virtual tables. These tables are read-only and can be queried to get statistical real-time information about the cluster, its nodes and their shards.

Note

During a rolling upgrade of the cluster to a newer version, while the cluster is in a mixed state with nodes on the older and on the new version, avoid querying the sys tables using SELECT *, as new columns could have been added, removed or modified between versions. Instead, use a defined list of the columns that you need to return from the query.

Cluster¶

Basic information about the CrateDB cluster can be retrieved from the sys.cluster table:

Name	Description	Return Type
`id`	A unique ID generated by the system.	`TEXT`
`license`	The current CrateDB license information. Always NULL. This exists for backward compatibility.	`OBJECT`
`name`	The cluster name.	`TEXT`
`master_node`	Node ID of the node which currently operates as master.	`TEXT`
`settings`	The cluster settings.	`OBJECT`
`state`	Base64 encoded cluster state. Only the `crate` superuser and users with `AL` permission can view the state. This is mainly for CrateDB engineers to debug issues.	`TEXT`

The result has at most 1 row:

cr> select name from sys.cluster;
+-----------------+
| name            |
+-----------------+
| Testing-CrateDB |
+-----------------+
SELECT 1 row in set (... sec)

Cluster settings¶

The sys.cluster.settings expression returns information about the currently applied cluster settings.

cr> select settings from sys.cluster;
+-----------------------------------------------------...-+
| settings                                                |
+-----------------------------------------------------...-+
| {"bulk": {...}, "cluster": {...}, "gateway": {...}, ... |
+-----------------------------------------------------...-+
SELECT 1 row in set (... sec)

cr> select column_name, data_type from information_schema.columns
... where column_name like 'settings%'
... and table_name = 'cluster';
+-----------------------------------------------------------------------------------+--------------+
| column_name                                                                       | data_type    |
+-----------------------------------------------------------------------------------+--------------+
| settings                                                                          | object       |
| settings['bulk']                                                                  | object       |
| settings['bulk']['request_timeout']                                               | text         |
| settings['cluster']                                                               | object       |
| settings['cluster']['graceful_stop']                                              | object       |
| settings['cluster']['graceful_stop']['force']                                     | boolean      |
| settings['cluster']['graceful_stop']['min_availability']                          | text         |
| settings['cluster']['graceful_stop']['timeout']                                   | text         |
| settings['cluster']['info']                                                       | object       |
| settings['cluster']['info']['update']                                             | object       |
| settings['cluster']['info']['update']['interval']                                 | text         |
| settings['cluster']['max_shards_per_node']                                        | integer      |
| settings['cluster']['routing']                                                    | object       |
| settings['cluster']['routing']['allocation']                                      | object       |
| settings['cluster']['routing']['allocation']['allow_rebalance']                   | text         |
| settings['cluster']['routing']['allocation']['awareness']                         | object       |
| settings['cluster']['routing']['allocation']['awareness']['attributes']           | text_array   |
| settings['cluster']['routing']['allocation']['awareness']['force']                | object       |
| settings['cluster']['routing']['allocation']['awareness']['force']['']            | object       |
| settings['cluster']['routing']['allocation']['balance']                           | object       |
| settings['cluster']['routing']['allocation']['balance']['index']                  | real         |
| settings['cluster']['routing']['allocation']['balance']['shard']                  | real         |
| settings['cluster']['routing']['allocation']['balance']['threshold']              | real         |
| settings['cluster']['routing']['allocation']['cluster_concurrent_rebalance']      | integer      |
| settings['cluster']['routing']['allocation']['disk']                              | object       |
| settings['cluster']['routing']['allocation']['disk']['threshold_enabled']         | boolean      |
| settings['cluster']['routing']['allocation']['disk']['watermark']                 | object       |
| settings['cluster']['routing']['allocation']['disk']['watermark']['flood_stage']  | text         |
| settings['cluster']['routing']['allocation']['disk']['watermark']['high']         | text         |
| settings['cluster']['routing']['allocation']['disk']['watermark']['low']          | text         |
| settings['cluster']['routing']['allocation']['enable']                            | text         |
| settings['cluster']['routing']['allocation']['exclude']                           | object       |
| settings['cluster']['routing']['allocation']['exclude']['_host']                  | text         |
| settings['cluster']['routing']['allocation']['exclude']['_id']                    | text         |
| settings['cluster']['routing']['allocation']['exclude']['_ip']                    | text         |
| settings['cluster']['routing']['allocation']['exclude']['_name']                  | text         |
| settings['cluster']['routing']['allocation']['include']                           | object       |
| settings['cluster']['routing']['allocation']['include']['_host']                  | text         |
| settings['cluster']['routing']['allocation']['include']['_id']                    | text         |
| settings['cluster']['routing']['allocation']['include']['_ip']                    | text         |
| settings['cluster']['routing']['allocation']['include']['_name']                  | text         |
| settings['cluster']['routing']['allocation']['node_concurrent_recoveries']        | integer      |
| settings['cluster']['routing']['allocation']['node_initial_primaries_recoveries'] | integer      |
| settings['cluster']['routing']['allocation']['require']                           | object       |
| settings['cluster']['routing']['allocation']['require']['_host']                  | text         |
| settings['cluster']['routing']['allocation']['require']['_id']                    | text         |
| settings['cluster']['routing']['allocation']['require']['_ip']                    | text         |
| settings['cluster']['routing']['allocation']['require']['_name']                  | text         |
| settings['cluster']['routing']['allocation']['total_shards_per_node']             | integer      |
| settings['cluster']['routing']['rebalance']                                       | object       |
| settings['cluster']['routing']['rebalance']['enable']                             | text         |
| settings['fdw']                                                                   | object       |
| settings['fdw']['allow_local']                                                    | boolean      |
| settings['gateway']                                                               | object       |
| settings['gateway']['expected_data_nodes']                                        | integer      |
| settings['gateway']['expected_nodes']                                             | integer      |
| settings['gateway']['recover_after_data_nodes']                                   | integer      |
| settings['gateway']['recover_after_nodes']                                        | integer      |
| settings['gateway']['recover_after_time']                                         | text         |
| settings['indices']                                                               | object       |
| settings['indices']['breaker']                                                    | object       |
| settings['indices']['breaker']['policy']                                          | text         |
| settings['indices']['breaker']['query']                                           | object       |
| settings['indices']['breaker']['query']['limit']                                  | text         |
| settings['indices']['breaker']['request']                                         | object       |
| settings['indices']['breaker']['request']['limit']                                | text         |
| settings['indices']['breaker']['total']                                           | object       |
| settings['indices']['breaker']['total']['limit']                                  | text         |
| settings['indices']['recovery']                                                   | object       |
| settings['indices']['recovery']['internal_action_long_timeout']                   | text         |
| settings['indices']['recovery']['internal_action_timeout']                        | text         |
| settings['indices']['recovery']['max_bytes_per_sec']                              | text         |
| settings['indices']['recovery']['max_concurrent_file_chunks']                     | integer      |
| settings['indices']['recovery']['recovery_activity_timeout']                      | text         |
| settings['indices']['recovery']['retry_delay_network']                            | text         |
| settings['indices']['recovery']['retry_delay_state_sync']                         | text         |
| settings['indices']['replication']                                                | object       |
| settings['indices']['replication']['retry_timeout']                               | text         |
| settings['logger']                                                                | object_array |
| settings['logger']['level']                                                       | text_array   |
| settings['logger']['name']                                                        | text_array   |
| settings['memory']                                                                | object       |
| settings['memory']['allocation']                                                  | object       |
| settings['memory']['allocation']['type']                                          | text         |
| settings['memory']['operation_limit']                                             | integer      |
| settings['overload_protection']                                                   | object       |
| settings['overload_protection']['dml']                                            | object       |
| settings['overload_protection']['dml']['initial_concurrency']                     | integer      |
| settings['overload_protection']['dml']['max_concurrency']                         | integer      |
| settings['overload_protection']['dml']['min_concurrency']                         | integer      |
| settings['overload_protection']['dml']['queue_size']                              | integer      |
| settings['replication']                                                           | object       |
| settings['replication']['logical']                                                | object       |
| settings['replication']['logical']['ops_batch_size']                              | integer      |
| settings['replication']['logical']['reads_poll_duration']                         | text         |
| settings['replication']['logical']['recovery']                                    | object       |
| settings['replication']['logical']['recovery']['chunk_size']                      | text         |
| settings['replication']['logical']['recovery']['max_concurrent_file_chunks']      | integer      |
| settings['statement_max_length']                                                  | integer      |
| settings['statement_timeout']                                                     | text         |
| settings['stats']                                                                 | object       |
| settings['stats']['breaker']                                                      | object       |
| settings['stats']['breaker']['log']                                               | object       |
| settings['stats']['breaker']['log']['jobs']                                       | object       |
| settings['stats']['breaker']['log']['jobs']['limit']                              | text         |
| settings['stats']['breaker']['log']['operations']                                 | object       |
| settings['stats']['breaker']['log']['operations']['limit']                        | text         |
| settings['stats']['enabled']                                                      | boolean      |
| settings['stats']['jobs_log_expiration']                                          | text         |
| settings['stats']['jobs_log_filter']                                              | text         |
| settings['stats']['jobs_log_persistent_filter']                                   | text         |
| settings['stats']['jobs_log_size']                                                | integer      |
| settings['stats']['operations_log_expiration']                                    | text         |
| settings['stats']['operations_log_size']                                          | integer      |
| settings['stats']['service']                                                      | object       |
| settings['stats']['service']['interval']                                          | text         |
| settings['stats']['service']['max_bytes_per_sec']                                 | text         |
| settings['udc']                                                                   | object       |
| settings['udc']['enabled']                                                        | boolean      |
| settings['udc']['initial_delay']                                                  | text         |
| settings['udc']['interval']                                                       | text         |
| settings['udc']['url']                                                            | text         |
+-----------------------------------------------------------------------------------+--------------+
SELECT ... rows in set (... sec)

For further details, see the Cluster Settings configuration section.

Nodes¶

To get information about the nodes query for sys.nodes.

This table can be queried for one, multiple or all nodes within a cluster.

The table schema is as follows:

`id`¶

Column Name	Description	Return Type
`id`	A unique ID within the cluster generated by the system.	`TEXT`

`name`¶

Column Name	Description	Return Type
`name`	The node name within a cluster. The system will choose a random name. You can also customize the node name, see Node-specific settings.	`TEXT`

`hostname`¶

Column Name	Description	Return Type
`hostname`	The specified host name of the machine the node is running on.	`TEXT`

`rest_url`¶

Column Name	Description	Return Type
`rest_url`	Full HTTP(s) address where the REST API of the node is exposed, including schema, hostname (or IP) and port.	`TEXT`

`is_master`¶

Column Name	Description	Return Type
`is_master`	Boolean flag which is `true` for the currently elected master node of the cluster, and `false` for all other nodes.	`BOOLEAN`

`roles`¶

Column Name	Description	Return Type
`roles`	List the roles of the node in the cluster: `master_eligible`: the node can be elected as master `data`: the node can store data	`TEXT_ARRAY`

`attributes`¶

Column Name	Description	Return Type
`attributes`	The custom attributes set for the node, e.g. if `node.attr.color` is `blue`, and `node.attr.location` is east`, the value of this column would be: ``{color=blue, location=east}	`OBJECT`

`port`¶

Column Name	Description	Return Type
`port`	The specified ports for both HTTP and binary transport interfaces. You can also customize the ports setting, see Ports.	`OBJECT`
`port['http']`	CrateDB’s HTTP port.	`INTEGER`
`port['transport']`	CrateDB’s binary transport port.	`INTEGER`
`port['psql']`	The PostgreSQL wire protocol port.	`INTEGER`

`load`¶

Column Name	Description	Return Type
`load`	System load statistics	`OBJECT`
`load['1']`	Average load over the last 1 minute.	`DOUBLE PRECISION`
`load['5']`	Average load over the last 5 minutes.	`DOUBLE PRECISION`
`load['15']`	Average load over the last 15 minutes.	`DOUBLE PRECISION`
`load['probe_timestamp']`	Unix timestamp at the time of collection of the load probe.	`BIGINT`

`mem`¶

Column Name	Description	Return Type
`mem`	Memory utilization statistics of the host.	`OBJECT`
`mem['used']`	Currently used memory in bytes.	`BIGINT`
`mem['used_percent']`	Currently used memory in percent of total.	`SMALLINT`
`mem['free']`	Currently available memory in bytes.	`BIGINT`
`mem['free_percent']`	Currently available memory in percent of total.	`SMALLINT`
`mem['probe_timestamp']`	Unix timestamp at the time of collection of the memory probe.	`BIGINT`

`heap`¶

Column Name	Description	Return Type
`heap`	Heap memory utilization statistics.	`OBJECT`
`heap['used']`	Currently used heap memory in bytes.	`BIGINT`
`heap['max']`	Maximum available heap memory. You can specify the max heap memory CrateDB should use in the Configuration.	`BIGINT`
`heap['free']`	Currently available heap memory in bytes.	`BIGINT`
`heap['probe_timestamp']`	Unix timestamp at the time of collection of the heap probe.	`BIGINT`

`version`¶

Column Name	Description	Return Type
`version`	CrateDB version information.	`OBJECT`
`version['number']`	Version string in format `"major.minor.hotfix"`	`TEXT`
`version['build_hash']`	SHA hash of the GitHub commit which this build was built from.	`TEXT`
`version['build_snapshot']`	Indicates whether this build is a snapshot build.	`BOOLEAN`
`version['minimum_index_compatibility_version']`	Indicates the minimum compatible index version which is supported.	`TEXT`
`version['minimum_wire_compatibility_version']`	Indicates the minimum compatible wire protocol version which is supported.	`TEXT`

`cluster_state_version`¶

Column Name	Description	Return Type
`cluster_state_version`	The current version of the cluster state. The cluster state is an immutable structure and that is recreated when a change is published.	`BIGINT`

`fs`¶

Column Name	Description	Return Type
`fs`	Utilization statistics about the file system.	`OBJECT`
`fs['total']`	Aggregated usage statistic of all disks on the host.	`OBJECT`
`fs['total']['size']`	Total size of all disks in bytes.	`BIGINT`
`fs['total']['used']`	Total used space of all disks in bytes.	`BIGINT`
`fs['total']['available']`	Total available space of all disks in bytes.	`BIGINT`
`fs['total']['reads']`	Total number of reads on all disks.	`BIGINT`
`fs['total']['bytes_read']`	Total size of reads on all disks in bytes.	`BIGINT`
`fs['total']['writes']`	Total number of writes on all disks.	`BIGINT`
`fs['total']['bytes_written']`	Total size of writes on all disks in bytes.	`BIGINT`
`fs['disks']`	Usage statistics of individual disks on the host.	`ARRAY`
`fs['disks']['dev']`	Device name	`TEXT`
`fs['disks']['size']`	Total size of the disk in bytes.	`BIGINT`
`fs['disks']['used']`	Used space of the disk in bytes.	`BIGINT`
`fs['disks']['available']`	Available space of the disk in bytes.	`BIGINT`
`fs['data']`	Information about data paths used by the node.	`ARRAY`
`fs['data']['dev']`	Device name	`TEXT`
`fs['data']['path']`	File path where the data of the node resides.	`TEXT`

`thread_pools`¶

Column Name	Description	Return Type
`thread_pools`	Usage statistics of Java thread pools.	`ARRAY`
`thread_pools['name']`	Name of the pool.	`TEXT`
`thread_pools['active']`	Number of currently running thread in the thread pool.	`INTEGER`
`thread_pools['rejected']`	Total number of rejected threads in the thread pool.	`BIGINT`
`thread_pools['largest']`	Largest number of threads that have ever simultaneously been in the pool.	`INTEGER`
`thread_pools['completed']`	Total number of completed thread in the thread pool.	`BIGINT`
`thread_pools['threads']`	Size of the thread pool.	`INTEGER`
`thread_pools['queue']`	Number of thread currently in the queue.	`INTEGER`

`os`¶

Column Name	Description	Return Type
`os`	Operating system stats	`OBJECT`
`os['uptime']`	System uptime in milliseconds Requires allowing system calls on Windows and macOS. See notes in Uptime limitations.	`BIGINT`
`os['timestamp']`	UNIX timestamp in millisecond resolution	`BIGINT`
`os['cpu']`	Information about CPU utilization	`OBJECT`
`os['cpu']['used']`	System CPU usage as percentage	`SMALLINT`
`os['probe_timestamp']`	Unix timestamp at the time of collection of the OS probe.	`BIGINT`
`os['cgroup']`	Information about cgroups (Linux only)	`OBJECT`
`os['cgroup']['cpuacct']`	Information about CPU accounting	`OBJECT`
`os['cgroup']['cpuacct']['control_group']`	The path to the CPU accounting cgroup	`TEXT`
`os['cgroup']['cpuacct']['usage_nanos']`	The total CPU time (in nanoseconds) consumed by all tasks in this cgroup.	`BIGINT`
`os['cgroup']['cpu']`	Information about the CPU subsystem	`OBJECT`
`os['cgroup']['cpu']['control_group']`	The path to the CPU cgroup	`TEXT`
`os['cgroup']['cpu']['cfs_period_micros']`	The period of time (in microseconds) the cgroup access to the CPU gets reallocated.	`BIGINT`
`os['cgroup']['cpu']['cfs_quota_micros']`	The total amount of time (in microseconds) for which all tasks in the cgroup can run during one period (cfs_period_micros).	`BIGINT`
`os['cgroup']['cpu']['num_elapsed_periods']`	The nr. of period intervals (cfs_period_micros) that have elapsed.	`BIGINT`
`os['cgroup']['cpu']['num_times_throttled']`	The nr. of times tasks in the cgroup have been throttled.	`BIGINT`
`os['cgroup']['cpu']['time_throttled_nanos']`	The total time (in nanoseconds) for which tasks in the cgroup have been throttled.	`BIGINT`
`os['cgroup']['mem']`	Information about memory resources used by tasks in a cgroup.	`OBJECT`
`os['cgroup']['mem']['control_group']`	The path to the memory cgroup	`TEXT`
`os['cgroup']['mem']['usage_bytes']`	The total current memory usage by processes in the cgroup.	`TEXT`
`os['cgroup']['mem']['limit_bytes']`	The max. amount of user memory in the cgroup.	`TEXT`

The CPU information values are cached for 1s. They might differ from the actual values at query time. Use the probe timestamp to get the time of collection. When analyzing the CPU usage over time, always use os['probe_timestamp'] to calculate the time difference between 2 probes.

cgroup limitations¶

Note

cgroup metrics only work if the stats are available from /sys/fs/cgroup/cpu and /sys/fs/cgroup/cpuacct.

Uptime limitations¶

Note

os[‘uptime’] required a system call when running CrateDB on Windows or macOS, however, system calls are not permitted by default. If you require this metric you need to allow system calls by setting bootstrap.seccomp to false. This setting must be set in the crate.yml or via command line argument and cannot be changed at runtime.

`os_info`¶

Column Name	Description	Return Type
`os_info`	Operating system information	`OBJECT`
`os_info['available_processors']`	Number of processors that are available in the JVM. This is usually equal to the number of cores of the CPU.	`INTEGER`
`os_info['name']`	Name of the operating system (ex: Linux, Windows, macOS)	`TEXT`
`os_info['arch']`	Name of the JVM architecture (ex: amd64, x86)	`TEXT`
`os_info['version']`	Version of the operating system	`TEXT`
`os_info['jvm']`	Information about the JVM (Java Virtual Machine)	`OBJECT`
`os_info['jvm']['version']`	The JVM version	`TEXT`
`os_info['jvm']['vm_name']`	The name of the JVM (e.g. OpenJDK, Java HotSpot(TM) )	`TEXT`
`os_info['jvm']['vm_vendor']`	The vendor name of the JVM	`TEXT`
`os_info['jvm']['vm_version']`	The version of the JVM	`TEXT`

`connections`¶

Column Name	Description	Return Type
`http`	Number of connections established via HTTP	`OBJECT`
`http['open']`	The currently open connections established via HTTP	`BIGINT`
`http['total']`	The total number of connections that have been established via HTTP over the life time of a CrateDB node	`BIGINT`
`psql`	Number of connections established via PostgreSQL protocol	`OBJECT`
`psql['open']`	The currently open connections established via PostgreSQL protocol	`BIGINT`
`psql['total']`	The total number of connections that have been established via PostgreSQL protocol over the life time of a CrateDB node	`BIGINT`
`transport`	Number of connections established via Transport protocol	`OBJECT`
`transport['open']`	The currently open connections established via Transport protocol	`BIGINT`
`transport['total']`	The total number of connections that have been established via Transport protocol over the life time of a CrateDB node	`BIGINT`

`process`¶

Column Name	Description	Return Type
`process`	Statistics about the CrateDB process.	`OBJECT`
`process['open_file_descriptors']`	Number of currently open file descriptors used by the CrateDB process.	`BIGINT`
`process['max_open_file_descriptors']`	The maximum number of open file descriptors CrateDB can use.	`BIGINT`
`process['probe_timestamp']`	The system UNIX timestamp at the moment of the probe collection.	`BIGINT`
`process['cpu']`	Information about the CPU usage of the CrateDB process.	`OBJECT`
`process['cpu']['percent']`	The CPU usage of the CrateDB JVM process given in percent.	`SMALLINT`

The CPU information values are cached for 1s. They might differ from the actual values at query time. Use the probe timestamp to get the time of the collect. When analyzing the CPU usage over time, always use process['probe_timestamp'] to calculate the time difference between 2 probes.

Note

If one of the queried nodes is not responding within three seconds it returns null every column except id and name. This behaviour could be used to detect hanging nodes.

Node checks¶

The table sys.node_checks exposes a list of internal node checks and results of their validation.

The table schema is the following:

Column Name	Description	Return Type
`id`	The unique check ID.	`INTEGER`
`node_id`	The unique node ID.	`TEXT`
`severity`	The level of severity. The higher the value of the field the higher severity.	`INTEGER`
`description`	The description message for the setting check.	`TEXT`
`passed`	The flag determines whether the check for the setting has passed.	`BOOLEAN`
`acknowledged`	The flag determines whether the check for this setting has been acknowledged by the user in order to ignored the value of `passed` column. This column can be updated.	`BOOLEAN`

Example query:

cr> select id, node_id, description from sys.node_checks order by id, node_id;
+----+---------...-+--------------------------------------------------------------...-+
| id | node_id     | description                                                      |
+----+---------...-+--------------------------------------------------------------...-+
|  1 | ...         | It has been detected that the 'gateway.expected_data_nodes' s... |
|  2 | ...         | The cluster setting 'gateway.recover_after_data_nodes' (or th... |
|  3 | ...         | If any of the "expected data nodes" recovery settings are set... |
|  5 | ...         | The high disk watermark is exceeded on the node. The cluster ... |
|  6 | ...         | The low disk watermark is exceeded on the node. The cluster w... |
|  7 | ...         | The flood stage disk watermark is exceeded on the node. Table... |
|  8 | ...         | The amount of shards on the node reached 90 % of the limit of... |
+----+---------...-+--------------------------------------------------------------...-+
SELECT 7 rows in set (... sec)

Acknowledge failed checks¶

It is possible to acknowledge every check by updating the acknowledged column. By doing this, specially CrateDB’s built-in Admin UI won’t complain anymore about failing checks.

Imagine we’ve added a new node to our cluster, but as the gateway.expected_data_nodes column can only be set via config-file or command-line argument, the check for this setting will not pass on the already running nodes until the config-file or command-line argument on these nodes is updated and the nodes are restarted (which is not what we want on a healthy well running cluster).

In order to make the Admin UI accept a failing check (so the checks label goes green again), we must acknowledge this check by updating it’s acknowledged flag:

cr> update sys.node_checks set acknowledged = true where id = 1;
UPDATE OK, 1 row affected (... sec)

Caution

Updates on this column are transient, so changed values are lost after the affected node is restarted.

Description of checked node settings¶

Recovery expected data nodes¶

This check looks at the gateway.expected_data_nodes setting and checks if its value matches the actual number of data nodes present in the cluster. If the actual number of nodes is below the expected number, the warning is raised to indicate some nodes are down. If the actual number is greater, this is flagged to indicate the setting should be updated.

Note

For backward compatibility, setting the deprecated gateway.expected_nodes instead is still supported. It counts all nodes, not only data-carrying nodes.

Recovery after data nodes¶

This check looks at the gateway.recover_after_data_nodes setting and checks if its value is greater than half the configured expected number, but not greater than the configured expected number.

Note

For backward compatibility, setting the deprecated gateway.recover_after_nodes instead is still supported.

(E / 2) < R <= E

Here, R is the number of recovery nodes and E is the number of expected (data) nodes.

If recovery is started when some nodes are down, CrateDB proceeds on the basis the nodes that are down may not be coming back, and it will create new replicas and rebalance shards as necessary. This is throttled, and it can be controlled with routing allocation settings, but depending on the context, you may prefer to delay recovery if the nodes are only down for a short period of time, so it is advisable to review the documentation around the settings involved and configure them carefully.

Recovery after time¶

If gateway.recover_after_data_nodes is set, then gateway.recover_after_time must not be set to 0s, otherwise the gateway.recover_after_data_nodes setting wouldn’t have any effect.

Note

For backward compatibility, setting the deprecated gateway.recover_after_nodes instead is still supported.

Routing allocation disk watermark high¶

The check for the cluster.routing.allocation.disk.watermark.high setting verifies that the high watermark is not exceeded on the current node. The usage of each disk for configured CrateDB data paths is verified against the threshold setting. If one or more verification fails the check is marked as not passed.

Routing allocation disk watermark low¶

The check for the cluster.routing.allocation.disk.watermark.low which controls the low watermark for the node disk usage. The check verifies that the low watermark is not exceeded on the current node. The verification is done against each disk for configured CrateDB data paths. The check is not passed if the verification for one or more disk fails.

Maximum shards per node¶

The check verifies that the amount of shards on the current node is less than 90 percent of cluster.max_shards_per_node. Creating new tables or partitions which would push the number of shards beyond 100 % of the limit will be rejected.

Shards¶

The table sys.shards contains real-time statistics for all shards of all (non-system) tables.

Table schema¶

Column Name	Description	Return Type
`node`	Information about the node the shard is located at.	`OBJECT`
`node['name']`	The name of the node the shard is located at.	`TEXT`
`node['id']`	The id of the node the shard is located at.	`TEXT`
`blob_path`	Path to the directory which contains the blob files of the shard, or null if the shard is not a blob shard.	`TEXT`
`id`	The shard id. This shard id is managed by the system, ranging from 0 up to the number of configured shards of the table.	`INTEGER`
`min_lucene_version`	Shows the oldest Lucene segment version used in this shard.	`TEXT`
`num_docs`	The total number of documents within a shard. Documents inserted after the shard becomes `idle` are not included. A `refresh` is required to ensure they are counted. See refresh interval for more details about idle tables.	`BIGINT`
`oprhan_partition`	True if this shard belongs to an orphaned partition which doesn’t belong to any table anymore.	`BOOLEAN`
`partition_ident`	The partition ident of a partitioned table. Empty for non-partitioned tables.	`TEXT`
`partition_uuid`	Internal id for the storage partition the shard is part of. Even tables without `PARTITIONED BY` clause have an internal storage partition.	`TEXT`
`path`	Path to the shard directory on the filesystem. This directory contains state and index files.	`TEXT`
`primary`	Indicates if this shard is the primary shard.	`BOOLEAN`
`recovery`	Recovery statistics for a shard.	`OBJECT`
`recovery['files']`	File recovery statistics	`OBJECT`
`recovery['files']['percent']`	Percentage of files already recovered.	`REAL`
`recovery['files']['recovered']`	Number of files recovered in the shard. Includes both existing and reused files.	`INTEGER`
`recovery['files']['reused']`	Total number of files reused from a local copy while recovering the shard.	`INTEGER`
`recovery['files']['used']`	Total number of files in the shard.	`INTEGER`
`recovery['size']`	Recovery statistics for the shard in bytes	`OBJECT`
`recovery['size']['percent']`	Percentage of bytes already recovered	`REAL`
`recovery['size']['recovered']`	Number of bytes recovered. Includes both existing and re-used bytes.	`BIGINT`
`recovery['size']['reused']`	Number of bytes re-used from a local copy while recovering the shard.	`BIGINT`
`recovery['size']['used']`	Total number of bytes in the shard.	`BIGINT`
`recovery['stage']`	Recovery stage: init: Recovery has not started index: Reading the Lucene index meta-data and copying bytes from source to destination start: Starting the engine, opening the index for use translog: Replaying transaction log finalize: Cleanup done: Complete	`TEXT`
`recovery['total_time']`	Returns elapsed time from the start of the shard recovery.	`BIGINT`
`recovery['type']`	Recovery type: gateway snapshot replica relocating	`TEXT`
`relocating_node`	The id of the node to which the shard is getting relocated to.	`TEXT`
`routing_state`	The current state of the shard in the routing table. Possible states are: UNASSIGNED INITIALIZING STARTED RELOCATING	`TEXT`
`schema_name`	The schema name of the table the shard belongs to	`TEXT`
`size`	The current size in bytes. This value is cached for a short period and may return slightly outdated values.	`BIGINT`
`state`	The current state of the shard. Possible states are: CREATED RECOVERING POST_RECOVERY STARTED RELOCATED CLOSED INITIALIZING UNASSIGNED	`TEXT`
`closed`	The state of the table associated with the shard.	`BOOLEAN`
`table_name`	The name of the table this shard belongs to	`TEXT`
`seq_no_stats`	Contains information about internal sequence numbering and checkpoints for these sequence numbers.	`OBJECT`
`seq_no_stats['max_seq_no']`	The highest sequence number that has been issued so far on the shard.	`BIGINT`
`seq_no_stats['local_checkpoint']`	The highest sequence number for which all lower sequence number of been processed on this shard. Due to concurrent indexing this can be lower than max_seq_no.	`BIGINT`
`seq_no_stats['global_checkpoint']`	The highest sequence number for which the local shard can guarantee that all lower sequence numbers have been processed on all active shard copies.	`BIGINT`
`translog_stats`	Contains information for the translog of the shard.	`OBJECT`
`translog_stats['size']`	The current size of the translog file in bytes.	`BIGINT`
`translog_stats['uncommitted_size']`	The size in bytes of the translog that has not been committed to Lucene yet.	`BIGINT`
`translog_stats['number_of_operations']`	The number of operations recorded in the translog.	`INTEGER`
`translog_stats['uncommitted_operations']`	The number of operations in the translog which have not been committed to Lucene yet.	`INTEGER`
`retention_leases`	Versioned collection of retention leases.	`OBJECT`
`flush_stats`	Flush information. Shard relocation resets this information.	`OBJECT`
`flush_stats['count']`	The total amount of flush operations that happened on the shard.	`BIGINT`
`flush_stats['periodic_count']`	The number of periodic flushes. Each periodic flush also counts as a regular flush. A periodic flush can happen after writes depending on settings like the translog flush threshold.	`BIGINT`
`flush_stats['total_time_ns']`	The total time spent on flush operations on the shard.	`BIGINT`
`refresh_stats`	Refresh information. Shard relocation resets this information.	`OBJECT`
`refresh_stats['count']`	The total amount of refresh operations that happened on the shard.	`BIGINT`
`refresh_stats['pending_count']`	The number of pending refresh operations.	`BIGINT`
`refresh_stats['total_time_ns']`	The total time spent on refresh operations on the shard (in nanoseconds).	`BIGINT`
`merge_stats`	Segment merge information. Shard relocation resets this information.	`OBJECT`
`merge_stats['count']`	The total amount of segment merge operations that happened on the shard.	`BIGINT`
`merge_stats['total_num_docs']`	The total number of documents processed by all segment merges.	`BIGINT`
`merge_stats['total_size_bytes']`	The total bytes processed by all segment merges.	`BIGINT`
`merge_stats['total_time_ms']`	The total time spent on segment merge operations on the shard (in milliseconds).	`BIGINT`
`merge_stats['total_throttled_time_ms']`	The total time segment merge IO writes were throttled (in milliseconds).	`BIGINT`
`merge_stats['current_count']`	The total amount of current ongoing segment merge operations.	`BIGINT`
`merge_stats['current_num_docs']`	The total number of documents all current segment merges are processing.	`BIGINT`
`merge_stats['current_size_bytes']`	The total bytes all current segment merges are processing.	`BIGINT`
`merge_stats['bytes_per_sec_auto_throttle']`	The actual throttle value. A value of `-1` indicates that auto throttling is being disabled.	`BIGINT`
`last_write_before`	A timestamp showing the latest time that a write operation could have executed against this shard. The value is held in memory and not persisted through node restarts, so it may only indicate the point at which the node holding the shard started, but the last write is guaranteed to have happened before this timestamp.	`TIMESTAMP WITH TIME ZONE`

Note

The sys.shards table is subject to Shard table permissions.

Example¶

For example, you can query shards like this:

cr> select schema_name as schema,
...   table_name as t,
...   id,
...   partition_ident as p_i,
...   num_docs as docs,
...   primary,
...   relocating_node as r_n,
...   routing_state as r_state,
...   state,
...   orphan_partition as o_p
... from sys.shards where table_name = 'locations' and id = 1;
+--------+-----------+----+-----+------+---------+------+---------+---------+-------+
| schema | t         | id | p_i | docs | primary | r_n  | r_state |  state  | o_p   |
+--------+-----------+----+-----+------+---------+------+---------+---------+-------+
| doc    | locations |  1 |     |    4 | TRUE    | NULL | STARTED | STARTED | FALSE |
+--------+-----------+----+-----+------+---------+------+---------+---------+-------+
SELECT 1 row in set (... sec)

Segments¶

The sys.segments table contains information about the Lucene segments of the shards.

The segment information is useful to understand the behaviour of the underlying Lucene file structures for troubleshooting and performance optimization of shards.

Column Name	Description	Return Type
`segment_name`	Name of the segment, derived from the segment generation and used internally to create file names in the directory of the shard.	`TEXT`
`shard_id`	ID of the effected shard.	`INTEGER`
`table_schema`	Schema name of the table of the shard.	`TEXT`
`table_name`	Table name of the shard.	`TEXT`
`partition_ident`	The partition ident of a partitioned table. Empty for non-partitioned tables.	`TEXT`
`node`	Information about the node the shard is located at.	`OBJECT`
`node['name']`	The name of the node the shard is located at.	`TEXT`
`node['id']`	The id of the node the shard is located at.	`TEXT`
`generation`	Generation number of the segment, increments for each segment written.	`LONG`
`num_docs`	Number of non-deleted Lucene documents in this segment. Documents inserted after the shard becomes `idle` are not included. A `refresh` is required to ensure they are counted. See refresh interval for more details about idle tables.	`INTEGER`
`deleted_docs`	Number of deleted Lucene documents in this segment.	`INTEGER`
`size`	Disk space used by the segment in bytes.	`LONG`
`memory`	Unavailable starting from CrateDB 5.0. Always returns -1.	`LONG`
`committed`	Indicates if the segments are synced to disk. Segments that are synced can survive a hard reboot.	`BOOLEAN`
`primary`	Describes if this segment is part of a primary shard.	`BOOLEAN`
`search`	Indicates if the segment is searchable. If `false`, the segment has most likely been written to disk but needs a refresh to be searchable.	`BOOLEAN`
`version`	Version of Lucene used to write the segment.	`TEXT`
`compound`	If `true`, Lucene merges all files from the segment into a single file to save file descriptors.	`BOOLEAN`
`merge_id`	The id of an ongoing merge that this segment is included in, or `NULL` if the segment is not being merged.	`TEXT`
`fully_merged_docs`	The number of Lucene documents in this segment that have cleared their temporary recovery data. This data is removed during merges for all documents that can safely be replicated from previously committed segments. If the segment is no longer open for searching, this value will be `-1`.	`INTEGER`
`attributes`	Contains information about whether high compression was enabled.	`OBJECT`

Note

The information in the sys.segments table is expensive to calculate and therefore this information should be retrieved with awareness that it can have performance implications on the cluster.

Note

The sys.shards table is subject to Shard table permissions.

Jobs, operations, and logs¶

To let you inspect the activities currently taking place in a cluster, CrateDB provides system tables that let you track current cluster jobs and operations. See Jobs Table and Operations Table.

Jobs and operations that finished executing are additionally recorded in memory. There are two retention policies available to control how many records should be kept.

One option is to configure the maximum number of records which should be kept. Once the configured table size is reached, the older log records are deleted as newer records are added. This is configurable using stats.jobs_log_size and stats.operations_log_size.

Another option is to configure an expiration time for the records. In this case, the records in the logs tables are periodically cleared if they are older than the expiry time. This behaviour is configurable using stats.jobs_log_expiration and stats.operations_log_expiration.

In addition to these retention policies, there is a memory limit in place preventing these tables from taking up too much memory. The amount of memory that can be used to store the jobs can be configured using stats.breaker.log.jobs.limit and stats.breaker.log.operations.limit. If the memory limit is reached, an error message will be logged and the log table will be cleared completely.

It is also possible to define a filter which must match for jobs to be recorded after they finished executing. This can be useful to only record slow queries or queries that failed due to an error. This filter can be configured using the stats.jobs_log_filer setting.

Furthermore, there is a second filter setting which also results in a log entry in the regular CrateDB log file for all finished jobs that match this filter. This can be configured using stats.jobs_log_persistent_filter. This could be used to create a persistent slow query log.

Jobs¶

The sys.jobs table is a constantly updated view of all jobs that are currently being executed in the cluster.

Table schema¶

Column Name	Description	Return Type
`id`	The job UUID. This job ID is generated by the system.	`TEXT`
`node`	Information about the node that created the job.	`OBJECT`
`node['id']`	The id of the node.	`TEXT`
`node['name']`	The name of the node.	`TEXT`
`session_id`	Id of a session that run this job. References `id` column of the sys.sessions table.	`INTEGER`
`started`	The point in time when the job started.	`TIMESTAMP WITH TIME ZONE`
`stmt`	Shows the data query or manipulation statement represented by this job.	`TEXT`
`username`	The user who is executing the statement.	`TEXT`

The field username corresponds to the SESSION_USER that is performing the query:

cr> select stmt, username, started from sys.jobs where stmt like 'sel% from %jobs%';
+---------------------------------------------------------------------------------+----------+-...-----+
| stmt                                                                            | username | started |
+---------------------------------------------------------------------------------+----------+-...-----+
| select stmt, username, started from sys.jobs where stmt like 'sel% from %jobs%' | crate    | ...     |
+---------------------------------------------------------------------------------+----------+-...-----+
SELECT 1 row in set (... sec)

Note

If the user management module is not available, the username is given as crate.

The same session can be used to run different statements. They are scheduled and run one after another. last_statement in the sys.sessions table shows only last scheduled statement. To get all statements currently running under specific session use query below. The first statement will be a statement that is currently running and other statements will be shown in scheduled order:

SELECT session_id, array_agg(stmt) AS job_ids
    FROM (
      SELECT session_id, stmt
      FROM sys.jobs
      ORDER BY started
    ) ordered
GROUP BY session_id

Every request that queries data or manipulates data is considered a “job” if it is a valid query. Requests that are not valid queries (for example, a request that tries to query a non-existent table) will not show up as jobs.

Note

The sys.jobs table is subject to sys jobs tables permissions.

Jobs metrics¶

The sys.jobs_metrics table provides an overview of the query latency in the cluster. Jobs metrics are not persisted across node restarts.

The metrics are aggregated for each node and each unique classification of the statements.

Note

In order to reduce the memory requirements for these metrics, the times are statistically sampled and therefore may have slight inaccuracies. In addition, durations are only tracked up to 10 minutes. Statements taking longer than that are capped to 10 minutes.

`sys.jobs_metrics` Table schema¶

Column Name	Description	Return Type
`node`	An object containing the id and name of the node on which the metrics have been sampled.	`OBJECT`
`classification`	An object containing the statement classification.	`OBJECT`
`classification['type']`	The general type of the statement. Types are: `INSERT`, `SELECT`, `UPDATE`, `DELETE`, `COPY`, `DDL`, and `MANAGEMENT`.	`TEXT`
`classification['labels']`	Labels are only available for certain statement types that can be classified more accurately than just by their type.	`TEXT_ARRAY`
`total_count`	Total number of queries executed	`BIGINT`
`total_affected_row_count`	Total number of rows affected by the queries	`BIGINT`
`failed_count`	Total number of queries that failed to complete successfully.	`BIGINT`
`sum_of_durations`	Sum of durations in ms of all executed queries per statement type.	`BIGINT`
`stdev`	The standard deviation of the query latencies	`DOUBLE PRECISION`
`mean`	The mean query latency in ms	`DOUBLE PRECISION`
`max`	The maximum query latency in ms	`BIGINT`
`min`	The minimum query latency in ms	`BIGINT`
`percentiles`	An object containing different percentiles	`OBJECT`

Classification¶

Certain statement types (such as SELECT statements) have additional labels in their classification. These labels are the names of the logical plan operators that are involved in the query.

For example, the following UNION statement:

SELECT name FROM t1 where id = 1
UNION ALL
SELECT name FROM t2 where id < 2

would result in the following labels:

Union` for the UNION ALL
Get for the left SELECT
Collect for the right SELECT

Note

Labels may be subject to change as they only represent internal properties of the statement!

Operations¶

The sys.operations table is a constantly updated view of all operations that are currently being executed in the cluster:

cr> select node['name'], job_id, name, used_bytes from sys.operations
... order by name limit 1;
+--------------+--------...-+-----...-+------------+
| node['name'] | job_id     | name    | used_bytes |
+--------------+--------...-+-----...-+------------+
| crate        | ...        | ...     | ...        |
+--------------+--------...-+-----...-+------------+
SELECT 1 row in set (... sec)

An operation is a node-specific sub-component of a job (for when a job involves multi-node processing). Jobs that do not require multi-node processing will not produce any operations.

Table schema¶

Column Name	Description	Return Type
`id`	The operation UUID. This operation ID is generated by the system.	`TEXT`
`job_id`	The job id this operation belongs to.	`TEXT`
`name`	The name of the operation.	`TEXT`
`node`	Information about the node that created the operation.	`OBJECT`
`node['id']`	The id of the node.	`TEXT`
`node['name']`	The name of the node.	`TEXT`
`started`	The point in time when the operation started.	`TIMESTAMP WITH TIME ZONE`
`used_bytes`	Currently loaded amount of data by the operation.	`BIGINT`

Note

In some cases, operations are generated for internal CrateDB work that does not directly correspond to a user request. These entries do not have corresponding entries in sys.jobs.

Logs¶

The sys.jobs and sys.operations tables have corresponding log tables: sys.jobs_log and sys.operations_log.

`sys.jobs_log` Table schema¶

Column Name	Description	Return Type
`id`	The job ID.	`TEXT`
`ended`	The point in time when the job finished.	`TIMESTAMP WITH TIME ZONE`
`error`	If the job encountered an error, this will hold the error message.	`TEXT`
`started`	The point in time when the job started.	`TIMESTAMP WITH TIME ZONE`
`stmt`	Shows the data query or manipulation statement executed by the job.	`TEXT`
`username`	The user who executed the statement.	`TEXT`
`classification`	An object containing the statement classification.	`OBJECT`
`classification['type']`	The general type of the statement. Types are: `INSERT`, `SELECT`, `UPDATE`, `DELETE`,``COPY``, `DDL`, and `MANAGEMENT`.	`TEXT`
`classification['labels']`	Labels are only available for certain statement types that can be classified more accurately than just by their type.	`TEXT_ARRAY`
`node`	Information about the node that created the job.	`OBJECT`
`node['id']`	The id of the node.	`TEXT`
`node['name']`	The name of the node.	`TEXT`
`session_id`	Id of a session that run this job. References `id` column of the sys.sessions table.	`INTEGER`

Note

You can control which jobs are recorded using the stats.jobs_log_filter

Note

The sys.jobs_log table is subject to sys jobs tables permissions.

To get all statements that were run under specific session use a query below:

SELECT session_id, array_agg(stmt) AS job_ids
    FROM (
      SELECT session_id, stmt
      FROM sys.jobs_log
      ORDER BY started
    ) ordered
GROUP BY session_id

`sys.operations_log` Table schema¶

Column Name	Description	Return Type
`id`	The operation ID.	`TEXT`
`job_id`	The job id.	`TEXT`
`ended`	The point in time when the operation finished.	`TIMESTAMP WITH TIME ZONE`
`error`	If the operation encountered an error, this will hold the error message.	`TEXT`
`name`	The name of the operation.	`TEXT`
`started`	The point in time when the operation started.	`TIMESTAMP WITH TIME ZONE`
`used_bytes`	The amount of data loaded by the operation.	`BIGINT`

After a job or operation finishes, the corresponding entry will be moved into the corresponding log table:

cr> select id, stmt, username, started, ended, error
... from sys.jobs_log order by ended desc limit 2;
+-...+----------------------------------------------...-+----------+-...-----+-...---+-------+
| id | stmt                                             | username | started | ended | error |
+-...+----------------------------------------------...-+----------+-...-----+-...---+-------+
| ...| select node['name'], ...                         | crate    | ...     | ...   |  NULL |
| ...| select stmt, username, started from sys.jobs ... | crate    | ...     | ...   |  NULL |
+-...+----------------------------------------------...-+----------+-...-----+-...---+-------+
SELECT 2 rows in set (... sec)

Invalid queries are also logged in the sys.jobs_log table, i.e. queries that never make it to the sys.jobs table because they could not be executed.

The log tables are bound by a fixed size (stats.jobs_log_size) or by an expiration time (stats.jobs_log_expiration)

See Collecting stats for information on how to configure logs.

Caution

If you deactivate statistics tracking, the logs tables will be truncated.

Cluster checks¶

The table sys.checks exposes a list of internal cluster checks and results of their validation.

The sys.checks table looks like this:

Column Name	Description	Return Type
`id`	The unique check id.	`INTEGER`
`severity`	The level of severity. The higher the value of the field the higher severity.	`INTEGER`
`description`	The description message for the setting check.	`TEXT`
`passed`	The flag determines whether the check for the setting has passed.	`BOOLEAN`

Here’s an example query:

cr> select id, description from sys.checks order by id;
+----+--------------------------------------------------------------...-+
| id | description                                                      |
+----+--------------------------------------------------------------...-+
|  2 | The total number of partitions of one or more partitioned tab... |
|  3 | The following tables need to be recreated for compatibility w... |
+----+--------------------------------------------------------------...-+
SELECT 2 rows in set (... sec)

Cluster checks are also indicated in the CrateDB admin console. When all cluster checks (and all Node checks) pass, the Checks icon will be green. Here’s what it looks like when some checks are failing at the CRITICAL severity level:

Current Checks¶

Number of partitions¶

This check warns if any partitioned table has more than 1000 partitions to detect the usage of a high cardinality field for partitioning.

Tables need to be recreated¶

Warning

Do not attempt to upgrade your cluster to a newer major version if this cluster check is failing. Follow the instructions below to get this cluster check passing.

This check warns you if your cluster contains tables that you need to reindex before you can upgrade to a future major version of CrateDB.

If you try to upgrade to a later major CrateDB version without reindexing the tables, CrateDB will refuse to start.

CrateDB table version compatibility scheme¶

CrateDB maintains backward compatibility for tables created in majorVersion - 1:

Table Origin	Current Version	Current Version	Current Version
	4.x	5.x	6.x
4.x	✔️	✔️	❌
5.x	❌	✔️	✔️
6.x	❌	❌	✔️

Avoiding reindex using partitioned tables¶

Reindexing tables is an expensive operation which can take a long time. If you are storing time series data for a certain retention period and intend to delete old data, it is possible to use the partitioned tables to avoid reindex operations.

You will have to use a partition column that denotes time. For example, if you have a retention period of nine months, you could partition a table by a month column. Then, every month, the system will create a new partition. This new partition is created using the active CrateDB version and is compatible with the next major CrateDB version. Now to achieve your goal of avoiding a reindex, you must manually delete any partition older than nine months. If you do that, then after nine months you rolled through all partitions and the remaining nine are compatible with the next major CrateDB version.

How to reindex¶

Use SHOW CREATE TABLE to get the schema required to create an empty copy of the table to recreate:

+---------------------------------------------+
| SHOW CREATE TABLE rx.metrics                |
+---------------------------------------------+
| CREATE TABLE IF NOT EXISTS "rx"."metrics" ( |
|    "id" TEXT NOT NULL,                      |
|    "temperature" REAL,                      |
|    PRIMARY KEY ("id")                       |
| )                                           |
| CLUSTERED BY ("id") INTO 4 SHARDS           |
| WITH (                                      |
|    column_policy = 'strict',                |
|    number_of_replicas = '0-1'               |
| )                                           |
+---------------------------------------------+
SHOW 1 row in set (... sec)

Create a new temporary table, using the schema returned from SHOW CREATE TABLE:

cr> CREATE TABLE rx.tmp_metrics (id TEXT PRIMARY KEY, temperature REAL);
CREATE OK, 1 row affected (... sec)

Copy the data:

cr> INSERT INTO rx.tmp_metrics (id, temperature) (SELECT id, temperature FROM rx.metrics);
INSERT OK, 2 rows affected (... sec)

cr> REFRESH TABLE rx.tmp_metrics;
REFRESH OK, 1 row affected  (... sec)

Swap the tables:

cr> ALTER CLUSTER SWAP TABLE rx.tmp_metrics TO rx.metrics;
ALTER OK, 1 row affected  (... sec)

Confirm the new your_table contains all data and has the new version:

cr> SELECT count(*) FROM rx.metrics;
+-------+
| count |
+-------+
|     2 |
+-------+
SELECT 1 row in set (... sec)

cr> SELECT version['created'] FROM information_schema.tables
... WHERE table_schema = 'rx' AND table_name = 'metrics';
+--------------------+
| version['created'] |
+--------------------+
| 6.4.0              |
+--------------------+
SELECT 1 row in set (... sec)

Drop the old table, as it is now obsolete:

cr> DROP TABLE rx.tmp_metrics;
DROP OK, 1 row affected  (... sec)

After you reindexed all tables, this cluster check will pass.

Note

Snapshots of your tables created prior to them being upgraded will not work with future versions of CrateDB. For this reason, you should create a new snapshot for each of your tables. (See Snapshots.)

License check¶

Note

This check was removed in version 4.5 because CrateDB no longer requires an enterprise license, see also Farewell to the CrateDB Enterprise License.

This check warns you when your license is close to expiration, is already expired, or if the cluster contains more nodes than allowed by your license. It will yield a MEDIUM alert when your license is valid for less than 15 days and a HIGH alert when your license is valid for less than a day. All other cases, like already expired or max-nodes-violation, it will result in a HIGH alert. We recommend that you request a new license when this check triggers, in order to avoid the situation where operations are rejected due to an invalid license.

Health¶

The sys.health table lists the health of each table and table partition. The health is computed by checking the states of the shard of each table/partition.

Column Name	Description	Return Type
`table_name`	The table name.	`TEXT`
`table_schema`	The schema of the table.	`TEXT`
`partition_ident`	The ident of the partition. NULL for non-partitioned tables.	`TEXT`
`health`	The health label. Can be RED, YELLOW or GREEN.	`TEXT`
`severity`	The health as a `smallint` value. Useful when ordering on health.	`SMALLINT`
`missing_shards`	The number of not assigned or started shards.	`INTEGER`
`underreplicated_shards`	The number of shards which are not fully replicated.	`INTEGER`

Both missing_shards and underreplicated_shards might return -1 if the cluster is in an unhealthy state that prevents the exact number from being calculated. This could be the case when the cluster can’t elect a master, because there are not enough eligible nodes available.

cr> select * from sys.health order by severity desc, table_name;
+--------+----------------+-----------------+----------+------------+--------------+------------------------+
| health | missing_shards | partition_ident | severity | table_name | table_schema | underreplicated_shards |
+--------+----------------+-----------------+----------+------------+--------------+------------------------+
| GREEN  |              0 |            NULL |        1 | locations  | doc          |                      0 |
| GREEN  |              0 |            NULL |        1 | quotes     | doc          |                      0 |
+--------+----------------+-----------------+----------+------------+--------------+------------------------+
SELECT 2 rows in set (... sec)

The health with the highest severity will always define the health of the query scope.

Example of getting a cluster health (health of all tables):

cr> select health from sys.health order by severity desc limit 1;
+--------+
| health |
+--------+
| GREEN  |
+--------+
SELECT 1 row in set (... sec)

Health definition¶

Health	Description
`RED`	At least one primary shard is missing (primary shard not started or unassigned).
`YELLOW`	At least one shard is underreplicated (replica shard not started or unassigned).
`GREEN`	All primary and replica shards have been started.

Note

The sys.health table is subject to Shard table permissions as it will expose a summary of table shard states.

Cluster Health¶

The sys.cluster_health table returns the health of the entire cluster. Only a single entry is returned, containing the overall health of the cluster, including the overall number of missing shards and underreplicated shards of all tables. Any table-specific health issues, exposed by the sys.health will be reflected here as well.

Column Name	Description	Return Type
`health`	The cluster health label. Can be RED, YELLOW or GREEN.	`TEXT`
`severity`	The health as a `smallint` value. Useful when ordering on health.	`SMALLINT`
`description`	A description of the current health.	`TEXT`
`pending_tasks`	The number of pending cluster state update tasks.	`BIGINT`
`missing_shards`	The number of unassigned or not started shards of all tables.	`BIGINT`
`underreplicated_shards`	The number of shards which are not fully replicated of all tables.	`BIGINT`

cr> select * from sys.cluster_health order by severity desc;
+-------------+--------+----------------+---------------+----------+------------------------+
| description | health | missing_shards | pending_tasks | severity | underreplicated_shards |
+-------------+--------+----------------+---------------+----------+------------------------+
|             | GREEN  |              0 |             0 |        1 |                      0 |
+-------------+--------+----------------+---------------+----------+------------------------+
SELECT 1 row in set (... sec)

The health with the highest severity will always define the health of the query scope.

Repositories¶

The table sys.repositories lists all configured repositories that can be used to create, manage and restore snapshots (see Snapshots).

Column Name	Description	Return Type
`name`	The repository name	`TEXT`
`type`	The type of the repository determining how and where the repository stores its snapshots.	`TEXT`
`settings`	The configuration settings the repository has been created with. The specific settings depend on the repository type, see CREATE REPOSITORY.	`OBJECT`

cr> SELECT name, type, settings FROM sys.repositories
... ORDER BY name;
+---------+------+---------------------------------------------------...--+
| name    | type | settings                                               |
+---------+------+---------------------------------------------------...--+
| my_repo | fs   | {"compress": "true", "location": "repo_location", ...} |
+---------+------+---------------------------------------------------...--+
SELECT 1 row in set (... sec)

Note

Sensitive user account information will be masked and thus not visible to the user.

Snapshots¶

The table sys.snapshots lists all existing snapshots in all configured repositories (see Snapshots).

Column Name	Description	Return Type
`id`	UUID of the snapshot	`TEXT`
`name`	The name of the snapshot	`TEXT`
`repository`	The name of the repository that contains this snapshot.	`TEXT`
`concrete_indices`	Contains the names of all tables and partitions that are contained in this snapshot how they are represented as ES index names.	`ARRAY(TEXT)`
`tables`	Contains the fully qualified names of all tables within the snapshot.	`ARRAY(TEXT)`
`relations`	Contains the `table_schema` and `table_name` of all tables within the snapshot.	`ARRAY(OBJECT)`
`table_partitions`	Contains the table schema, table name and partition values of partitioned tables within the snapshot.	`ARRAY(OBJECT)`
`started`	The point in time when the creation of the snapshot started. Changes made after that are not stored in this snapshot.	`TIMESTAMP WITH TIME ZONE`
`finished`	The point in time when the snapshot creation finished.	`TIMESTAMP WITH TIME ZONE`
`state`	The current state of the snapshot. One of: `IN_PROGRESS`, `SUCCESS`, `PARTIAL`, or `FAILED`.	`TEXT`
`version`	An internal version this snapshot was created with.	`TEXT`
`failures`	A list of failures that occurred while taking the snapshot. If taking the snapshot was successful this is empty.	`ARRAY(TEXT)`
`reason`	Snapshot failure reason. If the snapshot was successful this is `NULL`.	`TEXT`
`total_shards`	Total number of primary shards in the snapshot.	`INTEGER`
`include_global_state`	Flag indicating that the snapshot was taken with the `ALL` option.	`BOOLEAN`

Snapshot/Restore operates on a per-shard basis. Hence, the state column indicates whether all (SUCCESS), some (PARTIAL), or no shards(FAILED) have been backed up. PARTIAL snapshots are the result of some primaries becoming unavailable while taking the snapshot when there are no replicas at hand (cluster state is RED). If there are replicas of the (now unavailable) primaries (cluster state is YELLOW) the snapshot succeeds and all shards are included (state SUCCESS). Building on a PARTIAL snapshot will include all primaries again.

Warning

In case of a PARTIAL state another snapshot should be created in order to guarantee a full backup! Only SUCCESS includes all shards.

The concrete_indices column contains the names of all Elasticsearch indices that were stored in the snapshot. A normal CrateDB table maps to one Elasticsearch index, a partitioned table maps to one Elasticsearch index per partition. The mapping follows the following pattern:

CrateDB table / partition name	`concrete_indices` entry
`doc.my_table`	`my_table`
`my_schema.my_table`	`my_schema.my_table`
`doc.parted_table` (value=null)	`.partitioned.my_table.0400`
`my_schema.parted_table` (value=null)	`my_schema..partitioned.my_table.0400`

cr> SELECT "repository", name, state, concrete_indices
... FROM sys.snapshots order by "repository", name;
+------------+-------------+---------+-----------------...-+
| repository | name        | state   | concrete_indices    |
+------------+-------------+---------+-----------------...-+
| my_repo    | my_snapshot | SUCCESS | [...]               |
+------------+-------------+---------+-----------------...-+
SELECT 1 row in set (... sec)

Snapshot Restore¶

The sys.snapshot_restore table contains information about the current state of snapshot restore operations.

pg_stats schema¶
Name	Description	Type
`id`	The `UUID` of the restore snapshot operation.	`TEXT`
`repository`	The name of the repository that contains the snapshot.	`TEXT`
`snapshot`	The name of the snapshot.	`TEXT`
`state`	The current state of the snapshot restore operations. Possible states are: `INIT`, `STARTED`, `SUCCESS`, and `FAILURE`.	`TEXT`
`shards['table_schema']`	The schema name of the table of the shard.	`TEXT`
`shards['table_name']`	The table name of the shard.	`TEXT`
`shards['partition_ident']`	The identifier of the partition of the shard. `NULL` if the is not partitioned.	`TEXT`
`shards['shard_id']`	The ID of the shard.	`INTEGER`
`shards['state']`	The restore state of the shard. Possible states are: `INIT`, `STARTED`, `SUCCESS`, and `FAILURE`.	`TEXT`

To get more information about the restoring snapshots and shards one can join the sys.snapshot_restore with sys.shards or sys.snapshots table.

Summits¶

The sys.summits table contains the information about the mountains in the Alps higher than 2000m. The mountain names from the table are also used to generate random nodes names.

Users¶

The sys.users table contains all existing database users in the cluster.

Column Name	Description	Return Type
`name`	The name of the database user.	`TEXT`
`superuser`	Flag to indicate whether the user is a superuser.	`BOOLEAN`
`password`	`********` if there is a password set or `NULL` if there is not.	`TEXT`
`granted_roles`	A list of parent roles granted to the user	`ARRAY`
`granted_roles[role]`	The name of the role granted to the user	`TEXT`
`granted_roles[grantor]`	The name of user who granted the role to the user	`TEXT`
`jwt`	JWT authentication properties	`OBJECT`
`jwt[aud]`	Recipient that the JWT is intended for	`TEXT`
`jwt[iss]`	JWK endpoint URL	`TEXT`
`jwt[username]`	User name in a third party app	`TEXT`

Roles¶

The sys.roles table contains all existing database roles in the cluster.

Column Name	Description	Return Type
`name`	The name of the database user.	`TEXT`
`granted_roles`	A list of parent roles granted to the user	`ARRAY`
`granted_roles[role]`	The name of the role granted to the user	`TEXT`
`granted_roles[grantor]`	The name of user who granted the role to the user	`TEXT`

Privileges¶

The sys.privileges table contains all privileges for each user and role of the database.

Column Name	Description	Return Type
`class`	The class on which the privilege applies	`TEXT`
`grantee`	The name of the database user or role for which the privilege is granted or denied	`TEXT`
`grantor`	The name of the database user who granted or denied the privilege	`TEXT`
`ident`	The name of the database object on which the privilege applies	`TEXT`
`state`	Either `GRANT` or `DENY`, which indicates if the user or role has been granted or denied access to the specific database object	`ARRAY`
`type`	The type of access for the specific database object	`TEXT`

Sessions¶

Column Name	Description	Return Type
`id`	The unique identifier of the session within a single node. Use together with `handler_node` to uniquely identify sessions across a cluster	`INTEGER`
`auth_user`	The user which was authenticated for the session	`TEXT`
`session_user`	The user, possibly different than the `auth_user` which is currently active in the session, see: SET AND RESET SESSION AUTHORIZATION	`TEXT`
`handler_node`	The name of the node on which the session is created	`TEXT`
`client_address`	The IPv4 or IPv6 network address of the client which opened the session	`TEXT`
`time_created`	The time on which the session was created	`TIMESTAMPTZ`
`protocol`	`http` or `pg` to denote weather the session is created through HTTP endpoint or PostgreSQL wire protocol	`TEXT`
`ssl`	A flag which denotes if `SSL` encryption is used between the client which opened the session and the `handler_node`	`BOOLEAN`
`settings`	The session settings	`OBJECT`
`last_statement`	The last SQL statement which was executed in the session	`TEXT`
`last_job_id`	UUID of the last job which was executed in the session. References the `id` column of the Jobs and Logs tables.	`TEXT`

Allocations¶

The sys.allocations table contains information about shards and their allocation state. The table contains:

shards that are unassigned and why they are unassigned
shards that are assigned but cannot be moved or rebalanced and why they remain on their current node

It can help to identify problems if shard allocations behave different than expected, e.g. when a shard stays unassigned or a shard does not move off a node.

Column Name	Description	Return Type
`table_schema`	Schema name of the table of the shard.	`TEXT`
`table_name`	Table name of the shard.	`TEXT`
`partition_ident`	Identifier of the partition of the shard. `NULL` if the table is not partitioned.	`TEXT`
`shard_id`	ID of the effected shard.	`INTEGER`
`node_id`	ID of the node on which the shard resides. `NULL` if the shard is unassigned.	`TEXT`
`primary`	Whether the shard is a primary shard.	`BOOLEAN`
`current_state`	Current state of the shard. Possible states are: `UNASSIGNED`, `INITIALIZING`, `STARTED`, `RELOCATING`	`TEXT`
`explanation`	Explanation why the shard cannot be allocated, moved or rebalanced.	`TEXT`
`decisions`	A list of decisions that describe in detail why the shard in the current state.	`ARRAY`
`decisions['node_id']`	ID of the node of the decision.	`TEXT`
`decisions['node_name']`	Name of the node of the decision.	`TEXT`
`decisions['explanations']`	Detailed list of human readable explanations why the node decided whether to allocate or rebalance the shard. Returns `NULL` if there is no need to rebalance the shard.	`ARRAY`

Note

The sys.allocations table is subject to Shard table permissions.

Shard table permissions¶

Accessing tables that return shards (sys.shards, sys.allocations) is subjected to the same privileges constraints as the other tables. Namely, in order to query them, the connected user needs to have the DQL privilege on that particular table, either directly or inherited from the SCHEMA or CLUSTER (for more information on privileges inheritance see Hierarchical Inheritance of Privileges).

However, being able to query shard returning system tables will not allow the user to retrieve all the rows in the table, as they may contain information related to tables, which the connected user does not have any privileges for. The only rows that will be returned will be the ones the user is allowed to access.

For example, if the user john has any privilege on the doc.books table but no privilege at all on doc.locations, when john issues a SELECT * FROM sys.shards statement, the shards information related to the doc.locations table will not be returned.

`sys` jobs tables permissions¶

Accessing sys.jobs and sys.jobs_log tables is subjected to the same privileges constraints as other tables. To query them, the current user needs to have the DQL privilege on that particular table, either directly or inherited from the SCHEMA or CLUSTER.

A user that doesn’t have superuser privileges is allowed to retrieve only their own job logs entries, while a user with superuser privileges has access to all.

pg_stats¶

The pg_stats table in the pg_catalog system schema contains statistical data about the contents of the CrateDB cluster.

Entries are periodically created or updated in the interval configured with the stats.service.interval setting.

Alternatively the statistics can also be updated using the ANALYZE command.

The table contains 1 entry per column for each table in the cluster which has been analyzed.

pg_stats schema¶
Name	Type	Description
schemaname	text	Name of the schema containing the table.
tablename	text	Name of the table.
attname	text	Name of the column.
inherited	bool	Always false in CrateDB; For compatibility with PostgreSQL.
null_frac	real	Fraction of column entries that are null.
avg_width	integer	Average size in bytes of column’s entries.
n_distinct	real	An approximation of the number of distinct values in a column.
most_common_vals	string[]	A list of the most common values in the column. null if no values seem. more common than others.
most_common_freqs	real[]	A list of the frequencies of the most common values. The size of the array always matches most_common_vals. If most_common_vals is null this is null as well.
histogram_bounds	string[]	A list of values that divide the column’s values into groups of approximately equal population. The values in most_common_vals, if present, are omitted from this histogram calculation.
correlation	real	Always 0.0. This column exists for PostgreSQL compatibility.
most_common_elems	string[]	Always null. Exists for PostgreSQL compatibility.
most_common_elem_freqs	real[]	Always null. Exists for PostgreSQL compatibility.
elem_count_histogram	real[]	Always null. Exists for PostgreSQL compatibility.

Note

Not all data types support creating statistics. So some columns may not show up in the table.

pg_publication¶

The pg_publication table in the pg_catalog system schema contains all publications created in the cluster.

Column Name	Description	Return Type
`oid`	Row identifier.	`INTEGER`
`pubname`	Name of the publication.	`TEXT`
`pubowner`	`oid` of the owner of the publication.	`INTEGER`
`puballtables`	Whether this publication includes all tables in the cluster, including tables created in the future.	`BOOLEAN`
`pubinsert`	Whether `INSERT` operations are replicated for tables in the publication. Always `true`.	`BOOLEAN`
`pubupdate`	Whether `UPDATE` operations are replicated for tables in the publication. Always `true`.	`BOOLEAN`
`pubdelete`	Whether `DELETE` operations are replicated for tables in the publication. Always `true`.	`BOOLEAN`

pg_publication_tables¶

The pg_publication_tables table in the pg_catalog system schema contains tables replicated by a publication.

Column Name	Description	Return Type
`pubname`	Name of the publication.	`TEXT`
`schemaname`	Name of the schema containing table.	`TEXT`
`tablename`	Name of the table.	`TEXT`

pg_subscription¶

The pg_subscription table in the pg_catalog system schema contains all subscriptions created in the cluster.

Column Name	Description	Return Type
`oid`	Row identifier.	`INTEGER`
`subdbid`	noop value, always `0`.	`INTEGER`
`subname`	Name of the subscription.	`TEXT`
`subowner`	`oid` of the owner of the subscription.	`INTEGER`
`subenabled`	Whether the subscription is enabled, always `true`.	`BOOLEAN`
`subbinary`	Noop value, always `true`.	`BOOLEAN`
`substream`	Noop value, always `true`.	`BOOLEAN`
`subconninfo`	Connection string to the publishing cluster.	`TEXT`
`subslotname`	Noop value, always `NULL`.	`TEXT`
`subsynccommit`	Noop value, always `NULL`.	`TEXT`
`subpublications`	Array of subscribed publication names. These publications are defined in the publishing cluster.	`ARRAY`

pg_subscription_rel¶

The pg_subscription_rel table in the pg_catalog system schema contains the state for each replicated relation in each subscription.

Column Name	Description	Return Type
`srsubid`	Reference to subscription.	`INTEGER`
`srrelid`	Reference to relation.	`REGCLASS`
`srsubstate`	Replication state of the relation. State code: `i` - initializing; `d` - restoring; `r` - monitoring, i.e. waiting for new changes; `e` - error.	`TEXT`
`srsubstate_reason`	Error message if there was a replication error for the relation or `NULL`.	`TEXT`
`srsublsn`	Noop value, always `NULL`.	`LONG`

System information¶

Cluster¶

Cluster settings¶

Nodes¶

id¶

name¶

hostname¶

rest_url¶

is_master¶

roles¶

attributes¶

port¶

load¶

mem¶

heap¶

version¶

cluster_state_version¶

fs¶

thread_pools¶

os¶

cgroup limitations¶

Uptime limitations¶

os_info¶

connections¶

process¶

Node checks¶

Acknowledge failed checks¶

Description of checked node settings¶

Recovery expected data nodes¶

Recovery after data nodes¶

Recovery after time¶

Routing allocation disk watermark high¶

Routing allocation disk watermark low¶

Maximum shards per node¶

Shards¶

Table schema¶

Example¶

Segments¶

Jobs, operations, and logs¶

Jobs¶

Table schema¶

Jobs metrics¶

sys.jobs_metrics Table schema¶

Classification¶

Operations¶

Table schema¶

Logs¶

sys.jobs_log Table schema¶

sys.operations_log Table schema¶

Cluster checks¶

Current Checks¶

Number of partitions¶

Tables need to be recreated¶

CrateDB table version compatibility scheme¶

Avoiding reindex using partitioned tables¶

How to reindex¶

License check¶

Health¶

Health definition¶

Cluster Health¶

Repositories¶

Snapshots¶

Snapshot Restore¶

Summits¶

Users¶

Roles¶

Privileges¶

Sessions¶

Allocations¶

Shard table permissions¶

sys jobs tables permissions¶

pg_stats¶

pg_publication¶

pg_publication_tables¶

pg_subscription¶

pg_subscription_rel¶

`id`¶

`name`¶

`hostname`¶

`rest_url`¶

`is_master`¶

`roles`¶

`attributes`¶

`port`¶

`load`¶

`mem`¶

`heap`¶

`version`¶

`cluster_state_version`¶

`fs`¶

`thread_pools`¶

`os`¶

`os_info`¶

`connections`¶

`process`¶

`sys.jobs_metrics` Table schema¶

`sys.jobs_log` Table schema¶

`sys.operations_log` Table schema¶

`sys` jobs tables permissions¶