The crate-node command
Use the crate-node command to troubleshoot CrateDB cluster nodes. Using this command, you can:
Repurpose nodes and clean up their old data.
Force the election of a master node (and the creation of a new cluster) if you have lost too many nodes to form a quorum.
Detach nodes from an old cluster so they can be moved to a new cluster.
Repurpose a node
About
In a situation where you have irrecoverably lost the majority of the master-eligible nodes in a cluster, you may need to form a new cluster. When forming a new cluster, you may have to change the role of one or more nodes. Changing the role of a node is referred to as repurposing a node.
Each node checks the contents of its data path at startup. If CrateDB discovers unexpected data, it will refuse to start. The specific rules are:
Nodes configured with node.data set to false will refuse to start if they find any shard data at startup.
Nodes configured with both node.master and node.data set to false will refuse to start if they find any index metadata at startup.
The crate-node repurpose command can help you clean up the necessary node data, so that CrateDB can be restarted with a new role.
Procedure
To repurpose a node, first stop it.
Then, update the node.data and node.master settings in the crate.yml configuration file as needed.
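If you prefer to script this change, the following is a minimal POSIX shell sketch. The function names set_setting and set_node_role are hypothetical, and the sketch assumes that crate.yml stores node.master and node.data as flat, top-level "key: value" lines (not nested YAML) and ends with a newline. Only run it while the node is stopped.

```shell
#!/bin/sh
# Hypothetical helper: set_setting FILE KEY VALUE
# Assumes flat "key: value" lines in crate.yml (no nested YAML blocks).
set_setting() {
  file=$1 key=$2 value=$3
  # Escape the dots in the key so that they match literally in the regex.
  pattern="^$(printf '%s' "$key" | sed 's/\./\\./g'):"
  if grep -q "$pattern" "$file"; then
    # Replace the existing line, writing via a temporary file for portability.
    sed "s/${pattern}.*/${key}: ${value}/" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  else
    # The setting is not present yet, so append it.
    printf '%s: %s\n' "$key" "$value" >> "$file"
  fi
}

# Hypothetical helper: set_node_role FILE MASTER DATA (values: true|false)
set_node_role() {
  set_setting "$1" node.master "$2"
  set_setting "$1" node.data "$3"
}
```

For example, set_node_role "$CRATE_HOME/config/crate.yml" true false would configure a master-only node; the config path is an assumption, so adjust it to your installation.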
The node.data and node.master settings can be configured in four different ways, each corresponding to a different type of node:
| Role | Configuration | After repurposing: shard data | After repurposing: index metadata |
|---|---|---|---|
| Master-eligible | node.master: true, node.data: true | Kept | Kept |
| Master-only | node.master: true, node.data: false | Deleted | Kept |
| Data-only | node.master: false, node.data: true | Kept | Deleted |
| Coordination-only | node.master: false, node.data: false | Deleted | Deleted |
The “After repurposing” columns in the table above indicate which data (if any) will be deleted (i.e., “cleaned up”) when you repurpose a node to that configuration.
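To double-check a planned change before running repurpose, the table can be restated as a small lookup. This is only a sketch: the function name repurpose_cleanup is hypothetical, and it encodes the table above without inspecting the node or its data path.

```shell
#!/bin/sh
# Hypothetical helper: repurpose_cleanup MASTER DATA (values: true|false)
# Prints which data the table above says is deleted after repurposing a node
# to the given node.master / node.data combination.
repurpose_cleanup() {
  case "$1:$2" in
    true:true)   echo "nothing" ;;                         # master-eligible
    true:false)  echo "shard data" ;;                      # master-only
    false:true)  echo "index metadata" ;;                  # data-only
    false:false) echo "shard data and index metadata" ;;   # coordination-only
    *) echo "usage: repurpose_cleanup true|false true|false" >&2; return 1 ;;
  esac
}
```

For example, repurpose_cleanup true false prints "shard data", matching the master-only row.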
Warning
Before running the repurpose command, make sure that any data you want to keep is available on other nodes in the cluster.
Then, invoke the repurpose command.
sh$ ./bin/crate-node repurpose
Found 2 shards in 2 tables to clean up.
Use -v to see a list of paths and tables affected.
Node is being repurposed as master and no-data. Clean-up of shard data will
be performed.
Do you want to proceed?
Confirm [y/N] y
Node successfully repurposed to master and no data.
As mentioned in the command output, you can pass in -v to get more verbose output.
sh$ ./bin/crate-node repurpose -v
Finally, start the node again. It now runs in its new role.
Perform an unsafe cluster bootstrap
About
When communication is lost between one or more nodes in a cluster (e.g., during a network partition), the situation is assumed to be temporary and safeguards exist to prevent the election of a master node unless a quorum can be established.
However, if the situation is permanent (i.e., you have irrecoverably lost a majority of the nodes in your cluster), you will need to force the election of a master. Forcing a master election without a quorum is referred to as an unsafe cluster bootstrap.
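To make the notion of a quorum concrete: assuming every master-eligible node has a vote, a quorum is a strict majority, floor(N/2) + 1 of N nodes. A three-node cluster can therefore lose one master-eligible node but not two. The sketch below only computes this arithmetic; the function names are illustrative.

```shell
#!/bin/sh
# quorum N: size of a strict majority of N voting master-eligible nodes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL REACHABLE: succeeds if REACHABLE nodes form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}
```

For example, quorum 5 prints 3, so has_quorum 5 2 fails: with three of five master-eligible nodes permanently lost, no master can be elected safely.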
The unsafe-bootstrap command helps you choose a new master node and then perform an unsafe cluster bootstrap.
Warning
An unsafe bootstrap should be your last resort.
When you perform an unsafe bootstrap, you are effectively abandoning the data on any unreachable nodes. This may result in arbitrary data loss and inconsistencies.
Before you attempt this, we recommend you try one or both of the following:
Build a new cluster from a recent snapshot and then re-import any data that was ingested since the snapshot was taken.
Recreate lost nodes using a copy of the data kept in the CRATE_HOME directory, if you still have access to the file system.
Procedure
Before you continue, you must stop all master-eligible nodes in the cluster.
Caution
The unsafe-bootstrap command will return an error message if the node you issue it from is still running. However, it does not check the running status of any other nodes in the cluster. You must verify for yourself that all other master-eligible nodes are stopped before proceeding.
Once all master-eligible nodes in the cluster have been stopped, you can manually select a new master.
To help you select a new master node, the unsafe-bootstrap command reports the node's cluster state as a pair of values in the form (term, version).
You can gather this information safely by issuing the unsafe-bootstrap command and answering “no” (n) at the confirmation prompt.
sh$ ./bin/crate-node unsafe-bootstrap
WARNING: CrateDB MUST be stopped before running this tool.
Current node cluster state (term, version) pair is (4, 12)
Do you want to proceed?
Confirm [y/N] n
Here, the node cluster state has a term value of 4 and a version value of 12.
Run this command on every master-eligible node in the cluster (making sure to answer “no” each time) and make a note of each respective value pair.
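If you save each node's output (for example, by copying it from the terminal into a text file), the value pair can be extracted with a short sed expression. This sketch assumes the output line looks exactly like the example above; the function name state_pair is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads unsafe-bootstrap output on stdin and prints
# "TERM VERSION" from a line of the form
#   Current node cluster state (term, version) pair is (4, 12)
state_pair() {
  sed -n 's/.*(term, version) pair is (\([0-9]*\), \([0-9]*\)).*/\1 \2/p'
}
```

Applied to the output shown above, state_pair prints "4 12".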
Once you’re done, select the node with the highest term value. If multiple nodes share the highest term value, select the one with the highest version value. If multiple nodes share the highest term value and the highest version value, select any one of them.
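The selection rule amounts to a numeric sort on term, then version, both descending. A minimal sketch, assuming you have recorded one "NODE TERM VERSION" line per master-eligible node; the function name pick_master and the node names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "NODE TERM VERSION" lines on stdin and prints the
# node with the highest term, breaking ties by the highest version.
pick_master() {
  # Sort numerically, descending, by term (field 2), then by version (field 3),
  # and keep the node name from the first line.
  sort -k2,2nr -k3,3nr | head -n 1 | awk '{ print $1 }'
}
```

For example, printf 'node-a 4 12\nnode-b 5 3\nnode-c 5 7\n' | pick_master selects node-c: node-b and node-c share the highest term (5), and node-c has the higher version.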
Note
Selecting the node with the highest state values (per the above) ensures that you elect a master node with the freshest state data. This, in turn, minimizes the potential for data loss and inconsistency.
Once you have selected the node to elect as master, invoke the unsafe-bootstrap command on that node and answer “yes” (y) at the confirmation prompt.
sh$ ./bin/crate-node unsafe-bootstrap
WARNING: CrateDB MUST be stopped before running this tool.
Current node cluster state (term, version) pair is (4, 12)
Do you want to proceed?
Confirm [y/N] y
If the operation was successful, the tool acknowledges it with the following message:
Master node was successfully bootstrapped
Note that this message only indicates that the operation completed. You may still experience data loss and inconsistencies.
Now, start the bootstrapped node and verify that it has started a new cluster with one node and elected itself as the master.
Before you can add the rest of the nodes to the new cluster, you must detach them from the old cluster (see the next section).
After that’s done, start the nodes and verify that they join the new cluster.
Note
Once the new cluster is up and running and all recoveries are complete, we recommend that you check the database for data loss and inconsistencies.
Detach a node from its cluster
About
To protect nodes from inadvertently rejoining the wrong cluster (e.g., in the event of a network partition), each node binds to the first cluster it joins.
However, if a cluster has permanently failed (see the previous section), you must detach its nodes before you can move them to a new cluster.
The detach-cluster command lets you move a node to a new cluster by resetting the cluster it is bound to (i.e., detaching it from its existing cluster).
Warning
Do not attempt to move a node from one logical cluster to another. You cannot merge two clusters in this fashion.
You should only detach a node after performing an unsafe cluster bootstrap.
Procedure
To detach a node, run:
sh$ ./bin/crate-node detach-cluster
WARNING: CrateDB MUST be stopped before running this tool.
Do you want to proceed?
Confirm [y/N] y
A corresponding message confirms success.
Node was successfully detached from the cluster.
When the node is started again, it will be able to join a new cluster.
Note
You may also have to update the discovery configuration, so that nodes are able to find the new cluster.