Bulk Inserts¶
If you need to import a large amount of pre-existing data in bulk, this how-to guide shows you how to improve import performance.
Briefly, when importing data in bulk, keep the following in mind:
Reduce the number of replicas as much as possible (ideally to zero), as replication slows down the import process significantly
Only use as many shards as you need
Disable the periodic table refresh by setting the refresh interval to zero during import
Adjust the bulk size of the import as necessary given the number of records you are importing
Import table partitions separately using the PARTITION clause in the COPY FROM statement
The rest of this document goes into more detail.
Tip
Increasing disk IOPS is the best, and perhaps most straightforward, way to increase import speed. SSDs are recommended (as general good practice) for this reason.
Another way to increase import speed is to add more disks by adding more nodes. CrateDB is a distributed database, and so, increasing overall cluster size is generally a good way to improve performance.
Configure your tables¶
Data structure¶
Before starting the import, you need to carefully define your table structure in CrateDB. Decisions made at this point will influence the import process later.
For this example, say you have a single users table with five columns of various types. You can create this table with the following statement:
cr> CREATE TABLE users (
... id INT primary key,
... name TEXT,
... day_joined TIMESTAMP,
... bio TEXT INDEX using fulltext,
... address OBJECT (dynamic) AS (
... city TEXT,
... country TEXT
... )
... );
CREATE OK, 1 row affected (... sec)
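As mentioned at the start of this guide, it also pays to reduce the number of replicas before importing. As a minimal sketch, assuming the users table from this example, you can disable replication on an existing table like so:
cr> ALTER TABLE users SET (number_of_replicas = 0);
ALTER OK, -1 rows affected (... sec)
When the import has finished, you can set number_of_replicas back to your desired value.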
Refresh interval¶
Another way to speed up importing is to set the refresh_interval of the table to zero:
cr> ALTER TABLE users SET (refresh_interval = 0);
ALTER OK, -1 rows affected (... sec)
This will disable the periodic refresh of the table, which, in turn, minimizes processing overhead during the import.
You can also set the refresh interval when initially creating the table, like so:
cr> CREATE TABLE users (
... id INT primary key,
... name TEXT,
... day_joined TIMESTAMP,
... bio TEXT INDEX using fulltext,
... address OBJECT (dynamic) AS (
... city TEXT,
... country TEXT
... )
... ) CLUSTERED INTO 12 shards
... WITH (
... number_of_replicas = 0,
... refresh_interval = 0
... );
CREATE OK, 1 row affected (... sec)
When the import has finished, you can set the refresh interval back to a reasonable value (measured in milliseconds):
cr> ALTER TABLE users SET (refresh_interval = 1000);
ALTER OK, -1 rows affected (... sec)
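Note that while the periodic refresh is disabled, newly imported rows will not show up in query results until a refresh happens. If you want to inspect the data before re-enabling the periodic refresh, you can trigger one manually. A minimal sketch, using the users table from this example:
cr> REFRESH TABLE users;
REFRESH OK, 1 row affected (... sec)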
Import the data¶
Once the table is created, you can start importing the data.
When importing, CrateDB has native support for JSON data. Specifically, for bulk inserts, you can use a format called JSON Lines. In a JSON Lines file, each line is a JSON string representing a single record. Empty lines are skipped. The keys of the JSON objects are mapped to columns when imported. Nonexistent columns will be created if necessary.
For example, a JSON Lines file might look like this:
{"id": 1, "name": "foo", "day_joined": 1408312800, "bio": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", "address": {"city": "Dornbirn", "country": "Austria"}}
{"id": 2, "name": "bar", "day_joined": 1408312800, "bio": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", "address": {"city": "Berlin", "country": "Germany"}}
Use the COPY FROM statement to import JSON data directly from a file:
cr> COPY users FROM '/tmp/best_practice_data/users.jsonl';
COPY OK, 150 rows affected (... sec)
Here, CrateDB will check /tmp/best_practice_data/ locally on each node in the cluster and import data from a file called users.jsonl.
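If your data is split across multiple files in the same directory, you can import them all with a single statement by using a glob pattern in the path. A sketch, assuming the matched files together contain the same 150 records as above:
cr> COPY users FROM '/tmp/best_practice_data/*.jsonl';
COPY OK, 150 rows affected (... sec)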
Tip
If you are using Microsoft Windows, you must include the drive letter in the filename. For example, the above filename should instead be written as C://tmp/best_practice_data/users.jsonl.
Consult the Windows documentation for more information.
Caution
If the specified file is not found, CrateDB will still return a successful status, for example:
COPY OK, 0 rows affected (... sec)
Additionally, when importing data using COPY FROM, CrateDB does not check whether the types of the columns and the types in the import file match. CrateDB does not cast the imported data to the target column type. Instead, CrateDB will import the data as given in the source file.
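Because a missing file still returns a successful status, it is worth verifying the row count after an import. A minimal sketch, assuming the 150-record import from above (remember to refresh the table first if you disabled the periodic refresh):
cr> SELECT count(*) FROM users;
+----------+
| count(*) |
+----------+
|      150 |
+----------+
SELECT 1 row in set (... sec)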
Bulk size¶
You can improve on the example above by configuring the bulk size option, like so:
cr> COPY users FROM '/tmp/best_practice_data/users.jsonl'
... WITH (bulk_size = 2000);
COPY OK, 150 rows affected (... sec)
The bulk_size option specifies the number of lines that are read at once during the import. This option defaults to 10000.
Tip
In our example use case, configuring bulk_size will not make any practical difference.
However, if you have a more complex dataset with many columns and large values, it makes sense to decrease the bulk_size. A bulk_size setting that is too high might consume a lot of node resources, while a setting that is too low increases the per-request overhead.
Compression¶
We recommend that you use gzip to compress your JSON files.
However, CrateDB does not automatically detect file compression, so you will need to specify gzip compression explicitly, like so:
cr> COPY users FROM '/tmp/best_practice_data/users.jsonl.gz'
... WITH (compression = 'gzip');
COPY OK, 150 rows affected (... sec)
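The WITH clause accepts multiple options at once, so you can combine compression with the bulk_size option from earlier. For example:
cr> COPY users FROM '/tmp/best_practice_data/users.jsonl.gz'
... WITH (compression = 'gzip', bulk_size = 2000);
COPY OK, 150 rows affected (... sec)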
Split your tables into partitions¶
You can split your table into partitions to improve performance. Partitions can be created using the CREATE TABLE statement with a PARTITIONED BY clause to specify a partition column. For example:
cr> CREATE TABLE users (
... id INT primary key,
... name TEXT,
... day_joined TIMESTAMP primary key,
... bio TEXT INDEX using fulltext,
... address OBJECT (dynamic) AS (
... city TEXT,
... country TEXT
... )
... ) CLUSTERED INTO 6 shards
... PARTITIONED BY (day_joined)
... WITH (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)
Note
If a primary key was explicitly declared, the partition column has to be a component of the primary key.
The value of the partition column should be specified in the COPY FROM statement using the PARTITION clause:
cr> COPY users PARTITION (day_joined=1408312800)
... FROM '/tmp/best_practice_data/users_1408312800.jsonl';
COPY OK, 23 rows affected (... sec)
When you use the PARTITION clause, CrateDB doesn't need to resolve the partition for each imported row, which aids performance.
The corresponding JSON Lines file might look like this:
{"id": 1, "name": "foo", "bio": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", "address": {"city": "Dornbirn", "country": "Austria"}}
{"id": 2, "name": "bar", "bio": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", "address": {"city": "Berlin", "country": "Germany"}}
Here, notice that the partition column itself (day_joined) has been excluded from the JSON. You must omit this column if you use the PARTITION clause.
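If you want to verify which partitions your imports have created, you can query the information_schema.table_partitions table. A sketch (the partition_ident shown is a made-up placeholder; CrateDB generates these identifiers):
cr> SELECT partition_ident, "values"
... FROM information_schema.table_partitions
... WHERE table_name = 'users';
+--------------------+----------------------------+
| partition_ident    | values                     |
+--------------------+----------------------------+
| 04732d1g60qj2d9i60 | {"day_joined": 1408312800} |
+--------------------+----------------------------+
SELECT 1 row in set (... sec)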
Caution
You can import data into partitioned tables without using the PARTITION clause, as long as you specify the partition column value in the source file. However, we do not recommend this method, as doing so may degrade performance.
Disable table refresh for new shards¶
When importing data into a table that already has partitions, you can optimize the insert operation by disabling the refresh_interval for newly created partitions only, using the ALTER TABLE ONLY statement.
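A minimal sketch, assuming the partitioned users table from above; ALTER TABLE ONLY changes the setting on the table itself (and hence for partitions created afterwards) without touching existing partitions:
cr> ALTER TABLE ONLY users SET (refresh_interval = 0);
ALTER OK, -1 rows affected (... sec)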
Increase the number of shards¶
The ALTER TABLE statement can also be used to change the number of shards for newly created partitions, which may improve performance when you are handling significantly more data than before.
For example:
cr> ALTER TABLE users SET (number_of_shards = 12);
ALTER OK, -1 rows affected (... sec)
Caution
Configuring too many shards may degrade performance. See the prior section about shards for more information.