Insert Methods¶
CrateDB supports multiple ways to insert data.
Some insert methods can be faster than others, depending on your setup. Choosing the best insert method is an easy way to improve insert performance.
Table of contents
Statement types¶
The three types of insert statements are:
Each type of statement can be issued using any of the three types of client approaches, which we cover in the next section.
Single inserts¶
Single inserts are the most basic sort of insert statement, and look like this:
cr> INSERT INTO my_table (column_a) VALUES ("value 1");
INSERT OK, 1 row affected (... sec)
Single inserts are typically very fast with CrateDB. A small cluster can easily handle several thousand inserts per second.
However, it’s important to note that:
Every insert is first applied to the primary shard
After the primary shard has been updated, the insert is then individually communicated in parallel to every configured replica shard
CrateDB will not return a response until all replica shards have been updated
The overhead for each insert (query parsing, planning, and execution, and potentially network connection overhead too) starts to add up for very heavy insert workloads.
In addition, lots of internal traffic will congest your network, which will slow down any network-based cluster operations (i.e. inserts, distributed reads, replication, and cluster management).
UNNEST
¶
The UNNEST function produce rows, like so:
cr> SELECT *
... FROM UNNEST(
... [1, 2, 3],
... ['Arthur', 'Trillian', 'Marvin']);
+------+----------+
| col1 | col2 |
+------+----------+
| 1 | Arthur |
| 2 | Trillian |
| 3 | Marvin |
+------+----------+
SELECT 3 rows in set (... sec)
Combine UNNEST
with INSERT
to insert multiple rows at once:
cr> INSERT INTO my_table (id, name)
... (SELECT *
... FROM UNNEST(
... [1, 2, 3],
... ['Arthur', 'Trillian', 'Marvin']));
INSERT OK, 3 rows affected (... sec)
You should see a dramatic improvement in performance over single inserts.
Specifically, the advantages are:
Significantly less internal network traffic
The query only needs to be parsed, planned, and executed once
If translog.durability is set to
REQUEST
(the default), an insert usingUNNEST
flushes the disk once for every shard written to
If your client supports query string parameter substitution, you can use the
UNNEST
method with static prepared statements.
For example, using the CrateDB Python client, the following is possible:
client.execute("""
INSERT INTO my_table (id, name)
(SELECT *
FROM UNNEST(?, ?))
""", ([1, 2, 3], ["Arthur", "Trillian", "Marvin"]))
Here, you can vary the number of rows being inserted without having to change the prepared statement.
Warning
When inserting using UNNEST
, CrateDB may drop rows that produce errors
without returning an error message. This happens when the SELECT
using
UNNEST
affects rows with invalid column names, or with data types that
are not internally consistent. This behavior can produce inconsistencies
and unexpected results. Refer to the UNNEST reference documentation for
more detail.
Multiple value expressions¶
You can insert multiple rows with multiple value expressions, like so:
cr> INSERT INTO my_table (id, name)
... VALUES (1, 'Arthur'),
... (2, 'Trillian'),
... (2, 'Marvin');
INSERT OK, 3 rows affected (... sec)
This method of doing bulk inserts is usually slower than the UNNEST
method,
because parsing is more expensive. The query looks nicer for humans though.
The only problem is that the structure of the insert statement is variable on the number of rows to insert. If you are inserting a variable number of rows, you have to prepare the SQL statement using some form of string concatenation each time.
Query string parameter substitution is recommended over string concatenation,
and so the UNNEST
method is recommended over the multiple value expressions
method.
Client approaches¶
The three client approaches for doing inserts are:
Each client approach can be used to insert any type of insert statement.
Standard querying¶
The standard way of issuing insert statements executes one statement at a time and does not make use of Bulk operations or any special Prepared statements client feature.
For example, using the CrateDB Python client, here’s a single insert:
client.execute("INSERT INTO my_table (column_a) VALUES (?)", ["value 1"])
Bulk operations¶
You can use the bulk operations feature of the CrateDB HTTP endpoint to perform many inserts in a single operation.
The advantages are the same as using the UNNEST method:
Significantly less internal network traffic than executing each insert statement individually
Even though you’re executing multiple insert statements, the bulk query only needs to be parsed, planned, and executed once
If translog.durability is set to
REQUEST
(the default), a bulk insert only flushes the disk once for every shard written to
Because the advantages are the same as using the UNNEST
method, you
typically will not see a performance improvement by combining bulk operations
with UNNEST
statements or statements with multiple value expressions.
Bulk operations are typically done with single insert statements as an alternative to the UNNEST
method.
See also
Prepared statements¶
Some clients offer a prepared statements feature. Prepared statements are parsed by CrateDB and can then be executed any number of times without having to re-parse.
This functionality is often presented as batch execution. The JDBC client, for instance, provides the addBatch and executeBatch methods.
For example:
PreparedStatement preparedStatement = connection.prepareStatement(
"INSERT INTO my_table (id, first_name) VALUES (?, ?)");
preparedStatement.setString(1, "Arthur");
preparedStatement.addBatch();
preparedStatement.setString(1, "Trillian");
preparedStatement.addBatch();
preparedStatement.setString(1, "Marvin");
preparedStatement.addBatch();
int[] results = preparedStatement.executeBatch();
In addition to reducing parsing overhead, prepared statement execution requests use the binary protocol, contain almost no headers, and are executed over an already established connection.
Typically, prepared statements are used single insert statements.
Prepared statements with single inserts will usually perform better than standard querying with single inserts, and should be comparable to standard querying with both the UNNEST method and multiple value expressions.
Testing¶
Follow the basic inserts performance testing procedure.
To test bulk operations, you should:
Configure the setup you would like to test
Run a number of different tests against that setup, using different
--bulk-size
settingsEvaluate your throughput results (perhaps by plotting your results on a graph so that you can see the response curve)
Try out different setups and re-run the test.
Please note that INSERT INTO
statements using a query, and the COPY FROM
statement, are using overload protection to ensure performance of other queries
in parallel. Refer to the Overload Protection
documentation on how to modify these parameters.
At the end of this process, you will have a better understanding of the throughput of your cluster with different setups and under different loads.