COPY FROM¶
You can use the COPY FROM statement to copy data from a file into a table.
Synopsis¶
COPY table_ident
[ ( column_ident [, ...] ) ]
[ PARTITION (partition_column = value [ , ... ]) ]
FROM uri [ WITH ( option = value [, ...] ) ] [ RETURN SUMMARY ]
Description¶
The COPY FROM statement copies data from a URI to the specified table.
The nodes in the cluster will attempt to read the files available at the URI and import the data.
Here’s an example:
cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json';
COPY OK, 3 rows affected (... sec)
Note
The COPY statements use Overload Protection to ensure that other queries can still be executed. Adjust these settings during large imports if needed.
File formats¶
CrateDB accepts both JSON and CSV inputs. The format is inferred from the file extension (.json or .csv, respectively) if possible. The format can also be set explicitly as an option. If the format is not specified and cannot be inferred, the file is processed as JSON.
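The inference rule above can be sketched in Python. This is a client-side illustration of the documented behavior, not CrateDB's actual implementation:

```python
from pathlib import PurePosixPath

def infer_format(uri: str) -> str:
    """Illustrative sketch of the documented rule: a .csv extension
    selects CSV, .json selects JSON, and anything else (including no
    extension) falls back to JSON."""
    suffix = PurePosixPath(uri).suffix.lower()
    return "csv" if suffix == ".csv" else "json"

print(infer_format("file:///tmp/import_data/quotes.csv"))   # csv
print(infer_format("file:///tmp/import_data/quotes.json"))  # json
print(infer_format("file:///tmp/import_data/quotes.txt"))   # json
```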
JSON files must contain a single JSON object per line and all files must be UTF-8 encoded. Also, any empty lines are skipped.
Example JSON data:
{"id": 1, "quote": "Don't panic"}
{"id": 2, "quote": "Ford, you're turning into a penguin. Stop it."}
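A file in this shape can be produced with Python's standard json module. A minimal sketch; the commented-out output path is only an example:

```python
import json

quotes = [
    {"id": 1, "quote": "Don't panic"},
    {"id": 2, "quote": "Ford, you're turning into a penguin. Stop it."},
]

# COPY FROM expects one JSON object per line (no surrounding array).
ndjson = "\n".join(json.dumps(q, ensure_ascii=False) for q in quotes) + "\n"

# Write it UTF-8 encoded, e.g.:
# from pathlib import Path
# Path("/tmp/import_data/quotes.json").write_text(ndjson, encoding="utf-8")
print(ndjson)
```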
A CSV file may or may not contain a header. See CSV header option for further details.
Example CSV data:
id,quote
1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."
Example CSV data with no header:
1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."
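Python's csv module produces this quoting automatically, since fields containing the delimiter must be quoted. A minimal sketch:

```python
import csv
import io

rows = [
    (1, "Don't panic"),
    (2, "Ford, you're turning into a penguin. Stop it."),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "quote"])  # header row; omit when importing with header = false
writer.writerows(rows)            # fields containing commas are quoted automatically

print(buf.getvalue())
```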
See also: Importing data.
Data type checks¶
CrateDB does not check whether the columns' data types match the types in the import file. It does not cast the types but always imports the data as given in the source file. However, CrateDB does check all Column constraints.
For example, a WKT string cannot be imported into a column of geo_shape or geo_point type, since there is no implicit cast to the GeoJSON format.
Note
If the COPY FROM statement fails, the log output on the node will provide an error message. Any data imported up to that point has been written to the table and should be deleted before restarting the import.
Parameters¶
table_ident
The name (optionally schema-qualified) of an existing table where the data should be put.
column_ident
Used in the optional columns declaration, each column_ident is the name of a column in the table_ident table.
This currently only has an effect when using the CSV file format. See the header section for how it behaves.
uri
An expression or array of expressions. Each expression must evaluate to a string literal that is a well-formed URI.
URIs must use one of the supported URI schemes. CrateDB supports globbing for the file and s3 URI schemes.
Note
If the URI scheme is missing, CrateDB assumes the value is a pathname and will prepend the file:// URI scheme. So, for example, CrateDB converts /tmp/file.json to file:///tmp/file.json.
URI globbing¶
With the file and s3 URI schemes, you can use pathname globbing (i.e., * wildcards) with the COPY FROM statement to construct URIs that can match multiple directories and files.
Suppose you used file:///tmp/import_data/*/*.json
as the URI. This URI
would match all JSON files located in subdirectories of the
/tmp/import_data
directory.
So, for example, these files would match:
/tmp/import_data/foo/1.json
/tmp/import_data/bar/2.json
/tmp/import_data/1/boz.json
Caution
A file named /tmp/import_data/foo/.json
would also match the
file:///tmp/import_data/*/*.json
URI. The *
wildcard matches any
number of characters, including none.
However, these files would not match:
/tmp/import_data/1.json (too few subdirectories)
/tmp/import_data/foo/bar/2.json (too many subdirectories)
/tmp/import_data/1/boz.js (file extension mismatch)
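You can reproduce this matching behavior locally with pathlib, which approximates the globbing described above: * matches within a single path segment, including zero characters.

```python
from pathlib import PurePosixPath

pattern = "/tmp/import_data/*/*.json"

candidates = {
    "/tmp/import_data/foo/1.json": True,       # one subdirectory, .json extension
    "/tmp/import_data/foo/.json": True,        # * also matches zero characters
    "/tmp/import_data/1.json": False,          # too few subdirectories
    "/tmp/import_data/foo/bar/2.json": False,  # too many subdirectories
    "/tmp/import_data/1/boz.js": False,        # file extension mismatch
}

for path, expected in candidates.items():
    assert PurePosixPath(path).match(pattern) is expected
    print(path, "->", expected)
```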
URI schemes¶
CrateDB supports the following URI schemes:
file
¶
You can use the file://
scheme to specify an absolute path to one or more
files accessible via the local filesystem of one or more CrateDB nodes.
For example:
file:///path/to/dir
The files must be accessible on at least one node and the system user running
the crate
process must have read access to every file specified.
Additionally, only the crate
superuser is allowed to use the file://
scheme.
By default, every node will attempt to import every file. If the file is accessible on multiple nodes, you can set the shared option to true in order to avoid importing duplicates.
Use RETURN SUMMARY to get information about what actions were performed on each node.
Tip
If you are running CrateDB inside a container, the file must be inside the container. If you are using Docker, you may have to configure a Docker volume to accomplish this.
Tip
If you are using Microsoft Windows, you must include the drive letter in the file URI.
For example:
file://C:\/tmp/import_data/quotes.json
Consult the Windows documentation for more information.
s3
¶
You can use the s3://
scheme to access buckets on the Amazon Simple
Storage Service (Amazon S3).
For example:
s3://[<accesskey>:<secretkey>@][<host>:<port>/]<bucketname>/<path>
S3 compatible storage providers can be specified by the optional pair of host and port, which defaults to Amazon S3 if not provided.
Here is a more concrete example:
COPY t FROM 's3://accessKey:secretKey@s3.amazonaws.com:443/myBucket/key/a.json' with (protocol = 'https')
If no credentials are set, the S3 client will operate in anonymous mode. See the AWS Java documentation.
Using the s3:// scheme automatically sets the shared option to true.
Tip
A secretkey
provided by Amazon Web Services can contain characters such
as ‘/’, ‘+’ or ‘=’. These characters must be URL encoded. For a detailed
explanation read the official AWS documentation.
To escape a secret key, you can use a snippet like this:
sh$ python -c "from getpass import getpass; from urllib.parse import quote_plus; print(quote_plus(getpass('secret_key: ')))"
This will prompt for the secret key and print the encoded variant.
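Equivalently, in a script (the key below is a made-up placeholder, not a real credential):

```python
from urllib.parse import quote_plus

secret_key = "abc/def+ghi="  # illustrative placeholder only
encoded = quote_plus(secret_key)
print(encoded)  # abc%2Fdef%2Bghi%3D
```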
Additionally, versions prior to 0.51.x use HTTP for connections to S3. Since 0.51.x, these connections use the HTTPS protocol. Please make sure you update your firewall rules to allow outgoing connections on port 443.
az
¶
You can use the az:// scheme to access files on Azure Blob Storage.
The URI must look like az://<account>.<endpoint_suffix>/<container>/<blob_path>.
For example:
az://myaccount.blob.core.windows.net/my-container/dir1/dir2/file1.json
One of the authentication parameters (key or sas_token) must be provided in the WITH clause.
The protocol can be provided in the WITH clause; otherwise, https is used by default.
For example:
COPY source
FROM 'az://myaccount.blob.core.windows.net/my-container/dir1/dir2/file1.json'
WITH (
key = 'key'
)
Using the az:// scheme automatically sets the shared option to true.
Other schemes¶
In addition to the schemes above, CrateDB supports all protocols supported by
the URL implementation of its JVM (typically http
, https
, ftp
,
and jar
). Please refer to the documentation of the JVM vendor for an
accurate list of supported protocols.
Note
These schemes do not support wildcard expansion.
Clauses¶
The COPY FROM
statement supports the following
clauses:
PARTITION
¶
If the table is partitioned, the optional
PARTITION
clause can be used to import data into one partition exclusively.
[ PARTITION ( partition_column = value [ , ... ] ) ]
- partition_column:
One of the column names used for table partitioning
- value:
The respective column value.
All partition columns (specified by the
PARTITIONED BY clause) must be listed inside the
parentheses along with their respective values using the partition_column =
value
syntax (separated by commas).
Because each partition corresponds to a unique set of partition column row values, this clause uniquely identifies a single partition for import.
Tip
The SHOW CREATE TABLE statement will show you the complete list of partition columns specified by the PARTITIONED BY clause.
Caution
Partitioned tables do not store the row values for the partition columns, hence every row will be imported into the specified partition regardless of partition column values.
WITH
¶
You can use the optional WITH
clause to specify option values.
[ WITH ( option = value [, ...] ) ]
The WITH
clause supports the following options:
- bulk_size
- Type: integer
Default: 10000
Optional
CrateDB processes the lines it reads from the uri in batches. This option specifies the size of one batch. The provided value must be greater than 0.
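Conceptually, the batching works like this sketch (an illustration, not CrateDB's internal code):

```python
def batches(lines, bulk_size=10_000):
    """Yield successive chunks of at most bulk_size items."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == bulk_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch

print([len(b) for b in batches(range(25), bulk_size=10)])  # [10, 10, 5]
```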
- fail_fast
- Type: boolean
Default: false
Optional
A boolean value indicating whether the COPY FROM operation should abort early after an error. This is best effort and, due to the distributed execution, it may continue processing some records before it aborts.
- wait_for_completion
- Type: boolean
Default: true
Optional
A boolean value indicating whether COPY FROM should wait for the copy operation to complete. If set to false, the request returns at once and the copy operation runs in the background.
- node_filters
- Type: text
Optional
A filter expression to select the nodes that run the read operation.
It's an object in the form of:
{ name = '<node_name_regex>', id = '<node_id_regex>' }
Only one of the keys is required.
The name regular expression is applied to the name of all execution nodes, whereas the id regex is applied to the node id.
If both keys are set, both regular expressions have to match for a node to be included.
If the shared option is false, a strict node filter might exclude nodes with access to the data, leading to a partial import.
To verify which nodes match the filter, run the statement with EXPLAIN.
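As a rough client-side illustration of how such a filter selects nodes (the node metadata is made up, and whether CrateDB matches the regex fully or partially is an implementation detail; this sketch uses full matches):

```python
import re

# Hypothetical cluster metadata, for illustration only.
nodes = [
    {"name": "data-hot-1", "id": "n1"},
    {"name": "data-hot-2", "id": "n2"},
    {"name": "data-cold-1", "id": "n3"},
]

node_filters = {"name": "data-hot-.*"}  # only one of the keys is required

def matches(node: dict, filters: dict) -> bool:
    # If both name and id are given, both regexes must match.
    return all(re.fullmatch(regex, node[key]) for key, regex in filters.items())

selected = [n["name"] for n in nodes if matches(n, node_filters)]
print(selected)  # ['data-hot-1', 'data-hot-2']
```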
- num_readers
- Type: integer
Default: the number of nodes available in the cluster
Optional
The number of nodes that will read the resources specified in the URI. If the option is set to a number greater than the number of available nodes, each node is still used only once for the import. The value must be an integer greater than 0.
If shared is set to false, this option has to be used with caution. It might exclude the wrong nodes, causing COPY FROM to read no files or only a subset of the files.
- compression
- Type: text
Values: gzip
Default: by default, the input is not compressed
Optional
Defines whether and how the imported data is compressed.
- protocol
- Type: text
Values: http, https
Default: https
Optional
The protocol to use. Used for the s3 and az schemes only.
- overwrite_duplicates
- Type: boolean
Default: false
Optional
By default, COPY FROM won't overwrite rows if a document with the same primary key already exists. Set to true to overwrite duplicate rows.
- empty_string_as_null
- Type: boolean
Default: false
Optional
If set to true, the empty_string_as_null option enables the conversion of empty strings into NULL.
The option is only supported when using the CSV format; otherwise, it is ignored.
- delimiter
- Type: text
Default: ,
Optional
Specifies a single one-byte character that separates columns within each line of the file.
The option is only supported when using the CSV format; otherwise, it is ignored.
- format
- Type: text
Values: csv, json
Default: json
Optional
This option specifies the format of the input file. Available formats are csv and json. If a format is not specified and it cannot be guessed from the file extension, the file is processed as JSON.
- header
- Type: boolean
Default: true
Optional
Indicates whether the first line of a CSV file contains a header with the column names.
If set to false, the CSV must not contain column names in the first line; instead, the columns declared in the statement are used. If no columns are declared in the statement, it defaults to all columns present in the table in their CREATE TABLE declaration order.
If set to true, the first line in the CSV file must contain the column names. You can use the optional column declaration in addition to import only a subset of the data.
If the statement contains no column declarations, all fields in the CSV are read. If the CSV contains fields with no matching column in the table, the behavior depends on the column_policy table setting: if dynamic, new columns are implicitly added; if strict, the operation fails.
An example of using an input file with no header:
cr> COPY quotes FROM 'file:///tmp/import_data/quotes.csv' WITH (format = 'csv', header = false);
COPY OK, 3 rows affected (... sec)
- skip
- Type: integer
Default: 0
Optional
Setting this option to n skips the first n rows while copying.
Note
CrateDB by default expects a header in CSV files. If you're using the skip option to skip the header, you have to set header = false as well. See header.
- key
- Type: text
Optional
Used for the az scheme only. The Azure Storage Account Key.
Note
It must be provided if sas_token is not provided.
- sas_token
- Type: text
Optional
Used for the az scheme only. The Shared Access Signature (SAS) token used for authentication for the Azure Storage account. This can be used as an alternative to the Azure Storage Account Key.
The SAS token must have read, write, and list permissions for the container base path and all its contents. These permissions need to be granted for the blob service and apply to the resource types service, container, and object.
Note
It must be provided if key is not provided.
RETURN SUMMARY
¶
By using the optional RETURN SUMMARY
clause, a per-node result set will be
returned containing information about possible failures and successfully
inserted records.
[ RETURN SUMMARY ]
| Column Name | Description | Return Type |
|---|---|---|
| node | Information about the node that has processed the URI resource. | OBJECT |
| node['id'] | The id of the node. | TEXT |
| node['name'] | The name of the node. | TEXT |
| uri | The URI the node has processed. | TEXT |
| error_count | The total number of records which failed. A NULL value indicates a general URI reading error; the error will be listed inside the errors column. | BIGINT |
| success_count | The total number of records which were inserted. A NULL value indicates a general URI reading error; the error will be listed inside the errors column. | BIGINT |
| errors | Contains detailed information about all errors. Limited to at most 25 error messages. | OBJECT |
| errors[ERROR_MSG] | Contains information about a type of an error. | OBJECT |
| errors[ERROR_MSG]['count'] | The number of records that failed with this error. | BIGINT |
| errors[ERROR_MSG]['line_numbers'] | The line numbers of the source URI where the error occurred, limited to the first 50 errors, to avoid buffer pressure on clients. | ARRAY |