COPY FROM
Copy data from files into a table.
Synopsis

COPY table_ident [ PARTITION (partition_column = value [ , ... ]) ]
  FROM uri [ WITH ( option = value [, ...] ) ] [ RETURN SUMMARY ]

where option can be one of:

  bulk_size             integer
  shared                boolean
  num_readers           integer
  compression           string
  overwrite_duplicates  boolean
Description

COPY FROM copies data from a URI to the specified table as a raw data import.

The nodes in the cluster will attempt to read the files available at the URI and import the data.
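A minimal example, assuming a table named quotes already exists and a JSON file is present on a node's filesystem (the table name and file path are illustrative):

```sql
-- Import newline-delimited JSON from a local file into the "quotes" table.
-- Table name and path are hypothetical examples.
COPY quotes FROM 'file:///tmp/import_data/quotes.json';
```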
Supported Formats

CrateDB accepts both JSON and CSV inputs. The format is inferred from the file extension (.json or .csv respectively) if possible. The format can also be provided as an option (see WITH). If a format is not specified and the format cannot be inferred, the file will be processed as JSON.

Files must be UTF-8 encoded. Any keys in the object will be added as columns, regardless of the previously defined table. Empty lines are skipped.

JSON files must contain a single JSON object per line.
Example JSON data:
{"id": 1, "quote": "Don't panic"}
{"id": 2, "quote": "Ford, you're turning into a penguin. Stop it."}
CSV files must contain a header with comma-separated values, which will be added as columns.
Example CSV data:
id,quote
1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."
See also: Import and Export.
Type Casts and Constraints

CrateDB does not check if the column's data types match the types from the import file. It does not cast the types but will always import the data as it is in the source file. Furthermore, CrateDB will only check for primary key duplicates but not for other Column Constraints like NOT NULL.

For example, a WKT string cannot be imported into a column of geo_shape or geo_point type, since there is no implicit cast to the GeoJSON format.
Note

If the COPY FROM statement fails, the log output on the node will provide an error message. Any data imported up to that point has been written to the table and should be deleted before restarting the import.
URI

A string literal or array of string literals containing URIs. Each URI must be formatted according to the URI Scheme.

Note

If you are using Microsoft Windows, you must include the drive letter in the file URI. Consult the Windows documentation for more information.

If the URI scheme is missing, the value is assumed to be a file path and will be converted to a file:// URI implicitly.

For example:

/tmp folder/file.json

Is converted to:

'file:///tmp%20folder/file.json'
Supported Schemes

file

The given (absolute) path must point to files that are available on at least one node and readable by the CrateDB process (which runs as the user 'crate' by default).

By default each node will attempt to read the specified files. If the URI points to a shared folder (which other CrateDB nodes can also access), the shared option must be set to true in order to avoid importing duplicates.
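For instance, when the files live on a folder that is mounted on every node (an assumed NFS mount path), the shared option prevents each node from importing the same file again:

```sql
-- "/data/shared" is an assumed mount point visible to all nodes.
COPY quotes FROM 'file:///data/shared/quotes.json' WITH (shared = true);
```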
s3

Can be used to access buckets on the Amazon AWS S3 service:

s3://[<accesskey>:<secretkey>@]<bucketname>/<path>

If no credentials are set, the S3 client will operate in anonymous mode, see the AWS Java Documentation.

Using the 's3://' URI scheme sets the shared option implicitly.
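A sketch of an S3 import; the access key, secret key, bucket, and path below are placeholders:

```sql
-- Credentials, bucket, and path are placeholders.
-- Remember to URL encode '/', '+' or '=' in the secret key.
COPY quotes FROM 's3://MYACCESSKEY:MYSECRETKEY@mybucket/import_data/quotes.json';
```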
Note

A secretkey provided by Amazon Web Services can contain characters such as '/', '+' or '='. These characters must be URL encoded. For a detailed explanation, read the official AWS documentation.

Additionally, versions prior to 0.51.x use HTTP for connections to S3. Since 0.51.x these connections use the HTTPS protocol. Please make sure you update your firewall rules to allow outgoing connections on port 443.
http, https, and jar (Java URL protocols)

In addition to the schemes above, CrateDB supports all protocols supported by the URL implementation of its JVM (typically http, https, ftp, and jar). Please refer to the documentation of the JVM vendor for an accurate list of supported protocols.

These schemes do not support wildcard expansion.
Parameters

- table_ident
  The name (optionally schema-qualified) of an existing table where the data should be put.
- uri
  An expression which evaluates to a URI as defined in RFC 2396. The supported schemes are listed above. The last part of the path may also contain * wildcards to match multiple files.
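For example, a wildcard in the last path segment can match several files at once (the file names are illustrative):

```sql
-- Imports every matching JSON file in the folder.
COPY quotes FROM 'file:///tmp/import_data/*.json';
```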
Clauses

PARTITION

For partitioned tables this clause can be used to import data into the specified partition. This clause takes one or more partition columns, and for each column a value.

[ PARTITION ( partition_column = value [ , ... ] ) ]

- partition_column
  The name of the column by which the table is partitioned. All partition columns that were part of the PARTITIONED BY clause of the CREATE TABLE statement must be specified.
- value
  The column's value.

Note

Partitioned tables do not store the value of the partition column in each row; hence, every row will be imported into the specified partition regardless of the value provided for the partition columns.
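As an illustration, assuming a table partitioned by a date column named day (table and column names are hypothetical), an import into a single partition could look like this:

```sql
-- "parted_quotes" and its partition column "day" are hypothetical.
COPY parted_quotes PARTITION (day = '2019-01-01')
  FROM 'file:///tmp/import_data/quotes_2019_01_01.json';
```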
WITH

The optional WITH clause can specify options for the COPY FROM statement.

[ WITH ( option = value [, ...] ) ]
Options

bulk_size

CrateDB will process the lines it reads from the path in bulks. This option specifies the size of one batch. The provided value must be greater than 0; the default value is 10000.
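For example, to process the input in smaller batches (the table name and path are illustrative):

```sql
-- Insert in batches of 1000 records instead of the default 10000.
COPY quotes FROM 'file:///tmp/import_data/quotes.json' WITH (bulk_size = 1000);
```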
node_filters

A filter expression to select the nodes to run the read operation.

It's an object in the form of:

{
    name = '<node_name_regex>',
    id = '<node_id_regex>'
}

Only one of the keys is required.

The name regular expression is applied on the name of all execution nodes, whereas the id regex is applied on the node id.

If both keys are set, both regular expressions have to match for a node to be included.

If the shared option is false, a strict node filter might exclude nodes with access to the data, leading to a partial import.

To verify which nodes match the filter, run the statement with EXPLAIN.
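For example, to restrict the read operation to nodes whose names match an assumed naming pattern:

```sql
-- Only nodes whose name matches the regex will read the file.
-- The node name pattern is a hypothetical example.
COPY quotes FROM 'file:///tmp/import_data/quotes.json'
  WITH (node_filters = {name = 'data-node-.*'});
```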
num_readers

The number of nodes that will read the resources specified in the URI. Defaults to the number of nodes available in the cluster. If the option is set to a number greater than the number of available nodes, each node will still be used only once for the import. The value must be an integer greater than 0.

If shared is set to false, this option has to be used with caution. It might exclude the wrong nodes, causing COPY FROM to read no files or only a subset of the files.
compression

The default value is null. Set to gzip to read gzipped files.
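For example, to read a gzipped file (the file name is illustrative):

```sql
-- Read a gzip-compressed JSON file.
COPY quotes FROM 'file:///tmp/import_data/quotes.json.gz'
  WITH (compression = 'gzip');
```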
overwrite_duplicates

Default: false

COPY FROM by default won't overwrite rows if a document with the same primary key already exists. Set to true to overwrite duplicate rows.
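For example, to re-import a corrected file and replace existing rows with matching primary keys (table and path are illustrative):

```sql
-- Existing rows with the same primary key will be overwritten.
COPY quotes FROM 'file:///tmp/import_data/quotes.json'
  WITH (overwrite_duplicates = true);
```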
RETURN SUMMARY

By using the optional RETURN SUMMARY clause, a per-node result set will be returned containing information about possible failures and successfully inserted records.

[ RETURN SUMMARY ]
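For example (table and file names are illustrative):

```sql
-- Returns one row per node and URI with error and success counts.
COPY quotes FROM 'file:///tmp/import_data/quotes.json' RETURN SUMMARY;
```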
Schema

Column Name | Description | Return Type
---|---|---
node | Information about the node that has processed the URI resource. | OBJECT
node['id'] | The id of the node. | STRING
node['name'] | The name of the node. | STRING
uri | The URI the node has processed. | STRING
error_count | The total number of records which failed. A NULL value indicates a general URI reading error; the error will be listed inside the errors column. | LONG
success_count | The total number of records which were inserted. A NULL value indicates a general URI reading error; the error will be listed inside the errors column. | LONG
errors | Contains detailed information about all errors. | OBJECT
errors[ERROR_MSG] | Contains information about a type of an error. | OBJECT
errors[ERROR_MSG]['count'] | The number of records that failed with this error. | LONG
errors[ERROR_MSG]['line_numbers'] | The line numbers of the source URI where the error occurred. | ARRAY