The CrateDB Python client library provides full access to the powerful blob storage capabilities of your CrateDB cluster.

Get a blob container

The first thing you will need to do is connect to CrateDB. Follow the instructions in the connection document for more detailed information.

For the sake of this example, we will do the following:

>>> from crate import client
>>> connection = client.connect("http://localhost:4200/")

This is a simple connection that connects to a CrateDB node running on the local host with the HTTP endpoint listening on port 4200 (the default).

To work with blobs in CrateDB, you must specifically create blob tables.
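
Blob tables are created with SQL rather than through the blob container. As a minimal sketch, assuming a table named my_blobs with default sharding and replication settings, a CREATE BLOB TABLE statement can be issued through a regular DB API cursor. The helper name create_blob_table is hypothetical and not part of the client:

```python
def create_blob_table(connection, table_name="my_blobs"):
    """Issue CREATE BLOB TABLE through a DB API cursor.

    Hypothetical helper, not part of the client. The table name is
    interpolated directly into the statement, so it must come from
    trusted code, never from user input.
    """
    statement = f"CREATE BLOB TABLE {table_name}"
    cursor = connection.cursor()
    try:
        cursor.execute(statement)
    finally:
        # Always release the cursor, even if the statement fails.
        cursor.close()
    return statement
```

With the connection from above, this would be called once as create_blob_table(connection) before requesting the blob container.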

The CrateDB Python client allows you to interact with these blob tables via a blob container, which you can create like this:

>>> blob_container = connection.get_blob_container('my_blobs')
>>> blob_container
<BlobContainer 'my_blobs'>

Here, we have created a BlobContainer for the my_blobs table, using the connection object.

Now we can start working with our blob container.

Working with the blob container

Upload blobs

The blob container can work with files or file-like objects, as long as they produce bytes when read.

What is a file-like object? Well, to put it simply, any object that provides a read() method.

The stream objects provided by the Python standard library io and tempfile modules are the most commonly used file-like objects.

The io.StringIO class is not suitable, as it produces Unicode strings when read. But you can easily encode a Unicode string and feed it to an io.BytesIO object.

Here’s a trivial example:

>>> import io
>>> bytestream = "An example sentence.".encode("utf8")
>>> file = io.BytesIO(bytestream)

This file can then be uploaded to the blob table using the put method:

>>> blob_container.put(file)
'6f10281ad07d4a35c6ec2f993e6376032b77181d'

Notice that this method computes and returns a SHA-1 digest of the content. The digest is required to save the blob to CrateDB.

If you have already computed the SHA-1 digest, or can compute it as part of an existing read, passing it to the put method may improve the performance of your application.
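
As a rough sketch of computing the digest as part of a read, the standard library hashlib module can be fed one chunk at a time, so the whole content never has to be held in memory. The helper name sha1_of_stream and the 128 KiB chunk size are choices made for this example, not part of the client:

```python
import hashlib
import io


def sha1_of_stream(stream, chunk_size=128 * 1024):
    """Return the SHA-1 hex digest of a binary stream, read in chunks."""
    hasher = hashlib.sha1()
    # read() returns b"" at the end of the stream, which ends the loop.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


data = "An example sentence.".encode("utf8")
stream = io.BytesIO(data)
digest = sha1_of_stream(stream)

# Rewind so that a later upload reads from the start of the stream.
stream.seek(0)
```

Reading the stream to compute the digest leaves its position at the end, which is why the sketch rewinds it before any upload.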

If you pass in a SHA-1 digest, it will not be recomputed:

>>> _ = file.seek(0)  # seek to the beginning before attempting to re-upload
>>> digest = "6f10281ad07d4a35c6ec2f993e6376032b77181d"
>>> blob_container.put(file, digest=digest)
False

Notice that the method returned False this time. If you specify a digest, the return value of the put method is a boolean indicating whether the object was written or not. In this instance, it was not written, because the digest is the same as an existing object.

Let’s make a new object:

>>> import hashlib
>>> bytestream = "Another example sentence.".encode("utf8")
>>> digest = hashlib.sha1(bytestream).hexdigest()
>>> another_file = io.BytesIO(bytestream)

And upload it:

>>> blob_container.put(another_file, digest)
True

The put method returns True, indicating that the object has been written to the blob container.
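
Because put returns a digest in one case and a boolean in the other, calling code may want a single, predictable result. The following is a minimal sketch of one way to do that, assuming the behavior described above. The helper put_bytes is hypothetical, and InMemoryContainer is only a stand-in that lets the example run without a CrateDB cluster; neither is part of the client:

```python
import hashlib
import io


class InMemoryContainer:
    """Stand-in that mimics the put/exists behavior described above."""

    def __init__(self):
        self._blobs = {}

    def put(self, f, digest=None):
        data = f.read()
        if digest is None:
            # No digest supplied: compute it and return it.
            digest = hashlib.sha1(data).hexdigest()
            self._blobs[digest] = data
            return digest
        if digest in self._blobs:
            # Digest supplied and already stored: nothing is written.
            return False
        self._blobs[digest] = data
        return True

    def exists(self, digest):
        return digest in self._blobs


def put_bytes(container, data):
    """Upload bytes and return a (digest, written) tuple."""
    digest = hashlib.sha1(data).hexdigest()
    written = container.put(io.BytesIO(data), digest=digest)
    return digest, written


container = InMemoryContainer()
first = put_bytes(container, b"Another example sentence.")
second = put_bytes(container, b"Another example sentence.")
```

Here the first call reports that the blob was written and the second reports that it was not, while both return the same digest. With a real cluster, the blob container from above would be passed in place of the stand-in.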

Retrieve blobs

To retrieve a blob, you need to know its digest.

Let’s use the digest variable we created before to check whether that object exists with the exists method:

>>> blob_container.exists(digest)
True

This method returns a boolean value. In this instance, True indicates that the blob we’re interested in is contained within the blob container.

You can retrieve the blob with the get method, like so:

>>> blob_generator = blob_container.get(digest)

Blobs are read in chunks. The default size of these chunks is 128 kilobytes, but this can be changed by supplying the desired chunk size to the get method, like so:

>>> res = blob_container.get(digest, 1024 * 128)

The get method returns a Python generator, meaning that you can call next(blob_generator) for each new chunk you want to read, until you encounter a StopIteration exception.

Instead of calling next() manually, the idiomatic way to iterate over a generator is like so:

>>> blob_content = b''
>>> for chunk in blob_container.get(digest):
...     blob_content += chunk
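
For large blobs, building one bytes object with += may copy the accumulated data repeatedly. As a sketch of an alternative, each chunk can be written to a binary file-like object as it arrives. The function fake_get is a hypothetical stand-in for blob_container.get(digest) so that the example runs without a cluster, and write_chunks is a helper name chosen for this example:

```python
import io


def fake_get(payload, chunk_size=4):
    """Hypothetical stand-in for blob_container.get(): yields byte chunks."""
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]


def write_chunks(chunks, target):
    """Write each chunk to a binary file-like object; return bytes written."""
    total = 0
    for chunk in chunks:
        target.write(chunk)
        total += len(chunk)
    return total


buffer = io.BytesIO()
written = write_chunks(fake_get(b"Another example sentence."), buffer)
```

The same helper should work with a file opened in "wb" mode as the target and the generator returned by the get method as the source.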

Delete blobs

You can delete a blob with the delete method and the blob digest, like so:

>>> blob_container.delete(digest)
True

This method returns a boolean status value. In this instance, True indicates that the blob was deleted.

We can verify that, like so:

>>> blob_container.exists(digest)
False