Data Types

This section of the documentation shows how to work with special data types from the CrateDB SQLAlchemy dialect. Currently, these are:

  • Container types ObjectType and ObjectArray.

  • Geospatial types Geopoint and Geoshape.

  • Vector data type FloatVector.

Table of Contents

Introduction

Import the relevant symbols:

>>> import sqlalchemy as sa
>>> from datetime import datetime
>>> from geojson import Point, Polygon
>>> from sqlalchemy import delete, func, text
>>> from sqlalchemy.orm import sessionmaker
>>> from sqlalchemy.sql import operators
>>> try:
...     from sqlalchemy.orm import declarative_base
... except ImportError:
...     from sqlalchemy.ext.declarative import declarative_base
>>> from uuid import uuid4

Establish a connection to the database, see also Engine Configuration and Connect:

>>> engine = sa.create_engine(f"crate://{crate_host}")
>>> connection = engine.connect()

Create an SQLAlchemy Session:

>>> session = sessionmaker(bind=engine)()
>>> Base = declarative_base()

Introduction to container types

In a document oriented database, it is a common pattern to store objects within a single field. For such cases, the CrateDB SQLAlchemy dialect provides the ObjectType and ObjectArray types.

The ObjectType type effectively implements a dictionary- or map-like type. The ObjectArray type maps to a Python list of dictionaries.

For exercising those features, let’s define a schema using SQLAlchemy’s Declarative Mapping:

>>> from sqlalchemy_cratedb import ObjectType, ObjectArray
>>> def gen_key():
...     return str(uuid4())
>>> class Character(Base):
...     __tablename__ = 'characters'
...     id = sa.Column(sa.String, primary_key=True, default=gen_key)
...     name = sa.Column(sa.String)
...     quote = sa.Column(sa.String)
...     details = sa.Column(ObjectType)
...     more_details = sa.Column(ObjectArray)

In CrateDB’s SQL dialect, those container types map to OBJECT and ARRAY.

ObjectType

Let’s add two records which have additional items within the details field. Note that item keys have not been defined in the DDL schema, effectively demonstrating the DYNAMIC column policy.

>>> arthur = Character(name='Arthur Dent')
>>> arthur.details = {}
>>> arthur.details['gender'] = 'male'
>>> arthur.details['species'] = 'human'
>>> session.add(arthur)
>>> trillian = Character(name='Tricia McMillan')
>>> trillian.details = {}
>>> trillian.quote = "We're on a space ship Arthur. In space."
>>> trillian.details['gender'] = 'female'
>>> trillian.details['species'] = 'human'
>>> trillian.details['female_only_attribute'] = 1
>>> session.add(trillian)
>>> session.commit()

After INSERT statements are submitted to the database, the newly inserted records aren’t immediately available for retrieval because the index is only updated periodically (default: each second). In order to synchronize that, refresh the table:

>>> _ = connection.execute(text("REFRESH TABLE characters"))

A subsequent select query will see all the records:

>>> query = session.query(Character).order_by(Character.name)
>>> [(c.name, c.details['gender']) for c in query]
[('Arthur Dent', 'male'), ('Tricia McMillan', 'female')]

It is also possible to just select a part of the document, even inside the ObjectType type:

>>> sorted(session.query(Character.details['gender']).all())
[('female',), ('male',)]

In addition, filtering on the attributes inside the details column is also possible:

>>> query = session.query(Character.name)
>>> query.filter(Character.details['gender'] == 'male').all()
[('Arthur Dent',)]

Update dictionary

The SQLAlchemy CrateDB dialect supports change tracking deep down the nested levels of a ObjectType type field. For example, the following query will only update the gender key. The species key which is on the same level will be left untouched.

>>> char = session.query(Character).filter_by(name='Arthur Dent').one()
>>> char.details['gender'] = 'manly man'
>>> session.commit()
>>> session.refresh(char)
>>> char.details['gender']
'manly man'
>>> char.details['species']
'human'

Update nested dictionary

>>> char_nested = Character(id='1234id')
>>> char_nested.details = {"name": {"first": "Arthur", "last": "Dent"}}
>>> session.add(char_nested)
>>> session.commit()
>>> char_nested = session.query(Character).filter_by(id='1234id').one()
>>> char_nested.details['name']['first'] = 'Trillian'
>>> char_nested.details['size'] = 45
>>> session.commit()

Refresh and query “characters” table:

>>> _ = connection.execute(text("REFRESH TABLE characters"))
>>> session.refresh(char_nested)
>>> char_nested = session.query(Character).filter_by(id='1234id').one()
>>> pprint(char_nested.details)
{'name': {'first': 'Trillian', 'last': 'Dent'}, 'size': 45}

ObjectArray

Note that opposed to the ObjectType type, the ObjectArray type isn’t smart and doesn’t have intelligent change tracking. Therefore, the generated UPDATE statement will affect the whole list:

>>> char.more_details = [{'foo': 1, 'bar': 10}, {'foo': 2}]
>>> session.commit()
>>> char.more_details.append({'foo': 3})
>>> session.commit()

This will generate an UPDATE statement which looks roughly like this:

"UPDATE characters SET more_details = ? ...", ([{'foo': 1, 'bar': 10}, {'foo': 2}, {'foo': 3}],)

To run queries against fields of ObjectArray types, use the .any(value, operator=operators.eq) method on a subscript, because accessing fields of object arrays (e.g. Character.more_details['foo']) returns an array of the field type.

Only one of the objects inside the array has to match in order for the result to be returned:

>>> query = session.query(Character.name)
>>> query.filter(Character.more_details['foo'].any(1, operator=operators.eq)).all()
[('Arthur Dent',)]

Querying a field of an object array will result in an array of all values of that field of all objects in that object array:

>>> query = session.query(Character.more_details['foo']).order_by(Character.name)
>>> query.all()
[([1, 2, 3],), (None,), (None,)]

Geospatial types

CrateDB’s geospatial types, such as GEO_POINT and GEO_SHAPE, can also be used within an SQLAlchemy declarative schema:

>>> from sqlalchemy_cratedb import Geopoint, Geoshape
>>> class City(Base):
...    __tablename__ = 'cities'
...    name = sa.Column(sa.String, primary_key=True)
...    coordinate = sa.Column(Geopoint)
...    area = sa.Column(Geoshape)

One way of inserting these types is using the geojson library, to create points or shapes:

>>> area = Polygon(
...     [
...         [
...             (139.806, 35.515),
...             (139.919, 35.703),
...             (139.768, 35.817),
...             (139.575, 35.760),
...             (139.584, 35.619),
...             (139.806, 35.515),
...         ]
...     ]
... )
>>> point = Point(coordinates=(139.76, 35.68))

These two objects can then be added to an SQLAlchemy model and added to the session:

>>> tokyo = City(coordinate=point, area=area, name='Tokyo')
>>> session.add(tokyo)
>>> session.commit()
>>> _ = connection.execute(text("REFRESH TABLE cities"))

When reading them back, they are retrieved as the corresponding geojson objects:

>>> query = session.query(City.name, City.coordinate, City.area)
>>> query.all()
 [('Tokyo', (139.75999999791384, 35.67999996710569), {"coordinates": [[[139.806, 35.515], [139.919, 35.703], [139.768, 35.817], [139.575, 35.76], [139.584, 35.619], [139.806, 35.515]]], "type": "Polygon"})]

Vector type

CrateDB’s vector data type, FLOAT_VECTOR, allows to store dense vectors of float values of fixed length.

>>> from sqlalchemy_cratedb import FloatVector, knn_match
>>> class SearchIndex(Base):
...    __tablename__ = 'search'
...    name = sa.Column(sa.String, primary_key=True)
...    embedding = sa.Column(FloatVector(3))

Create an entity and store it into the database. float_vector values can be defined by using arrays of floating point numbers.

>>> foo_item = SearchIndex(name="foo", embedding=[42.42, 43.43, 44.44])
>>> session.add(foo_item)
>>> session.commit()
>>> _ = connection.execute(sa.text("REFRESH TABLE search"))

When reading it back, the FLOAT_VECTOR value will be returned as a NumPy array.

>>> query = session.query(SearchIndex.name, SearchIndex.embedding)
>>> query.all()
[('foo', array([42.42, 43.43, 44.44], dtype=float32))]

In order to apply search, i.e. to match embeddings against each other, use the knn_match(float_vector, float_vector, int) function like this.

>>> query = session.query(SearchIndex.name) \
...                .filter(knn_match(SearchIndex.embedding, [42.42, 43.43, 41.41], 3))
>>> query.all()
[('foo',)]