Features Overview¶
Table of contents
Introduction¶
SQLAlchemy is the most popular Object-Relational Mapping (ORM) library for Python.
The sqlalchemy-cratedb
package provides the SQLAlchemy dialect for CrateDB. It is registered at
installation time, and can be used without further configuration.
The CrateDB SQLAlchemy dialect is validated to work with SQLAlchemy versions
1.3
, 1.4
, and 2.0
.
See also
For general help using SQLAlchemy, consult the SQLAlchemy tutorial or the SQLAlchemy library.
Supplementary information about the CrateDB SQLAlchemy dialect can be found in the data types appendix.
Code examples for using the CrateDB SQLAlchemy dialect can be found at Examples.
Connecting¶
Database URLs¶
In an SQLAlchemy context, database addresses are represented by Uniform Resource Locators (URL) called Database URLs.
The simplest database URL for CrateDB looks like this:
crate://<HOST>/[?option=value]
Here, <HOST>
is the node host string. After the host, additional query
parameters can be specified to adjust some connection settings.
A host string looks like this:
[<USERNAME>:<PASSWORD>@]<HOST_ADDR>:<PORT>
Here, <HOST_ADDR>
is the hostname or IP address of the CrateDB node and
<PORT>
is a valid psql.port number.
When authentication is needed, the credentials can be optionally supplied using
<USERNAME>:<PASSWORD>@
. For connecting to an SSL-secured HTTP endpoint, you
can add the query parameter ?ssl=true
to the database URI.
Example database URIs:
crate://localhost:4200
crate://crate-1.vm.example.com:4200
crate://username:password@crate-2.vm.example.com:4200/?ssl=true
crate://198.51.100.1:4200
Tip
If <HOST>
is blank (i.e. the database URI is just crate://
), then
localhost:4200
will be assumed.
Getting a connection¶
Create an engine¶
You can connect to CrateDB using the create_engine
method. This method
takes a database URL.
Import the sa
module, like so:
>>> import sqlalchemy as sa
To connect to localhost:4200
, you can do this:
>>> engine = sa.create_engine('crate://')
To connect to crate-1.vm.example.com:4200
, you would do this:
>>> engine = sa.create_engine('crate://crate-1.vm.example.com:4200')
If your CrateDB cluster has multiple nodes, however, we recommend that you
configure all of them. You can do that by specifying the crate://
database
URL and passing in a list of host strings passed using
the connect_args
argument, like so:
>>> engine = sa.create_engine('crate://', connect_args={
... 'servers': ['198.51.100.1:4200', '198.51.100.2:4200']
... })
When you do this, the Database API layer will use its round-robin implementation.
The client validates SSL server certificates
by default. For further adjusting this behaviour, SSL verification options can
be passed in by using the connect_args
dictionary.
For example, use ca_cert
for providing a path to the CA certificate used
for signing the server certificate:
>>> engine = sa.create_engine(
... 'crate://',
... connect_args={
... 'servers': ['198.51.100.1:4200', '198.51.100.2:4200'],
... 'ca_cert': '<PATH_TO_CA_CERT>',
... }
... )
In order to disable SSL verification, use verify_ssl_cert = False
, like:
>>> engine = sa.create_engine(
... 'crate://',
... connect_args={
... 'servers': ['198.51.100.1:4200', '198.51.100.2:4200'],
... 'verify_ssl_cert': False,
... }
... )
Get a session¶
Once you have an CrateDB engine
set up, you can create and use an SQLAlchemy
Session
object to execute queries:
>>> from sqlalchemy.orm import sessionmaker
>>> Session = sessionmaker(bind=engine)
>>> session = Session()
See also
SQLAlchemy has more documentation about this topic on Session Basics.
Connecting to CrateDB Cloud¶
Connecting to CrateDB Cloud works like this. Please note the ?ssl=true
query parameter at the end of the database URI.
>>> import sqlalchemy as sa
>>> dburi = "crate://admin:<PASSWORD>@example.aks1.westeurope.azure.cratedb.net:4200?ssl=true"
>>> engine = sa.create_engine(dburi, echo=True)
Tables¶
Table definition¶
Here is an example SQLAlchemy table definition using the declarative system:
>>> from sqlalchemy.ext import declarative
>>> from uuid import uuid4
>>> from sqlalchemy_cratedb import ObjectType, ObjectArray
>>> def gen_key():
... return str(uuid4())
>>> Base = declarative.declarative_base(bind=engine)
>>> class Character(Base):
...
... __tablename__ = 'characters'
... __table_args__ = {
... 'crate_number_of_shards': 3
... }
...
... id = sa.Column(sa.String, primary_key=True, default=gen_key)
... name = sa.Column(sa.String, crate_index=False)
... name_normalized = sa.Column(sa.String, sa.Computed("lower(name)"))
... quote = sa.Column(sa.String, nullable=False)
... details = sa.Column(ObjectType)
... more_details = sa.Column(ObjectArray)
... name_ft = sa.Column(sa.String)
... quote_ft = sa.Column(sa.String)
... even_more_details = sa.Column(sa.String, crate_columnstore=False)
... created_at = sa.Column(sa.DateTime, server_default=sa.func.now())
...
... __mapper_args__ = {
... 'exclude_properties': ['name_ft', 'quote_ft']
... }
In this example, we:
Define a
gen_key
function that producesUUIDs
Set up a
Base
class for the tableCreate the
Characters
class for thecharacters
tableUse the
gen_key
function to provide a default value for theid
column (which is also the primary key)Use standard SQLAlchemy types for the
id
,name
, andquote
columnsUse
nullable=False
to define aNOT NULL
constraintDisable indexing of the
name
column usingcrate_index=False
Define a computed column
name_normalized
(based onname
) that translates into a generated columnUse the ObjectType extension type for the
details
columnUse the ObjectArray extension type for the
more_details
columnSet up the
name_ft
andquote_ft
fulltext indexes, but exclude them from the mapping (so SQLAlchemy doesn’t try to update them as if they were columns)Disable the columnstore of the
even_more_details
column usingcrate_columnstore=False
Add a
created_at
column whose default value is set by CrateDB’snow()
function.
Tip
This example table is used throughout the rest of this document.
See also
The SQLAlchemy documentation has more information about Describing Databases with MetaData.
Additional __table_args__
¶
The example also shows the optional usage of __table_args__
to configure
table-wide attributes. The following attributes can optionally be configured:
crate_number_of_shards
: The number of primary shards the table will be split intocrate_clustered_by
: The routing column to use for shardingcrate_number_of_replicas
: The number of replicas to allocate for each primary shardcrate_partitioned_by
: One or more columns to use as a partition key
See also
The CREATE TABLE documentation contains more information on each of the attributes.
_id
as primary key¶
As with version 4.2 CrateDB supports the RETURNING
clause, which makes it
possible to use the _id
column as fetched value for the PRIMARY KEY
constraint, since the SQLAlchemy ORM always requires a primary key.
A table schema like this
CREATE TABLE "doc"."logs" (
"ts" TIMESTAMP WITH TIME ZONE NOT NULL,
"level" TEXT,
"message" TEXT
)
would translate into the following declarative model:
>>> from sqlalchemy.schema import FetchedValue
>>> class Log(Base):
...
... __tablename__ = 'logs'
... __mapper_args__ = {
... 'exclude_properties': ['id']
... }
...
... id = sa.Column("_id", sa.String, server_default=FetchedValue(), primary_key=True)
... ts = sa.Column(sa.DateTime, server_default=sa.func.current_timestamp())
... level = sa.Column(sa.String)
... message = sa.Column(sa.String)
>>> log = Log(level="info", message="Hello World")
>>> session.add(log)
>>> session.commit()
>>> log.id
...
Auto-generated identifiers¶
CrateDB does not provide traditional sequences or SERIAL
data type support,
which enable automatically assigning incremental values when inserting records.
However, it offers server-side support by providing an SQL function to generate
random identifiers of STRING
type, and client-side support for generating
INTEGER
-based identifiers, when using the SQLAlchemy dialect.
gen_random_text_uuid
¶
CrateDB 4.5.0 added the gen_random_text_uuid() scalar function, which can also be used within an SQL DDL statement, in order to automatically assign random identifiers to newly inserted records on the server side.
In this spirit, it is suitable to be used as a PRIMARY KEY
constraint for SQLAlchemy.
It works on SQLAlchemy-defined columns of type sa.String
.
A table schema like this
CREATE TABLE "doc"."items" (
"id" STRING DEFAULT gen_random_text_uuid() NOT NULL PRIMARY KEY,
"name" STRING
)
would translate into the following declarative model:
>>> class Item(Base):
...
... __tablename__ = 'items'
...
... id = sa.Column("id", sa.String, server_default=func.gen_random_text_uuid(), primary_key=True)
... name = sa.Column("name", sa.String)
>>> item = Item(name="Foobar")
>>> session.add(item)
>>> session.commit()
>>> item.id
...
Timestamp-based Autoincrement¶
By using SQLAlchemy’s sa.func.now()
, you can assign automatically generated
identifiers to SQLAlchemy columns of types sa.BigInteger
, sa.DateTime
,
and sa.String
.
This emulates autoincrement / sequential ID behavior for designated columns, based on assigning timestamps on record insertion.
>>> class Item(Base):
... id = sa.Column("id", sa.BigInteger, default=func.now(), primary_key=True)
... name = sa.Column("name", sa.String)
>>> item = Item(name="Foobar")
>>> session.add(item)
>>> session.commit()
>>> item.id
...
There is a support utility which emulates autoincrement / sequential ID behavior for designated columns, based on assigning timestamps on record insertion. See Synthetic Autoincrement using Timestamps.
Extension types¶
In the example SQLAlchemy table definition above, we are making use of the two extension data types that the CrateDB SQLAlchemy dialect provides.
See also
The appendix has a full data types reference.
ObjectType
¶
Objects are a common, and useful, data type when using CrateDB, so the CrateDB
SQLAlchemy dialect provides a custom Object
type extension for working with
these values.
Here’s how you use the SQLAlchemy Session to insert two records:
>>> # use the crate engine from earlier examples
>>> Session = sessionmaker(bind=crate)
>>> session = Session()
>>> arthur = Character(name='Arthur Dent')
>>> arthur.details = {}
>>> arthur.details['gender'] = 'male'
>>> arthur.details['species'] = 'human'
>>> session.add(arthur)
>>> trillian = Character(name='Tricia McMillan')
>>> trillian.details = {}
>>> trillian.quote = "We're on a space ship Arthur. In space."
>>> trillian.details['gender'] = 'female'
>>> trillian.details['species'] = 'human'
>>> trillian.details['female_only_attribute'] = 1
>>> session.add(trillian)
>>> session.commit()
Note
The information we supply via the details
column isn’t defined in the
original SQLAlchemy table definition schema.
These details can be specified as object column policy when you create
the column in CrateDB, you can either use the STRICT column policy, or the DYNAMIC column
policy.
Note
Behind the scenes, if you update an ObjectType
property, and commit
that
change, the UPDATE statement sent
to CrateDB will only include the data necessary to update the changed
sub-columns.
ObjectArray
¶
In addition to the ObjectType type, the CrateDB SQLAlchemy dialect also provides
an ObjectArray
type, which is structured as a list
of
dictionaries
.
Here’s how you might set the value of an ObjectArray
column:
>>> arthur.more_details = [{'foo': 1, 'bar': 10}, {'foo': 2}]
>>> session.commit()
If you append an object, like this:
>>> arthur.more_details.append({'foo': 3})
>>> session.commit()
The resulting object will look like this:
>>> arthur.more_details
[{'foo': 1, 'bar': 10}, {'foo': 2}, {'foo': 3}]
Caution
Behind the scenes, if you update an ObjectArray
, and commit
that
change, the UPDATE statement
sent to CrateDB will include all of the ObjectArray
data.
Geopoint
and Geoshape
¶
The CrateDB SQLAlchemy dialect provides two geospatial types:
Geopoint
, which represents a longitude and latitude coordinateGeoshape
, which is used to store geometric GeoJSON geometry objects
To use these types, you can create columns, like so:
>>> from sqlalchemy_cratedb import Geopoint, Geoshape
>>> class City(Base):
...
... __tablename__ = 'cities'
... name = sa.Column(sa.String, primary_key=True)
... coordinate = sa.Column(Geopoint)
... area = sa.Column(Geoshape)
A geopoint can be created in multiple ways. Firstly, you can define it as a
tuple
of (longitude, latitude)
:
>>> point = (139.76, 35.68)
Secondly, you can define it as a geojson Point
object:
>>> from geojson import Point
>>> point = Point(coordinates=(139.76, 35.68))
To create a geoshape, you can use a geojson shape object, such as a Polygon
:
>>> from geojson import Point, Polygon
>>> area = Polygon(
... [
... [
... (139.806, 35.515),
... (139.919, 35.703),
... (139.768, 35.817),
... (139.575, 35.760),
... (139.584, 35.619),
... (139.806, 35.515),
... ]
... ]
... )
You can then set the values of the Geopoint
and Geoshape
columns:
>>> tokyo = City(name="Tokyo", coordinate=point, area=area)
>>> session.add(tokyo)
>>> session.commit()
Querying¶
When the commit
method is called, two INSERT
statements are sent to
CrateDB. However, the newly inserted rows aren’t immediately available for
querying because the table index is only updated periodically (one second, by
default, which is a short time for me and you, but a long time for your code).
You can request a table refresh to update the index manually:
>>> connection = engine.connect()
>>> _ = connection.execute(text("REFRESH TABLE characters"))
Note
Newly inserted rows can still be queried immediately if a lookup by primary key is done.
Here’s what a regular select might look like:
>>> query = session.query(Character).order_by(Character.name)
>>> [(c.name, c.details['gender']) for c in query]
[('Arthur Dent', 'male'), ('Tricia McMillan', 'female')]
You can also select a portion of each record, and this even works inside ObjectType columns:
>>> sorted(session.query(Character.details['gender']).all())
[('female',), ('male',)]
You can also filter on attributes inside the ObjectType column:
>>> query = session.query(Character.name)
>>> query.filter(Character.details['gender'] == 'male').all()
[('Arthur Dent',)]
To filter on an ObjectArray, you have to do something like this:
>>> from sqlalchemy.sql import operators
>>> query = session.query(Character.name)
>>> query.filter(Character.more_details['foo'].any(1, operator=operators.eq)).all()
[(u'Arthur Dent',)]
Here, we’re using SQLAlchemy’s any
method along with Python’s operator.eq()
function, in order to
match the value 1
against the key foo
of any dictionary in the
more_details
list.
Only one of the keys has to match for the row to be returned.
This works, because ObjectArray
keys return a list of all values for that
key, like so:
>>> arthur.more_details['foo']
[1, 2, 3]
Querying a key of an ObjectArray
column will return all values for that key
for all matching rows:
>>> query = session.query(Character.more_details['foo']).order_by(Character.name)
>>> query.all()
[([1, 2, 3],), (None,)]
Aggregate functions¶
SQLAlchemy supports different ways to count result rows. However, because CrateDB doesn’t support subqueries, counts must be written in one of the following two ways.
This counts the number of character records by counting the number of id
values in the table:
>>> session.query(sa.func.count(Character.id)).scalar()
2
Note
If you’re doing it like this, the column you select must be the primary key.
And this counts the number of character records by selecting all columns, and then counting the number of rows:
>>> session.query(sa.func.count('*')).select_from(Character).scalar()
2
You can layer in calls to group_by
and order_by
when you use one of
these methods, like so:
>>> session.query(sa.func.count(Character.id), Character.name) \
... .group_by(Character.name) \
... .order_by(sa.desc(sa.func.count(Character.id))) \
... .order_by(Character.name).all()
[(1, u'Arthur Dent'), (1, u'Tricia McMillan')]
Fulltext search¶
Matching¶
Fulltext Search in CrateDB is done with the MATCH Predicate.
The CrateDB SQLAlchemy dialect provides a match
function, which can be used
to search one or multiple fields.
Here’s an example use of the match
function:
>>> from sqlalchemy_cratedb import match
>>> session.query(Character.name) \
... .filter(match(Character.name_ft, 'Arthur')) \
... .all()
[('Arthur Dent',)]
In this example, we’re selecting character name
values, and returning all
rows where the name_ft
index matches the string Arthur
.
Note
To use fulltext searches on a column, an explicit fulltext index with an analyzer must be created on the column. Consult the documentation about Fulltext indices for more information.
The match
function takes the following options:
match(column, term, match_type=None, options=None)
column
:A reference to a column or an index:
match(Character.name_ft, 'Trillian')
Or a subcolumn:
match(Character.details['name']['first'], 'Trillian')
Or a dictionary of the same, with boost values:
match({Character.name_ft: 0.5, Character.details['name']['first']: 0.8, Character.details['name']['last']: 0.2}, 'Trillian')
See also
The arguments reference of the MATCH Predicate has more in-depth information.
term
:The term to match against.
This string is analyzed and the resulting tokens are compared to the index.
match_type
:(optional)
The Match Types.
Determine how the
term
is applied and the _score gets calculated. See also score usage.Here’s an example:
match({Character.name_ft: 0.5, Character.details['name']['first']: 0.8, Character.details['name']['last']: 0.2}, 'Trillian', match_type='phrase')
options
:(optional)
The match options.
Specify match type behaviour. (Not possible without a specified match type.)
Match options must be supplied as a dictionary:
match({Character.name_ft: 0.5, Character.details['name']['first']: 0.8, Character.details['name']['last']: 0.2}, 'Trillian', match_type='phrase' options={ 'fuzziness': 3, 'analyzer': 'english'})
Relevance¶
To get the relevance of a matching row, the row _score can be used. See also score usage.
The score is relative to other result rows produced by your query. The higher the score, the more relevant the result row.
The score is made available via the _score
column, which is a virtual
column, meaning that it doesn’t exist on the source table, and in most cases,
should not be included in your table definition.
You can select _score
as part of a query, like this:
>>> session.query(Character.name, '_score') \
... .filter(match(Character.quote_ft, 'space')) \
... .all()
[('Tricia McMillan', ...)]
Here, we’re matching the term space
against the quote_ft
fulltext
index. And we’re selecting the name
column of the character by using the
table definition But notice that we select the associated score by passing in
the virtual column name as a string (_score
) instead of using a defined
column on the Character
class.