Support Features
The package bundles a number of support and utility functions that fill gaps you may observe when working with CrateDB, a distributed OLAP database, since it lacks certain features usually found in traditional OLTP databases.
Some of the features outlined below are referred to as polyfills. They emulate functionality, for example to satisfy compatibility requirements of downstream frameworks or test suites. Use them at your own discretion, and know what you are doing, as some of them can seriously impact performance.
Other features are efficiency utilities for 3rd-party frameworks, which can be used to increase performance, mostly on INSERT operations.
Bulk Support for pandas and Dask
Background
CrateDB’s bulk operations interface enables efficient INSERT, UPDATE, and DELETE operations on batches of data. Each batch is executed as a single call on the database server.
Utility
The insert_bulk utility provides efficient bulk data transfers when using dataframe libraries like pandas and Dask. The DataFrame operations documentation dedicates a whole page to related topics, such as choosing the right chunk sizes, concurrency settings, and beyond.
Synopsis
Pass method=insert_bulk to pandas’ or Dask’s to_sql() method.
import sqlalchemy as sa
from sqlalchemy_cratedb.support import insert_bulk
from pueblo.testing.pandas import makeTimeDataFrame

# Create a pandas DataFrame, and connect to CrateDB.
df = makeTimeDataFrame(nper=42, freq="S")
engine = sa.create_engine("crate://")

# Insert content of DataFrame using batches of records.
df.to_sql(
    name="testdrive",
    con=engine,
    if_exists="replace",
    index=False,
    method=insert_bulk,
)
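The same method also plugs into Dask. A minimal sketch, assuming the dask package is installed; the npartitions, chunksize, and parallel values are illustrative, not recommendations.

import dask.dataframe as dd
from pueblo.testing.pandas import makeTimeDataFrame
from sqlalchemy_cratedb.support import insert_bulk

# Create a Dask DataFrame by partitioning a pandas DataFrame.
df = makeTimeDataFrame(nper=42, freq="S")
ddf = dd.from_pandas(df, npartitions=4)

# Insert content of DataFrame using batches of records,
# writing partitions in parallel.
ddf.to_sql(
    name="testdrive",
    uri="crate://",
    if_exists="replace",
    index=False,
    chunksize=10_000,
    parallel=True,
    method=insert_bulk,
)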
Context Manager table_kwargs
Background
CrateDB’s special SQL DDL options supporting Partitioned tables, Sharding, or Replication sometimes can’t be configured easily when SQLAlchemy is wrapped into a 3rd-party framework like pandas or Dask.
Utility
The table_kwargs utility is a context manager that forwards CrateDB’s dialect-specific table creation options to the sa.Table() constructor call sites at runtime.
Synopsis
Using a context manager incantation like the one outlined below will render a PARTITIONED BY ("time") SQL clause, without touching the call site of sa.Table(...).
from sqlalchemy_cratedb.support import table_kwargs
with table_kwargs(crate_partitioned_by="time"):
    df.to_sql(...)
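Other dialect-specific options can be forwarded the same way. A minimal sketch; the crate_number_of_shards option is an assumption about the dialect’s table creation options, and all values are illustrative.

import pandas as pd
import sqlalchemy as sa
from sqlalchemy_cratedb.support import table_kwargs

df = pd.DataFrame({"time": pd.date_range("2024-01-01", periods=3), "value": [1.1, 2.2, 3.3]})
engine = sa.create_engine("crate://")

# Render both `PARTITIONED BY ("time")` and `CLUSTERED INTO 4 SHARDS`
# clauses on the table created by pandas, without touching sa.Table(...).
# `crate_number_of_shards` is assumed to be supported by the dialect.
with table_kwargs(crate_partitioned_by="time", crate_number_of_shards=4):
    df.to_sql("testdrive", con=engine, index=False)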
Synthetic Autoincrement using Timestamps
Background
CrateDB does not provide traditional sequences or SERIAL data type support, which would enable automatically assigning incremental values when inserting records.
Utility
The patch_autoincrement_timestamp utility emulates autoincrement / sequential ID behavior for designated columns, based on assigning timestamps on record insertion. It will simply assign sa.func.now() as a column default on the ORM model column. It works with the SQLAlchemy column types sa.BigInteger, sa.DateTime, and sa.String. You can use it if adjusting ORM models for your database adapter is not an option.
Synopsis
After activating the patch, you can use autoincrement=True on column definitions.
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy_cratedb.support import patch_autoincrement_timestamp

# Enable patch.
patch_autoincrement_timestamp()

# Define database schema.
Base = declarative_base()

class FooBar(Base):
    __tablename__ = "foobar"
    id = sa.Column(sa.DateTime, primary_key=True, autoincrement=True)
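For illustration, a record can then be inserted without assigning an id explicitly; the patched column default fills it with the current timestamp. A minimal sketch, continuing the example above.

import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker

# Create the table and insert a record; the `id` column receives
# the current timestamp through its patched column default.
engine = sa.create_engine("crate://")
Base.metadata.create_all(engine)

session = sessionmaker(bind=engine)()
session.add(FooBar())
session.commit()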
Warning
CrateDB’s TIMESTAMP data type provides millisecond granularity. This has to be considered when evaluating collision safety in high-traffic environments.
Synthetic Table REFRESH after DML
Background
CrateDB is eventually consistent. Data written with a previous statement is not guaranteed to be fetched by the next SELECT statement for the affected rows.
Data written to CrateDB is flushed periodically; the refresh interval is 1000 milliseconds by default, and can be changed. More details can be found in the reference documentation about table refreshing.
There are situations where stronger consistency is required, for example when needing to satisfy test suites of 3rd-party frameworks, which usually do not take this special behavior of CrateDB into consideration.
Utility
The refresh_after_dml utility will configure an SQLAlchemy engine or session to automatically invoke REFRESH TABLE statements after each DML operation (INSERT, UPDATE, DELETE). Only relevant (dirty) entities / tables will be considered for refreshing.
Synopsis
import sqlalchemy as sa
from sqlalchemy_cratedb.support import refresh_after_dml

# Configure an engine to automatically refresh tables after DML operations.
engine = sa.create_engine("crate://")
refresh_after_dml(engine)

import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
from sqlalchemy_cratedb.support import refresh_after_dml

# Configure a session to automatically refresh tables after DML operations.
engine = sa.create_engine("crate://")
session = sessionmaker(bind=engine)()
refresh_after_dml(session)
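To illustrate the effect, a record committed through a configured session is immediately visible to a subsequent query, without waiting for the periodic refresh interval. A minimal sketch; the Item model is hypothetical.

import sqlalchemy as sa
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy_cratedb.support import refresh_after_dml

Base = declarative_base()

class Item(Base):
    __tablename__ = "item"   # Hypothetical table for demonstration.
    id = sa.Column(sa.String, primary_key=True)

engine = sa.create_engine("crate://")
Base.metadata.create_all(engine)

session = sessionmaker(bind=engine)()
refresh_after_dml(session)

# Without the automatic refresh, this query might not observe
# the freshly inserted record yet.
session.add(Item(id="1"))
session.commit()
print(session.query(Item).count())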
Warning
Refreshing the table after each DML operation can cause serious performance degradation, and should only be used on low-volume, low-traffic data, when applicable, and if you know what you are doing.
Synthetic UNIQUE Constraints
Background
CrateDB does not provide UNIQUE constraints in DDL statements. Because of its distributed nature, supporting such a feature natively would cause expensive database cluster operations, negating many benefits of using database clusters in the first place.
Utility
The check_uniqueness_factory utility emulates “unique constraints” functionality by querying the table for existing values before invoking SQL INSERT operations. It uses SQLAlchemy ORM events, more specifically the before_insert mapper event.
When the uniqueness constraint is violated, the adapter will raise a corresponding exception:
IntegrityError: DuplicateKeyException in table 'foobar' on constraint 'name'
Synopsis
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy.event import listen
from sqlalchemy_cratedb.support import check_uniqueness_factory

# Define database schema.
Base = declarative_base()

class FooBar(Base):
    __tablename__ = "foobar"
    id = sa.Column(sa.String, primary_key=True)
    name = sa.Column(sa.String)

# Add synthetic UNIQUE constraint on `name` column.
listen(FooBar, "before_insert", check_uniqueness_factory(FooBar, "name"))
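For illustration, inserting two records with the same name then raises the exception shown above. A minimal sketch, continuing the example, and subject to the consistency caveat in the note below.

import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker

engine = sa.create_engine("crate://")
Base.metadata.create_all(engine)

session = sessionmaker(bind=engine)()

# The first record is accepted.
session.add(FooBar(id="1", name="alpha"))
session.commit()

# Inserting another record with the same `name` raises
# IntegrityError: DuplicateKeyException ... on constraint 'name'.
session.add(FooBar(id="2", name="alpha"))
session.commit()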
Note
This feature will only work well if table data is consistent, which can be ensured by invoking a REFRESH TABLE statement after any DML operation. For conveniently enabling “always refresh”, please refer to the documentation section about Synthetic Table REFRESH after DML.
Warning
Querying the table before each INSERT operation can cause serious performance degradation, and should only be used on low-volume, low-traffic data, when applicable, and if you know what you are doing.