This post was written by Srini Kadamati and Marija Selakovic and was originally published on the Preset blog.
CrateDB is a scalable, open-source, time-series database that can handle various data types from time-series, full-text search, geospatial, and structured/semi-structured data. In addition, CrateDB offers SQL compatible interface, easy integration, and the capability to analyze massive amounts of data. These features make CrateDB an excellent choice for data-intensive analytics applications.
In this tutorial post, we'll showcase how to build the following Superset dashboard powered by data in CrateDB.
One of the easiest ways to install CrateDB on your local machine is to run the following command:
bash -c "$(curl -L https://try.crate.io/)"
This command downloads CrateDB and runs it from the tarball. It doesn’t require manual installation or extraction of release archives. If you would like to install CrateDB permanently please check out our installation guide .
The abovementioned command should open CrateDB Admin UI in your default browser. If this is not the case, access the Admin UI via the following link: https://localhost:4200. You should see the Admin UI as illustrated below:
In this article, we will explore and visualize the power consumption dataset that contains electricity usage for single houses in the city of Avignon in France. The dataset contains consumption data for several years, and it can be downloaded or used from this link.
The general idea is to explore trends in total energy consumption, differences between several measured utilities, and differences across various locations. But before we show how to import the dataset to CrateDB, let’s take a closer look into columns that describe consumption data:
Dataset provides the active power information as well as some division of the active power: kitchen, laundry, and climate controls systems.
Before we import data to CrateDB, we need to create a table with the schema as follows:
CREATE TABLE IF NOT EXISTS "doc"."power_consumption" (
"ts" TIMESTAMP WITH TIME ZONE,
"Global_active_power" REAL,
"Global_reactive_power" REAL,
"Voltage" REAL,
"Global_intensity" REAL,
"Sub_metering_1" REAL,
"Sub_metering_2" REAL,
"Sub_metering_3" REAL,
"meter_id" TEXT,
"location" GEO_POINT,
"city" TEXT
)
CLUSTERED INTO 4 SHARDS
To import the power consumption dataset, we use the COPY FROM
statement that copies data from a file into the table:
COPY "doc"."power_consumption" FROM 'path/power_consumption.json'
This command populates the table that can be further inspected from the Admin UI. To check some of the data samples, click on the power consumption table in the left navigation bar and then on the _Query Table button. The button triggers the SELECT
statement and shows the first 100 rows from the resulting output:
At this point the data is ready, and we can do further analysis and visualization with Apache Superset!
Apache Superset is the world's most popular open-source data visualization and exploration platform. The roots of Superset are in real-time analytics, from the struggles that Max Beauchemin (Preset’s founder and CEO) and his team had at Airbnb.
Ville Brofedlt and I added support for CrateDB in Superset back in 2021. This ended up being an easy task because the CrateDB team had already created a SQLAlchemy dialect for Crate.
Apache Superset
To connect to CrateDB from open source Superset, you need to install the crate Python library in the same Python context as the one that Superset is installed in.
You can read about how to accomplish just that here in the Superset docs. Here's the database specific page for Crate as well.
Preset Cloud
If you're using Preset Cloud, then the CrateDB driver is installed already for you.
Before we can start crafting charts & dashboards in Superset, we need to make our Superset instance aware of the CrateDB instance by registering the connection details.
crate://<username>:<password>@<url>:<port>
crate://crate@127.0.0.1:4200
{"connect_args":{"servers":["https://<url>:<port>"]}}
Superset ships with a diverse set of geospatial visualizations, but many of these chart plugins expect the values for Latitude and Longitude to reside in two separate columns. Currently, these values are represented in a single column titled location
set to the CrateDB GEOPOINT
data type.
In these situations, we usually have two paths forward:
If you find yourself re-using this data often, the better long term solution is to adjust the data schema in the database itself (the first option).
We'll showcase the second option right now, which works well when you're in a pinch. Calculated columns in Superset let you run any non-aggregating SQL statement that normally would go in the SELECT ..
clause and have this code augment every query & chart powered by this dataset in Superset. You can learn more here.
Let's start visualizing the data! To create a chart, click +Chart, choose the chart type, then click power_consumption from the Chose a dataset dropdown.
Map: Cluster of Homes
First, let's visualize the households clusters on a map. Select the Mapbox chart type. Then, select the following values in the Explore workflow.
Let's publish this chart to a new dashboard by clicking +Save in the top left corner:
Next, let's create a Table that totals the total energy usage by sub-meter across all households in this dataset by usage type. Here's a handy reference for the types of usage in this dataset:
Let's set the:
ts
(for timestamp)AGGREGATE
Optionally, we can rename the metrics to assign more semantic context to each one.
Here's what the final Table looks like (sorted by ts
descending):
Next, let's create a map that helps us visualize total energy usage across a zoomable, geographic area.
Longitude
and Latitude
columnsHere's the final result:
Let's end this tutorial by creating a time-series area chart that colors and stacks each energy graph by usage type.
Here's the beautiful result of these options:
Your dashboard should resemble the following screenshot:
Continue to jam on the dashboard!
Here are some helpful resources to continue your journey: