Use CrateDB and Apache Superset for Open Source Data Warehousing

Written by Kevin Quarshie | 2019-07-12

Note: we have published an updated version of this blogpost

Apache Superset is an easy-to-use open-source business intelligence (BI) application that offers a rich set of customizable data visualizations. With Superset, you can explore, filter, and organize your data. You can also create and share custom monitoring and analytics dashboards.

CrateDB is a distributed SQL database that makes it simple to capture, store, and analyze massive amounts of machine data in real-time.

In this post, I'll show you how to get started with CrateDB and Apache Superset. And by the end of this post, you should have your very first Superset visualizations running on top of CrateDB data.

Note: I'm using macOS, but my instructions should be trivially adaptable for Linux or Microsoft Windows.

Set up a working directory

Create a new directory

$ mkdir crate-superset-demo

Change into it:

$ cd crate-superset-demo

We want to use this as a working directory for the rest of this tutorial, so make sure to run all subsequent commands from here.

Set up CrateDB

Get CrateDB set up with a single command:

$ bash -c "$(curl -L https://try.crate.io/)"

This command downloads CrateDB and runs it from the tarball. If you'd like to install CrateDB a bit more permanently, or you are using Windows, check out our collection of one-step install guides.

If you're using the command above, it should pop open the CrateDB admin UI for you automatically once it has finished. Otherwise, head over to http://localhost:4200/ in your browser.

If you ran the command above, it should pop open the CrateDB Admin UI for you automatically once it has finished. Otherwise, head over to http://localhost:4200/ in your browser.

You should see something like this:

Get some data

If you’re playing around with a fresh CrateDB install, likely, you don't have any data. So head on over to the Help screen by selecting the question mark icon on the left-hand navigation menu.

The help screen looks like this:

Select IMPORT TWEETS FOR TESTING and follow the instructions to authenticate your Twitter account.

Don’t worry. This isn’t going to post anything on your behalf. It doesn’t even look at your tweets. All this does is import a bunch of recent public tweets on Twitter.

Once you're done, select the Tables icon from the left-hand navigation, and then select the tweets table. You should end up here:

http://localhost:4200/#/tables/doc/tweets

Which should look like this:

Great! CrateDB is set up and running, and you have some test data.

For this demo, I want to show you how to graph a pie chart of the different Twitter clients used to post the tweets you just imported.

To do that, we need to group on the information in the source column (i.e., the client identification). However, Twitter returns this data as HTML, which we don't want.

For expediency, let's create a view that strips out the HTML with a simple regular expression. (Note: You should not process HTML with regular expressions for any real-world application!)

Navigate to the Console tab, using the left-hand navigation, and then execute the following query in the query console:

CREATE VIEW tweets_source AS
    SELECT id,
           regexp_matches(source, '>(.*)<')[1] AS source_txt
      FROM tweets;

You should see this:

You can test the view with this query:

SELECT * FROM tweets_source limit 100;

Like so:

Nice. Let's move on.

Install Apache Superset

For this tutorial, we'll stick to the official documentation and install Superset inside a Python virtual environment.

Check which version of Python you're running:

$ python3 --version
Python 3.7.2

If you're using Python 3.3 or newer, run this command to set up a Python virtual environment:

$ python3 -m venv env

If you're running an older version of Python, consult the Superset docs for help installing the virtualenv module.

With that done, you can activate the virtual environment, like so:

$ source env/bin/activate

(For more help working with Python virtual environments, consult the Python documentation.)

Before we continue, and as a precautionary measure, make sure that your Python packaging tools are up-to-date:

$ pip install --upgrade setuptools pip

Unfortunately, at the time of writing, Superset doesn't specify its package dependencies correctly. So, we need a workaround.

First of all, fetch the most up-to-date requirements.txt file:

$ wget https://raw.githubusercontent.com/apache/incubator-superset/master/requirements.txt

Then, install the required Python packages:

$ pip install -r requirements.txt

Then, add this:

$ pip install sqlalchemy==1.2.18

Now we're ready to install Superset:

$ pip install superset

Finally, install the CrateDB Python Client library so that Superset can speak to CrateDB:

$ pip install crate

Great! Now, let's get Superset running.

Run Superset

Start by initializing the Superset database:

$ superset db upgrade

INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 4e6a06bad7a8, Init
INFO  [alembic.runtime.migration] Running upgrade 4e6a06bad7a8 -> 5a7bad26f2a7, empty message
INFO  [alembic.runtime.migration] Running upgrade 5a7bad26f2a7 -> 1e2841a4128, empty message
INFO  [alembic.runtime.migration] Running upgrade 1e2841a4128 -> 2929af7925ed, TZ offsets in data sources
[...]
INFO  [alembic.runtime.migration] Running upgrade 7fcdcde0761c -> 0c5070e96b57, add user attributes table
INFO  [alembic.runtime.migration] Running upgrade 0c5070e96b57 -> 1a1d627ebd8e, position_json
INFO  [alembic.runtime.migration] Running upgrade 1a1d627ebd8e -> 55e910a74826, add_metadata_column_to_annotation_model.py

Then, set up the environment:

$ export FLASK_APP=superset

Create an admin user:

$ flask fab create-admin

Username [admin]: 
User first name [admin]: 
User last name [user]: 
Email [admin@fab.org]: 
Password: 
Repeat for confirmation: 
Recognized Database Authentications.
Admin User admin created.

Set up the default roles and permissions:

$ superset init

2019-07-12 14:43:01,037:INFO:root:Creating database reference
2019-07-12 14:43:01,055:INFO:root:Syncing role definition
2019-07-12 14:43:01,121:INFO:root:Syncing Admin perms
[...]
2019-07-12 14:43:02,510:INFO:root:Creating missing database permissions.
2019-07-12 14:43:02,530:INFO:root:Creating missing metrics permissions
2019-07-12 14:43:02,536:INFO:root:Cleaning faulty perms

Finally, run Superset:

$ superset run -p 8080 --with-threads --reload --debugger

2019-07-12 14:43:45,211:INFO:werkzeug: * Running on http://127.0.0.1:8080/ (Press CTRL+C to quit)
2019-07-12 14:43:45,213:INFO:werkzeug: * Restarting with stat
2019-07-12 14:43:48,036:WARNING:werkzeug: * Debugger is active!
2019-07-12 14:43:48,058:INFO:werkzeug: * Debugger PIN: 280-982-752

Now, visit http://127.0.0.1:8080/ in your browser:

Afterward, you should see this:

Fantastic!

Connect to CrateDB

Select Sources → Databases from the top navigation menu:

Then, select the Add new record button (the icon the top right-hand corner of the screen).

On the Add Database screen, fill out the following information:

Database: CrateDB
SQLAlchemy URI: crate://crate@127.0.0.1:4200

Like so:

Then, select the Test Connection button.

You should be presented with a modal popup window saying "Seems OK!"

For now, we can ignore all of the other configuration options. Scroll to the bottom of the page and select Save.

Here's your created database source:

Now, select Sources → Tables from the top navigation menu:

We're going to set up a source for the tweets_source view we created earlier.

Select the Add new record button (the icon the top right-hand corner of the screen).

On the Import a table definition screen, fill out the following information:

Database: CrateDB
Schema: doc (the default CrateDB schema)
Table Name: tweets_source

Like so:

This works because views function as tables!

Select Save.

Because our view is so simple, there's no need to configure anything. You can ignore the blue notification at the top of the screen.

Create a pie chart

Select Charts from the top navigation menu.

Then, select the Add new record button (the icon the top right-hand corner of the screen).

On the Create a new chart screen, fill out the following information:

Datasource: docs.tweet_source
Visualization type: Pie Chart

Like so:

Select Create new chart.

You should be presented with the following screen:

From here:

In the Time section of the left-hand configuration panel, select Last week, change it to No filter, and then select Ok.
In the Query section, select Group by and choose source_txt.

Now, select the Run Query button in the main workspace area.

Et voila! A pie chart.

You can move your cursor over the pie chart to get specific values.

In my case, Twitter for iPhone was responsible for 524 tweets, and Twitter for Android was responsible for 494.

From here, you can title the chart and save it.

Wrap Up

Apache Superset is a modern, enterprise-ready business intelligence web application. And CrateDB is a distributed SQL database. Together, they provide a powerful open source platform for data warehousing and business intelligence.

In this post, I showed you how to:

Get CrateDB up-and-running
Import some test data
Create a view
Install Apache Superset
Set up a connection to CrateDB
Visualize your data

From here, you might want to go on to create your first Superset dashboard.

Questions? Looking for help? Get in touch!

View full post