Getting started with Apache Airflow

Automate CrateDB queries with Apache Airflow.

Introduction

This guide shows how to use Apache Airflow with CrateDB to automate recurring queries.

You will:

  • learn about Astronomer, a managed Apache Airflow platform,

  • set up a local project with the Astronomer CLI, and

  • schedule and execute recurring queries with simple examples.

Apache Airflow

Apache Airflow is a platform for programmatically creating, scheduling, and monitoring workflows (see the official Airflow documentation). A workflow is a directed acyclic graph (DAG) in which each node represents a task and each edge represents a dependency between tasks. Tasks run independently of one another, and the DAG defines the order in which they run. You can run DAGs on demand or on a schedule (for example, twice a week).
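To make the DAG idea concrete, here is a minimal sketch in plain Python (3.9+, standard library only). It is not Airflow's API: the task names and the helper functions are hypothetical, and the snippet only illustrates how a dependency graph yields a valid execution order and why cycles are not allowed.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
DEPENDENCIES = {
    "create_table": set(),
    "insert_sensor_data": {"create_table"},
    "insert_weather_data": {"create_table"},
    "build_report": {"insert_sensor_data", "insert_weather_data"},
}


def execution_order(deps):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(deps).static_order())


def is_valid_dag(deps):
    """Return True if the dependency graph contains no cycle."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    print(is_valid_dag({"a": {"b"}, "b": {"a"}}))
```

The two insert tasks have no dependency on each other, so a scheduler is free to run them in either order or in parallel; only the constraints encoded in the graph are guaranteed.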

CrateDB

CrateDB is an open-source, distributed database for storing and analyzing large volumes of data. It offers high scalability, flexibility, and availability, supports dynamic schemas and queryable objects, and provides time series features and real-time full-text search over millions of documents in seconds.

Because CrateDB powers large-scale data workloads, many deployments automate recurring tasks. Apache Airflow’s resilient, scalable architecture makes it a strong choice for orchestrating those tasks on CrateDB.

Astronomer

Since 2014, Apache Airflow and its ecosystem have grown significantly. To run Airflow in production, you need to understand both Airflow and the underlying deployment infrastructure.

To simplify operations, use a managed Apache Airflow provider such as Astronomer. Astronomer runs on Kubernetes, abstracts infrastructure details, and provides a clean interface for building and operating workflows.

1. Set up a local Airflow project

The examples use an 8‑core machine with 30 GB RAM on Ubuntu 22.04 LTS. Install the Astronomer CLI (requires Docker 18.09+). On Ubuntu:

curl -sSL install.astronomer.io | sudo bash -s

Verify the installation:

astro version

Example output:

Astro CLI Version: 1.14.1

For other operating systems, follow the official documentation. After installing the Astronomer CLI, initialize a new project:

  • Create a project directory:

    mkdir astro-project && cd astro-project
    
  • Initialize the project with the following command:

    astro dev init
    
  • The command creates the following project skeleton:

    ├── Dockerfile
    ├── README.md
    ├── airflow_settings.yaml
    ├── dags
    ├── include
    ├── packages.txt
    ├── plugins
    ├── requirements.txt
    └── tests
    

The Astronomer project consists of four Docker containers:

  • PostgreSQL server (for configuration/runtime data)

  • Airflow scheduler

  • Web server for rendering Airflow UI

  • Triggerer (running an event loop for deferrable tasks)

The PostgreSQL server listens on port 5432. The web server listens on port 8080 and is available at http://localhost:8080/ with admin/admin.

If these ports are already in use, change them in .astro/config.yaml. For example, set the webserver to 8081 and PostgreSQL to 5435:

project:
  name: astro-project
webserver:
  port: 8081
postgres:
  port: 5435
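Before changing the configuration, it can help to check which ports are actually taken. The following is a small sketch using only the Python standard library; the function name port_in_use is hypothetical, and the port numbers are the defaults mentioned above.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Default ports used by the local Astronomer project.
    for service, port in {"webserver": 8080, "postgres": 5432}.items():
        state = "in use" if port_in_use(port) else "free"
        print(f"{service} port {port}: {state}")
```

Run it before astro dev start: a port reported as "in use" is a candidate for remapping in .astro/config.yaml.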

Start the project with astro dev start. After the containers start, access the Airflow UI at http://localhost:8081 (or at http://localhost:8080 if you kept the default port).

Figure: Airflow UI landing page

The landing page lists all DAGs with their status, the times of the last and next runs, and metadata such as the owner and schedule. From here you can trigger a DAG manually with the button in the Actions section, pause or unpause it with the toggle next to its name, and filter DAGs by tag. Click a DAG to see its graph of tasks and the dependencies between them.

2. Create a GitHub repository

To track the project with Git, run git init from the astro-project directory.

Go to https://github.com and create a new repository. Add files that store sensitive information (for example, credentials and environment variables) to .gitignore, such as:

.env
airflow_settings.yaml
**/secrets.*

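Then commit the files and publish astro-project to GitHub, replacing username/new_repo with your own account and repository name:

git add . && git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/username/new_repo
git push -u origin main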

The initialized astro-project now has a home on GitHub.

3. Add database credentials

To configure the CrateDB connection, set an environment variable. On Astronomer, you can set it via the UI, the Dockerfile, or the .env file (generated during initialization).

This guide uses the .env file; for the alternatives, see the Astronomer Environment variables documentation. Airflow reads any variable named AIRFLOW_CONN_<CONN_ID> as a connection definition, so the following line defines a connection with the ID cratedb_connection:

AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://<user>:<password>@<host>/?sslmode=disable
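Airflow expects the components of a connection URI to be URL-encoded, which matters as soon as a user name or password contains characters such as @, : or /. The sketch below shows one way to assemble the value for the .env file; the helper name cratedb_conn_uri and the example host and credentials are hypothetical.

```python
from urllib.parse import quote, urlencode


def cratedb_conn_uri(user, password, host, sslmode="disable"):
    """Build a PostgreSQL-style connection URI with URL-encoded credentials."""
    # safe="" makes quote() also encode "/" in the credentials.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    query = urlencode({"sslmode": sslmode})
    return f"postgresql://{credentials}@{host}/?{query}"


if __name__ == "__main__":
    # Hypothetical values; replace them with your own CrateDB credentials.
    uri = cratedb_conn_uri("admin", "p@ss:w/rd", "cratedb.example.com", "require")
    print(f"AIRFLOW_CONN_CRATEDB_CONNECTION={uri}")
```

Copy the printed line into .env rather than committing the script with real credentials.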

For TLS, set sslmode=require. To confirm that the variable is applied, start the project, look up the scheduler container name with docker ps, and open a bash session in that container: docker exec -it <scheduler_container_name> /bin/bash.

Inside the container, run env to list the environment variables.

The output includes variables that Astronomer sets by default as well as AIRFLOW_CONN_CRATEDB_CONNECTION.
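As an alternative to scanning the env output by eye, the check can be sketched in Python. The helper airflow_connections is hypothetical and uses only the standard library; it reports which AIRFLOW_CONN_* variables are present without printing the password.

```python
import os
from urllib.parse import urlsplit

PREFIX = "AIRFLOW_CONN_"


def airflow_connections(environ=os.environ):
    """Summarize every AIRFLOW_CONN_* variable, keyed by connection ID."""
    found = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue
        parts = urlsplit(value)
        # The connection ID is the lower-cased variable suffix.
        found[name[len(PREFIX):].lower()] = {
            "scheme": parts.scheme,
            "host": parts.hostname,
            "user": parts.username,
            "has_password": parts.password is not None,
            "query": parts.query,
        }
    return found


if __name__ == "__main__":
    for conn_id, summary in airflow_connections().items():
        print(conn_id, summary)
```

If the variable was applied, running this in the scheduler container should list an entry named cratedb_connection.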