Comparing queries: InfluxDB 2.0 (Flux) vs CrateDB (SQL)

In this blog post, we are going to look at the differences between the query syntax of InfluxDB 2.0 (Flux) and CrateDB (SQL). This query comparison is done using a public time series dataset consisting of around 70K rows of weather station measurements. The data includes timestamped records with various meteorological parameters such as temperature, humidity, wind speed, wind direction, precipitation, pressure, and general weather conditions.

InfluxDB Terminology and Data Types

First, let’s review some InfluxDB terminology. InfluxDB stores time series data in buckets and measurements, while CrateDB stores data in schemas and tables.

  • A bucket is a named location with a retention policy where time series data is stored.
  • A series is a logical grouping of data defined by a shared measurement, tag set, and field key.
  • A measurement is like an SQL database table.
  • A tag is like an indexed column in an SQL database.
  • A field is like an un-indexed column in an SQL database.
  • A point is like an SQL row.
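To make this mapping concrete, below is a minimal sketch of how such a structure could be modelled on the CrateDB side, using the demo.weather_data table that also appears in the query examples later in this post (the column types are assumptions for illustration):

-- Illustrative CrateDB table mirroring an InfluxDB measurement:
--   measurement -> table, tag -> indexed column, field -> value column, point -> row
CREATE TABLE IF NOT EXISTS demo.weather_data (
    "time"      TIMESTAMP WITH TIME ZONE,
    location    TEXT,               -- tag-like column used for filtering and grouping
    temperature DOUBLE PRECISION,   -- field-like value column
    humidity    DOUBLE PRECISION    -- field-like value column
);

Since CrateDB indexes all columns by default, the tag/field distinction from InfluxDB does not need any special handling here.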

Now, let's have a look at the supported data types in both InfluxDB and CrateDB. CrateDB covers all InfluxDB data types and additionally offers geospatial and vector types, as well as richer object and array capabilities.

 

| Type | InfluxDB | CrateDB |
| --- | --- | --- |
| Integer | Integer (64-bit signed) | Byte (8-bit), Short (16-bit), Integer (32-bit), Long (64-bit) |
| Float | Float (64-bit IEEE-754) | Float (32-bit IEEE-754), Double (64-bit IEEE-754) |
| String | String (Unicode) | Varchar or Text (variable-length Unicode) |
| Boolean | Boolean (True or False) | Boolean (True or False) |
| Timestamp | Timestamp (nanosecond-precision Unix timestamp); does NOT natively include time zone information | Timestamp (millisecond precision), timestamp with time zone |
| Array | Array (homogeneous arrays of any supported type) | Array (arrays of any other data type) |
| Object | Object (key-value pairs where keys are strings and values can be any supported type) | Object (structured as a collection of key-value pairs; an object can contain any other type, including further child objects) |
| Geospatial | Not supported natively | Geo_point (stores a point as latitude and longitude), Geo_shape (stores more complex shapes like polygons and lines) |
| Vector | Not supported natively | Float_vector (allows storing dense vectors of float values of fixed length) |

 
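As a quick illustration of the types that have no native InfluxDB counterpart, the following sketch shows how object, array, geospatial, and vector columns can be declared in CrateDB (table and column names are invented for this example; FLOAT_VECTOR requires a recent CrateDB version):

-- Illustrative only: CrateDB column types beyond what InfluxDB offers natively
CREATE TABLE IF NOT EXISTS demo.sensor_metadata (
    sensor_id  TEXT,
    attributes OBJECT(DYNAMIC) AS (      -- nested key-value structure
        vendor     TEXT,
        calibrated BOOLEAN
    ),
    readings   ARRAY(DOUBLE PRECISION),  -- homogeneous array of values
    position   GEO_POINT,                -- latitude/longitude point
    embedding  FLOAT_VECTOR(3)           -- fixed-length dense float vector
);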

Query Examples

Now, let's compare Flux and SQL queries for typical time-series scenarios.

Aggregations:

Example: Computing the average temperature for each location.

This query type helps analyze temperature trends across different locations in environmental monitoring systems, smart cities, or agriculture.

Flux:

from(bucket: "demo-bucket")
|> range(start: -100y)
|> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature")
|> group(columns: ["location"])
|> mean()

SQL:

SELECT location, AVG(temperature)
FROM "demo"."weather_data"
GROUP BY location
ORDER BY location
LIMIT 100;

 

Transformation:

Example: Converting temperature from Celsius to Fahrenheit.

This type of transformation query is useful when integrating data with systems or reports that require different units or formats, ensuring consistency and compatibility across applications.

Flux:

from(bucket: "demo-bucket")
|> range(start: -100y)
|> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature")
|> group(columns: ["location"])
|> mean()
|> duplicate(column: "_value", as: "_value_celsius")
|> map(fn: (r) => ({ r with _value_fahrenheit: (r._value * 9.0 / 5.0) + 32.0 }))
|> keep(columns: ["_time", "location", "_value_celsius", "_value_fahrenheit"])
|> yield(name: "average_temperature_celsius_and_fahrenheit")

SQL:

SELECT 
location,
AVG(temperature) AS average_temperature_celsius,
(AVG(temperature) * (9.0 / 5.0) + 32.0) AS average_temperature_fahrenheit
FROM demo.weather_data
WHERE 
time >= CURRENT_TIMESTAMP - INTERVAL '100 years'
GROUP BY location 
ORDER BY 1 
LIMIT 100;

 

Windowing:

Example: Computing the average temperature in 1-hour windows.

This query type is useful for analyzing temperature trends over time in manageable chunks, allowing for detailed temporal insights in monitoring and reporting applications.

Flux:

from(bucket: "demo-bucket")
|> range(start: 2023-08-30T00:00:00Z, stop: 2023-08-30T23:59:59Z)
|> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
|> map(fn: (r) => ({
r with
window_start: time(v: int(v: r._time) - int(v: r._time) % 3600000000000),
window_end: time(v: int(v: r._time) - int(v: r._time) % 3600000000000 + 3600000000000),
mean_temperature: r._value
}))
|> keep(columns: ["window_start", "window_end", "mean_temperature", "location"])
|> group()
|> sort(columns: ["window_start", "location"])
|> limit(n: 10)
|> yield(name: "mean_temperature_per_hour")

SQL:

SELECT
to_char(date_trunc('hour', time at time zone 'UTC'), 'YYYY-MM-DD HH24:MI:SS') AS window_start,
to_char(date_trunc('hour', time at time zone 'UTC') + INTERVAL '1 hour', 'YYYY-MM-DD HH24:MI:SS') AS window_end,
AVG(temperature) AS mean_temperature,
location
FROM demo.weather_data
WHERE
time BETWEEN '2023-08-30 00:00:00' AND '2023-08-30 23:59:59'
GROUP BY window_start, window_end, location
ORDER BY window_start, location
LIMIT 10;

 

Interpolation:

Example: Interpolating missing temperature values.

This interpolation is useful for ensuring data continuity and accuracy in analyses. It's worth noting that the formatting and sorting differ slightly between Flux and CrateDB here: InfluxDB stores timestamps with nanosecond precision, while CrateDB uses millisecond precision, offering different levels of accuracy for different needs.

Flux:

from(bucket: "demo-bucket")
|> range(start: -1y)
|> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: true)
|> duplicate(column: "_value", as: "original_value")
|> fill(column: "_value", usePrevious: true)
|> map(fn: (r) => ({ r with interpolated_value: r._value, _value: r.original_value }))
|> yield(name: "original_and_interpolated_temperature")

SQL:

WITH hourly_data AS (
SELECT 
date_trunc('hour', time) AS window_start,
AVG(temperature) AS original_value,
location
FROM 
demo.weather_data
WHERE 
time >= CURRENT_TIMESTAMP - INTERVAL '1 year'
GROUP BY 
window_start, location
),
filled_data AS (
SELECT 
window_start,
location,
original_value,
COALESCE(original_value, LAG(original_value) OVER (PARTITION BY location ORDER BY window_start)) AS interpolated_previous_value,
COALESCE(original_value, LEAD(original_value) OVER (PARTITION BY location ORDER BY window_start)) AS interpolated_next_value
FROM 
hourly_data
)
SELECT 
window_start AS timestamp,
location,
original_value AS temperature_celsius,
interpolated_previous_value AS interpolated_previous_temperature,
interpolated_next_value AS interpolated_next_temperature
FROM 
filled_data
WHERE original_value IS NULL
LIMIT 10;
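The query above only carries the previous and next non-null readings alongside each gap. If an actual linear estimate is wanted for a single missing hour, the final SELECT of the statement above could, as one possible sketch, average those two neighbours:

-- Sketch: replaces the final SELECT and reuses the filled_data CTE defined above
SELECT 
window_start AS timestamp,
location,
(interpolated_previous_value + interpolated_next_value) / 2.0 AS interpolated_temperature
FROM 
filled_data
WHERE original_value IS NULL
LIMIT 10;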

 

Downsampling:

Example: Computing the max temperature per week for each location in the last 12 months.

This query type is useful for long-term trend analysis and reporting, allowing you to focus on significant temperature peaks and reduce data volume while preserving critical information about extreme weather conditions.

Flux:

from(bucket: "demo-bucket")
|> range(start: -1y)
|> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature")
|> aggregateWindow(every: 7d, fn: max, createEmpty: false)
|> map(fn: (r) => ({ r with window_max_time: r._time }))
|> keep(columns: ["_start", "_stop", "window_max_time", "_value", "location", "_field", "_measurement"])
|> yield(name: "weekly_max_temperature")

SQL:

WITH weekly_data AS (
SELECT 
date_trunc('week', time) AS window_start,
date_trunc('week', time) + INTERVAL '7 days' AS window_end,
MAX(temperature) AS max_temperature,
MAX(time) AS window_max_time,
location
FROM 
demo.weather_data
WHERE 
time >= CURRENT_TIMESTAMP - INTERVAL '1 year'
GROUP BY 
window_start, window_end, location
)
SELECT 
window_start AS _start,
window_end AS _stop,
window_max_time,
max_temperature AS _value,
location,
'temperature' AS _field,
'weather_data' AS _measurement
FROM 
weekly_data
ORDER BY 
window_start, location
LIMIT 100;

 

Grouping multiple columns:

Example: Computing the average temperature and humidity for each location.

This query helps gain comprehensive insights into environmental conditions at different locations, which can be applied in climate studies, smart city management, and agricultural monitoring.

Flux:

from(bucket: "demo-bucket")
|> range(start: -1y)
|> filter(fn: (r) => r._measurement == "weather_data" and (r._field == "temperature" or r._field == "humidity"))
|> group(columns: ["location", "_field"])
|> mean()

SQL:

SELECT 
location,
AVG(temperature) AS average_temperature,
AVG(humidity) AS average_humidity
FROM demo.weather_data
WHERE 
time >= CURRENT_TIMESTAMP - INTERVAL '1 year'
GROUP BY location
ORDER BY location;

 

Filtering and Conditional Logic:

Example: Getting all temperature readings above 30 degrees Celsius over the past 11 months.

This type of query is useful for identifying periods of high temperature, which can be crucial for heatwave analysis, monitoring climate change impacts, and managing energy consumption in response to extreme weather conditions.

Flux:

from(bucket: "demo-bucket")
|> range(start: -11mo)
|> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature" and r._value > 30.0)

SQL:

SELECT time,
location,
temperature
FROM 
demo.weather_data
WHERE 
time >= CURRENT_TIMESTAMP - INTERVAL '11 months'
AND temperature > 30.0 
ORDER BY location;

 

Combining Queries:

Example: Computing both the average and maximum temperature for each location.

Flux:

temperature_mean = from(bucket: "demo-bucket")
  |> range(start: -1y)
  |> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature")
  |> group(columns: ["location"])
  |> mean()

temperature_max = from(bucket: "demo-bucket")
  |> range(start: -1y)
  |> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature")
  |> group(columns: ["location"])
  |> max()

join(tables: {mean: temperature_mean, max: temperature_max}, on: ["location"])

SQL:

SELECT 
    location,
    AVG(temperature) AS average_temperature,
    MAX(temperature) AS max_temperature
FROM 
    demo.weather_data
WHERE 
    time >= CURRENT_TIMESTAMP - INTERVAL '1 year'
GROUP BY 
    location
ORDER BY location;

 

Moving Average:

Example: Computing, for each timestamp, the moving average of the last 5 temperature readings.

This type of query is helpful for smoothing temperature data to reveal longer-term trends, which can be valuable in time-series analysis for fields such as climate monitoring, finance, and performance metrics.

Flux:

from(bucket: "demo-bucket")
|> range(start: 2023-08-30T00:00:00Z, stop: 2023-08-30T23:59:59Z)
|> filter(fn: (r) => r._measurement == "weather_data" and r._field == "temperature" and r.location == "Berlin")
|> movingAverage(n: 5)

SQL:

SELECT
    to_char(time, 'DD-MM-YYYY HH24:MI:SS'),
    location,
    temperature,
    AVG(temperature) OVER (
        PARTITION BY location
        ORDER BY time
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) AS moving_average_temperature
FROM
    demo.weather_data
WHERE
    location = 'Berlin'
AND
    time BETWEEN '2023-08-30 00:00:00' AND '2023-08-30 23:59:59'
ORDER BY
    time DESC;