Skip to content
Explore

Try CrateDB Live: More Queries

Scenario: Weather Monitoring
  • 1. Choose Scenario
  • 2. Get Ready
  • 3. Run CrateDB
  • 4. Import Data
  • 5. Explore Queries
  • 6. More Queries
  • 7. Connect
  • 8. Next Steps

B. Geospatial Queries

B1. How to extract latitude & longitude from a GEO_POINT

You can use the latitude and longitude functions to return the co-ordinates in a GEO_POINT datatype. This example will return the bounding box for all loaded data:

SELECT
  min(longitude(geo_location)) AS min_longitude,
  max(longitude(geo_location)) AS max_longitude,
  min(latitude(geo_location)) AS min_latitude,
  max(latitude(geo_location)) AS max_latitude
FROM
  demo.climate_data;

Let’s use the result of this to work out how many data points there are in a subset of the data, take this query and fill in some correct values based on the result of the above query.

SELECT COUNT(*) AS points
FROM demo.climate_data
WHERE data['latitude'] BETWEEN 47.40 AND 54.29
  AND data['longitude'] BETWEEN 6.08 AND 14.99;

If you get it right, this should return a non-zero count. If it’s zero, check your query and try again. More information on CrateDB’s geospatial capabilities is detailed here.

A more powerful way of doing this is to use the WITHIN function. This allows us to find points that lie within a polygon, which gives us the flexibility to use shapes other than a rectangle. The SQL below returns the latest temperatures from Bavaria, converted from Kelvin:

SELECT
  d.measurement_time as time,
  latitude(d.geo_location) as latitude,
  longitude(d.geo_location) as longitude,
  data['temperature'] -273.15 as temperature,
  gp.nearest_town
FROM
     demo.climate_data d
    ,demo.german_regions r
    ,demo.geo_points gp
WHERE WITHIN(d.geo_location, r.geo_coords)
AND gp.geo_location = d.geo_location
AND r.region_name = 'Bayern' 
AND  d.measurement_time = (SELECT max(d2.measurement_time) FROM demo.climate_data d2);
Bavaria region in Germany

The list of points lives in a GEO_SHAPE column called geo_coords in the german_regions table, and is made of latitude/longitude pairs that form the boundary:

longitude1 latitude1, longitude2 latitude2, longitude3 latitude3 ...

If you draw these on a map they will effectively define the shape of the Bavaria region.

Note also that we did a three table join above.

B2. Distance calculations:

CrateDB also supports distance calculations, the following query gives the nearest points to Berlin, Germany (approximate location is 52.5N, 13.4E)

SELECT
  geo_location,
  distance(geo_location, 'POINT(13.4 52.5)') AS metres,
  min(data['temperature']) -273.15 AS min_temp,
  max(data['temperature']) -273.15 AS max_temp
FROM demo.climate_data
GROUP BY geo_location, metres
ORDER BY metres ASC
LIMIT 50;

C. Full text search Queries – access Apache Lucene using SQL

CrateDB uses Apache's Lucene to provide full text search functionality. In the german_regions table we have a column called economics, which contains a 2000 word summary for the economy of each region.

In the query below, we ask it to return the three regions that have the highest score for 'cars':

SELECT region_name, _score 
FROM demo.german_regions 
WHERE MATCH(economics, 'cars') 
ORDER By _score DESC limit 3;
This returns:
region_name      _score
Thüringen        0.9552864
Niedersachsen    0.5259006
Hessen           0.38886413

What's interesting here is that the same database is doing normal RDBMS, GeoSpatial stuff and full text search.


D. Vector Search

CrateDB supports semantic ("vector") search natively via the FLOAT_VECTOR datatype and the KNN_MATCH operator. Unlike full text search, which matches on words, vector search matches on meaning. Two phrases with no words in common can still score highly if they describe the same concept.

In our demo.german_regions table the embedding column is declared as FLOAT_VECTOR(1536). Each row's embedding is a 1536-dimensional vector generated by OpenAI's text-embedding-3-small model, summarising the four text fields (tourism_info, transportation, economics, introduced_species) for that region.

To search this column you need a small Python helper: cratedb_knn_search.py, and a requirements.txt. Download them and save it next to your .env.

Then install the required packages:

pip install -r requirements.txt

The script needs an OpenAI API key. The reason is that KNN_MATCH compares vectors, not strings — so before the script can ask CrateDB "which region is closest to wine country?", it has to convert the phrase wine country into a 1536-dim vector using the same OpenAI model that produced the embeddings in the table. The key is used once per query, from your laptop, to call api.openai.com/v1/embeddings; it is never sent to CrateDB.

Set the key and run:

export OPENAI_API_KEY=sk-...
echo "wine country" | python cratedb_knn_search.py \
  --host localhost --database demo
Expected output:
Region                          Score
------------------------------  -----
#1  Bayern                        0.4088
#2  Rheinland-Pfalz               0.3928
#3  Baden-Württemberg             0.3819
#4  Saarland                      0.3800
#5  Hessen                        0.3731

Notice that the word "wine" appears nowhere in any region's name — the model ranked Rheinland-Pfalz and Baden-Württemberg (Germany's two largest wine-growing states) at the top purely on the meaning of the surrounding text. That's the difference between full text and vector search: the former needs the word; the latter needs only the idea.

E. Queries – JSON object model

E1. Insert a new value into the dataset:

INSERT INTO demo.climate_data (measurement_time, geo_location, data) VALUES (123, [8.78831111111111, 54.903], {"longitude" = 8.78831111111111, "latitude" = 54.903, "temperature" = 16.868310546875023, "u10" = 4.472952365875244, "v10" = -1.3958832025527954, "pressure" = 102426.1015625});

This is using the existing schema. What if we want to add a new 'humidity' field to our dataset, do we need to recreate the table?

In CrateDB we can define object columns (i.e. JSON) as one of three different types:

  • OBJECT(STRICT) – The schema for the object is fixed
  • OBJECT(DYNAMIC) – The schema is dynamic, and new attributes can be added
  • OBJECT(IGNORED) – The JSON is stored ignoring the datatypes of the attributes and not indexing the field

See this page for more detailed information.

When you created this table, you set the data column to be DYNAMIC, so we can add additional attributes easily, and these attributes will be indexed.

Add a new key in the JSON object without the need to modify the schema.

INSERT INTO demo.climate_data (measurement_time, geo_location, data) VALUES (123, [8.78831111111111, 54.903], {"humidity" = 43.4, "longitude" = 8.78831111111111, "latitude" = 54.903, "temperature" = 16.868310546875023, "u10" = 4.472952365875244, "v10" = -1.3958832025527954, "pressure" = 102426.1015625});

E2. Now try to query the new key:

SELECT
  data['humidity']
FROM
  demo.climate_data
WHERE
  data['humidity'] IS NOT NULL;

This makes it very easy to extend the data model, ideal for IoT applications where additional sensor information may become available.

And of course, with CrateDB all of this additional data will be indexed for great query performance.