Live Stream on Jan 23rd: Unlocking Real Time Insights in the Renewable Energy Sector with CrateDB

Register now
Skip to content
Blog

Geospatial Queries with CrateDB

This article is more than 4 years old

CrateDB is a simple to use database backend for many applications. Nowadays a huge percentage of data is geo-tagged already, no matter whether it comes from statically placed sensors (e.g. weather stations) or from constantly moving mobile devices (e.g. smart phones).

CrateDB provides a specific data type for scenarios where it is necessary to track and analyze geo coordinates. The geo_point type can be used to store a latitude and longitude value pair in a single entity. Columns with the geo_point are represented and inserted using an array of doubles or WKT string representation:

[ <lon>, <lat> ]

or:

'POINT( <lon> <lat> )'

See GEO_POINT documentation.

Note that there is no comma between longitude and latitude in the WKT syntax!

Data Structure

Let's assume we have an application that stores pings from mobile devices that can send any payload to the server. The schema might look something like this:

create table ping (
    device_id string primary key,
    ts timestamp primary key, 
    location geo_point, 
    accuracy float, 
    speed float, 
    payload object(dynamic) as ( device_time timestamp, ... )
) clustered by (device_id) into 6 shards; 

In the real world you'd probably want to think about sharding and replication a bit, but we want to stay away from that in this example to keep it simple ;-) The really interesting thing to look at here is the column location which is of type geo_point.

Insert Data Sets

As mentioned earlier already we can use both WKT literal and array syntax to insert a geo_point.

insert into ping (
    device_id, ts, location, accuracy, speed
) values (
    '642a57d6-0c30-418c-a4ec-6e10a9454e2e',
    1405503872397,
    'POINT(9.744417 47.413417)',
    100.0,
    0.0
); 

is the same as:

insert into ping (
    device_id, ts, location, accuracy, speed
) values (
    '642a57d6-0c30-418c-a4ec-6e10a9454e2e',
    1405503872397,
    [9.744417, 47.413417],
    100.0,
    0.0
); 

Both techniques however require longitude as their first and latitude as their second position. This is different to the commonly used [lat,lon] representation of geo coordinates (e.g. Google Maps)!

Geo Queries

Let's assume we've inserted thousands of pings from multiple devices over a certain period of time already. You probably want to analyse the location data or at least display it in some way. You will agree with me that this is the fun part :) CrateDB provides two scalar functions which can utilise the geo_point data type: distance and within. With these 2 functions you'll be able to cover a lot of use cases already.

distance(point1, point2)

The distance function can be used to obtain the distance between two geo points on earth. It will return the orthodromic distance of the two coordinates in metres. See docs Let's take a look at an example using our existing data structure from above. You can get nearby devices by querying pings whose distance to a given point is less than a certain value, e.g. 1,000m within a point in London.

  select device_id,
         distance(location, 'POINT(-0.1275 51.507222)') as distance
    from pings
   where distance(location, 'POINT(-0.1275 51.507222)') < 1000
order by distance; 

Again, here we can use both the WKT string literal and array syntax. Since the distance returns the value in metres, we can use the built-in arithmetical operators to convert the distance into miles.

select device_id,
       distance(location, [-0.1275, 51.507222])/1000.0 as distance_in_km,
       distance(location, [-0.1275, 51.507222])/1609.34 as distance_in_miles
  from pings
 where distance(location, [-0.1275, 51.507222]) > 100000
   and distance(location, [-0.1275, 51.507222]) < 160934; 

This query would give us all devices within an annulus shape where the inner radius is 100km and the outer radius 100miles (~161km). Another great use of distance is in combination with avg, min and max scalar functions, for example if you want to measure the average distance between 2 devices. First you'll need to get the location of the first device and then you can apply the average function on the returned distance value. Assuming the first device location is 'POINT( 13.383333 52.516667 )' (somewhere in Berlin).

select avg(distance(location, 'POINT(13.383333 52.516667)')) as avg_distance
  from pings
 where device_id='642a57d6-0c30-418c-a4ec-6e10a9454e2e'; 

The distance function covers already a lot of use cases, but sometimes you need something more sophisticated. If you want to know if a ping comes from a certain country or city, drawing a circle around a point is not accurate enough. This is where shapes and the within scalar function comes into play.

within(shape1, shape2)

The within function determines whether a shape or point is located within another shape. See docs Using our data structure again, we can fetch all pings that happened from devices located within Austria. To do that we first need to know the geo shape of Austria. Since most data is provided as GeoJSON you'll need to convert it into a WKT polygon so you can use it with CrateDB.

Convert GeoJSON to WKT polygon https://gist.github.com/chaudum/1bd42ed71647af616676

cat AUT.geo.json | python geojson2polygon.py 
select count(*) from pings
 where within(location, 'POLYGON((
         16.979667 48.123497, 16.903754 47.714866, 16.340584 47.712902,
         16.534268 47.496171, 16.202298 46.852386, 16.011664 46.683611,
         15.137092 46.658703, 14.632472 46.431817, 13.806475 46.509306,
         12.376485 46.767559, 12.153088 47.115393, 11.164828 46.941579,
         11.048556 46.751359, 10.442701 46.893546,  9.932448 46.920728,
          9.47997 47.10281,    9.632932 47.347601,  9.594226 47.525058,
          9.896068 47.580197, 10.402084 47.302488, 10.544504 47.566399,
         11.426414 47.523766, 12.141357 47.703083, 12.62076 47.672388, 
         12.932627 47.467646, 13.025851 47.637584, 12.884103 48.289146,
         13.243357 48.416115, 13.595946 48.877172, 14.338898 48.555305,
         14.901447 48.964402, 15.253416 49.039074, 16.029647 48.733899,
         16.499283 48.785808, 16.960288 48.596982, 16.879983 48.470013,
         16.979667 48.123497
       ))'); 

That query will return the number of pings that happened from within Austria. Although Austria has a simple shape and the level of detail of the provided polygon is not very high, you can see that queries with shapes quickly become long and unclear. However, you probably use a CrateDB client and not SQL directly, so you can use parameter substitution. E.g. for Python there are several libraries that can convert GeoJSON into WKT and vice versa. Here's a pseudo code Python style example:

shape = json.loads(geojson_string)
cursor.execute("""
    select count(*) 
      from pings
     where within(location, ?)
""", (geojson_to_wkt(shape),));
cursor.fetchall() 

Note, that there is still a limitation to the within function: Multi-polygons are not supported. This does effect for example countries that have multiple island and therefore consist of multiple polygons. You can work around this problem by using the OR operator and multiple shapes in the WHERE clause.

Conclusion

You may say that CrateDB's support for geo queries is basic at the moment, but I think that the geo_point data type in combination with the distance and within scalar functions do actually cover a lot of real life use cases. It is just a matter of how you use the given functionality.

Geo functions described in this blog post are available in CrateDB version 0.40 and higher.

We would love to hear, how you use the geo support in CrateDB for your projects. Let us know!