Spatial SQL with PostGIS and Postgres: What makes spatial queries special?

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Many of you already know that PostGIS is an extension to PostgreSQL which adds support for geographic objects, allowing some really powerful location queries to be run in PostgreSQL.

After you’ve installed PostGIS and enabled PostGIS by running the CREATE EXTENSION command in Postgres, you can load data into your Postgres database and begin experimenting with spatial SQL.

What do spatial queries have to offer in PostGIS? In this blog post, I’ve tried to pull together some tips and tricks for just that. These are tips that are either essential for spatial analytics (e.g. spatial joins) or are useful to understand (e.g. geometry validity) in this context.

There are many forms of spatial joins

With non-spatial data, performing a JOIN between tables usually requires a common key. For example, by collecting statistical information from municipalities, you can join them in PostgreSQL with a simple right join by using the municipality ID, without any spatial operations.

Spatial joins are likely the most common form of spatial analytics in Postgres (using PostGIS, of course). Location can be the only commonality between two pieces of data.

If you think about that for a while, it’s actually quite mind blowing.

If you have two sets of data, for example mobile phone locations and cafe locations, you can draw completely new information from the intersection of the two.

For example, assuming that the points on the map below represented GPS traces from mobile phones, you could determine which was the closest cafe (filtered from all the other buildings) at any given point in time.

map-gps-traces-mobile-phones-tjukanov-azure-postgres-blog-blue-purple.png

Figure 1: Map of GPS traces from mobile phones joined with map of city buildings, created with arbitrary data using QGIS

With spatial joins in PostGIS, the most typical functions are ST_Intersects, St_DWithin, and ST_Contains.

For ST_Intersects, you give two geometries as input. If any portion of geometry A intersects with geometry B, the function returns true. This function is the most common way of performing spatial joins. For an example of a simple use of ST_Intersects, see my blog post Finding the most “innovative” square kilometer in Europe with spatial SQL.
For ST_DWithin, you give two geometries and a distance as an input. If geometry B is within a given distance from geometry A, the function returns true.
ST_Contains works as you could guess from the name. You give two geometries as input, and if geometry A contains (completely) geometry B, the function returns true.

There are also many more variations on these: ST_Boundary, ST_ContainsProperly, ST_Covers, ST_CoveredBy, ST_Equals, etc. You can check for spatial joins in three dimensions (ST_3DIntersects, etc.) or for any other type of specific case you might have.

To better understand what happens behind spatial intersections, it is good to learn a bit about what PostGIS and Postgres people call the “Dimensionally Extended 9 Intersection Model” aka the DE9-IM matrix.

The most efficient way to take full advantage of the DE9-IM model in PostGIS is by using a function called ST_Relate. This is especially useful for testing compound checks of intersection, crosses, etc., in a single step.

Knowing your neighbour, a powerful spatial analytics question in PostGIS

Besides asking “where”, a typical question with spatial analytics is “what is the closest” or “how far” something is. In the term “K nearest neighbours” (KNN), “K” represents the number of neighbours you are seeking. The “nearest neighbour” search is different from a distant search in that there is no limit in terms of distance involved—the question is all about “who is closest?”

Where is my closest bar? Where are the closest schools? Where are the closest manholes in the sewage network?

According to the PostGIS documentation on KNN: “KNN is a pure, index-based nearest neighbour search. By going up and down the spatial index, the search can find the nearest candidate geometries without using any magical search radius numbers, so the technique is suitable and highly performant even for very large tables with highly variable data densities.”

To perform these queries, PostGIS has a handy <-> operator. When given two geometries as input, the <-> operator returns the 2D distance between the two. Use in the "ORDER BY" clause provides index-assisted nearest-neighbour result sets.

The <#> operator returns distance between two floating point bounding boxes instead, possibly reading them from a spatial index. So, the <#> operator brings you results faster, which you can then use for further estimation.

Read more about nearest neighbour searches in the PostGIS documentation here.

More operators!

Besides the <-> and <#> operators, there are much more handy operators that you can use to compare geometries. Some of these are related to the spatial joins explained earlier, and in some cases the operators can be used in place of those functions.

The PostGIS cheatsheet and official documentation provide examples for the operators listed in the following table:

Operator	Returns TRUE if:
&& (A, B)	A's 2D bounding box intersects B's 2D bounding box.
&&& (A, B)	A's 3D bounding box intersects B's 3D bounding box.
&< (A, B)	A's bounding box overlaps or is to the left of B's.
&<\| (A, B)	A's bounding box overlaps or is below B's.
&> (A, B)	A's bounding box overlaps or is to the right of B's.
<< (A, B)	A's bounding box is strictly to the left of B's.
<<\| (A, B)	A's bounding box is strictly below B's.
=	A's bounding box is the same as B's; uses double precision bounding box.
>> (A, B)	A's bounding box is strictly to the right of B's.
@ (A, B)	A's bounding box is contained by B's.
\|&> (A, B)	A's bounding box overlaps or is above B's.
\|>> (A, B)	A's bounding box is strictly above B's.
~ (A, B)	A's bounding box contains B's.
~= (A, B)	A's bounding box is the same as B's.

So, if you were to search for countries that are east of Finland, you could try the following query combined with a CTE (a common table expression):

WITH a AS (SELECT * FROM countries WHERE name = 'Finland'), 
      b AS (SELECT * FROM countries) 
    SELECT b.name, a.geom FROM  
     a, 
     b 
    WHERE a.geom >> b.geom

Clustering methods in PostGIS

There are many different functions in PostGIS to modify geometries, whether you are interpolating point on a line or searching for polygon centroids. But when trying to find patterns from noise, you need to investigate clustering.

PostGIS offers a few clustering functions out of the box that can work for different use cases. In the following example, I’ve taken 6752 building centroids in Helsinki from open data sources:

Helsinki-building-centroid-PostGIS-map-purple-by-tjukanov.png

ST_ClusterKMeans returns a distance-based, K-means cluster. With this typical clustering method, you provide a multigeometry as an input and define the number of clusters beforehand. I’ve decided to divide the building data into 10 clusters. The output or each cluster, when provided a unique color, looks like this:

Helsinki-building-centroid-PostGIS-map-10-clusters-multicolor-by-tjukanov.png

Though 10 was a random number, and you can see how it divided the data into some logical clusters. Picking a number that better applies to your scenario can benefit many uses cases, for example for polygon splitting.

Another approach is density-based spatial clustering of applications with noise (DBSCAN). This is available in PostGIS under the function ST_ClusterDBSCAN, which identifies clusters by grouping together entities that are within a distance “r” of each other such that a cluster has “m” points in it. If a cluster fails to meet the m points threshold, entities are classified as noise. Using this approach, our dataset image appears as shown:

Helsinki-building-centroid-PostGIS-map-density-based-spatial-cluster-by-tjukanov.png

You can also do clustering on indexes. PostgreSQL has a nice functionality called CLUSTER that reshuffles the data in a dataset based on a given index, so the rows with close index values are physically placed close to each other in the table. This results in better performance when you are loading data on a map and panning around.

Validate your geometries

Usually when your PostGIS query is giving an error, you’ve misspelled something.

The second most common issue with spatial data is a situation where your geometries are invalid. So how to tackle those situations when you face a ‘TopologyException’ error?

Validity is most important for polygons, which define bounded areas and require a good deal of structure. Lines are very simple and cannot be invalid, nor can points.

If you face any issues, it’s first good to check geometry validity with ST_IsValid and seek more information with ST_IsValidReason.

If you need even more information, ST_IsValidDetail will return you the following:

valid_detail row, formed by a boolean (valid), stating if a geometry is valid,
a varchar (reason) stating a reason why it is invalid
a geometry (location) pointing out where it is invalid.

Attempts to make broken geometries valid are a bit like geometry simplifying operations: your first try very rarely works. But still it’s worth trying the automated solution offered by PostGIS: ST_MakeValid. This function attempts to create a valid representation of a given invalid geometry without losing any of the input vertices.

Explode geometries with ST_Dump

You might already know about the basic geometry types in PostGIS: points, lines (called “linestrings” in PostGIS lingo), and polygons in a few different forms.

An overview of these PostGIS geometries is provided in the following picture:

Table-overview-PostGIS-geometrics-with-Point-Linestring-and-Polygon-as-columns.png

In addition to these basic geometry structures, there is also the PostGIS geometry_dump structure. Geometry_dump is composed of two properties: geom and path. The geom is a geometry and path is a multi-dimensional integer array that has the exact location of the geom in the MULTI, Geometry Collection. For MULTI geometries, the path is one dimensional, looking like {1}. For single geometries, the path is an empty array. For GeometryCollection, there are multiple array elements depending on the complexity of the Geometry Collection.

When the input geometry is a simple type (POINT,LINESTRING,POLYGON), a single record is returned with an empty path array and the input geometry as geom.

When the input geometry is a collection or multi, a record for each of the collection components is returned, and the path expresses the position of the component inside the collection.

You can see a simple example of the use of ST_Dump in my blog post, where I was searching for the most isolated point in Finland:

CREATE table emptyareas 
as  
SELECT path,geom, st_area(geom) AS area 
FROM(SELECT (a.geom).path[1] AS path,(a.geom).geom AS geom 
FROM(SELECT st_dump(st_union(geom)) AS geom 
FROM grid where total = 0) AS a) AS foo

So, ST_Dump offers you a flexible way to work with geometries.

Conclusions

As with other pieces of spatial SQL, I would advise you to understand the basics of PostGIS and then combine those skills with PostgreSQL magic, as is demonstrated in this example: Bulk Nearest Neighbour using Lateral Joins

Additional materials:

About the Author, Topi Tjukanov

I train and consult on PostGIS, QGIS, and other FOSS4G (= Free Open Source Software for Geospatial) at Gispo Ltd, in Finland. I have my personal portfolio website with a gallery of geospatial visualizations created with QGIS. And I occasionally will write useful PostGIS blog posts for the Azure Postgres blog, like this one!