This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.
GPS has become part of our daily life. GPS is in cars for navigation, in smartphones helping us to find places, and more recently GPS has been helping us to avoid getting infected by COVID-19. Managing and analyzing mobility tracks is the core of my work. My group in Université libre de Bruxelles specializes in mobility data management. We build an open source database system for spatiotemporal trajectories, called MobilityDB. MobilityDB adds support for temporal and spatio-temporal objects to the Postgres database and its spatial extension, PostGIS. If you’re not yet familiar with spatiotemporal trajectories, not to worry, we’ll walk through some movement trajectories for a public transport bus in just a bit.
One of my team’s projects is to develop a distributed version of MobilityDB. This is where we came in touch with the Citus extension to Postgres and the Citus engineering team. This post presents issues and solutions for distributed query processing of movement trajectory data. GPS is the most common source of trajectory data, but the ideas in this post also apply to movement trajectories collected by other location tracking sensors, such as radar systems for aircraft, and AIS systems for sea vessels.
As a start, let's explore the main concepts of trajectory data management, so you can see how to analyze geospatial movement trajectories.
The following animated gif shows a geospatial trajectory of a public transport bus1 that goes nearby an advertising billboard. What if you wanted to assess the visibility of the billboard to the passengers in the bus? If you can do this for all billboards and vehicles, then you would be able to extract interesting insights for advertising agencies to price the billboards, and for advertisers who are looking to optimize their campaigns.
In the animated gif above, we simply assume that if the bus is within 30 meters to the billboard, then it is visible to its passengers. This “visibility” is indicated in the animation by the yellow flash around the billboard when the bus is within 30 meters of the billboard.
How to measure the billboard’s visibility to a moving bus using a database query?
Let’s prepare a toy PostGIS database that minimally represents the example in the previous animated gif—and then gradually develop an SQL query to assess the billboard visibility to the passengers in a moving bus.
If you are not familiar with PostGIS, it is probably the most popular extension to Postgres and is used for storing and querying spatial data. For the sake of this post, all you need to know is that PostGIS extends Postgres with data types for geometry point, line, and polygon. PostGIS also defines functions to measure the distance between geographic features and to test topological relationships such as intersection.
In the SQL code block below, first you create the PostGIS extension. And then you will create two tables:
The database is visualized in the map below. You can see that the gpsPoint table has points of two bus trips, trip 1 in blue and trip 2 in red. In the table, each point has a timestamp. The two billboards are the gray diamonds in the map.
Your next step is to find the locations where a bus is within 30 meters from a billboard—and also the durations, i.e., how long the moving bus is within 30 meters of the billboard.
This PostGIS query above does not solve the problem. Yes, the condition in the
WHERE clause finds the GPS points that are within 30 meters from a billboard. But the PostGIS query does not tell the duration of this event.
Furthermore, imagine that point 4 in trip 1 (the blue trip) was not given. This query would have then returned null. The problem with this query is that it does not deal with the continuity of the bus trip, i.e. the query does not deal with the movement trajectory of the bus.
We need to reconstruct a continuous movement trajectory out of the given GPS points. Below is another PostGIS query that would find both the locations of the billboard’s visibility to the passengers in the bus, and also the duration of how long the billboard was visible to the bus passengers.
Yes, the above PostGIS query is a rather complex one. We split the query into multiple common table expressions CTEs, to make it readable. In Postgres, CTEs give you the ability to “name” a subquery, to make it easier to write SQL queries consisting of multiple steps.
- The first CTE,
pointPairin Lines 1-7, uses the window function
lead, in order to pack every pair of consecutive points, that belong to the same bus trip, into one tuple.
- This is a preparation for the second CTE,
segmentin Lines 7-12, which then connects the two points with a line segment. This step can be seen as a linear interpolation of the path between every two GPS points.
The result of these two CTEs can be visualized in the map below:
Then the third CTE, approach Lines 12-18, finds the locations where the bus starts/ends to be within 30 meters from the billboard. This is done by drawing a circular ring with 30 meters diameter around the billboard, and intersecting it with the segments of the bus trajectory. We thus get the two points in the map below, marked with the black cross.
The last step in the earlier PostGIS query, Lines 19-22, computes the time at these two points using linear referencing, that is assuming a constant speed per segment2.
Exercise: try to find a simpler way to express the PostGIS query displayed earlier. I couldn’t. :)
The PostGIS query had to be that complex, because it programs two non-trivial concepts:
- Continuous movement trajectory: while the GPS data is discrete, we had to reconstruct the continuous movement trajectory.
- Spatiotemporal proximity: the continuous movement trajectory was used to find the location and time (i.e., spatiotemporal) during which the bus was within 30 meters from the billboard.
The good news for you is that MobilityDB can help make it easier to analyze these types of movement trajectories. MobilityDB is an extension of PostgreSQL and PostGIS that has implemented these spatiotemporal concepts as custom types and functions in Postgres.
MobilityDB: a moving object database system for Postgres & PostGIS
Let’s have a look on how to express this PostGIS query more simply using MobilityDB. Here is how the previous PostGIS query would be expressed in MobilityDB.
What you need to know about the MobilityDB query above:
- The table
busTriphas the attribute trip of type
tgeompoint. It is the MobilityDB type for storing a complete trajectory.
- The nesting of
tdwithin->atValue->getTimewill return the time periods during which a bus trip has been within a distance of 30 meters to a billboard.
- The function
atperiodsetwill restrict the bus trip to only these time periods.
- The function
astextconverts the coordinates in the output to textual format.
- Accordingly, the result shows the part of the bus trip that starts at 2020-04-21 08:38:12.507515+02 and ends at 08:38:38.929465+02.
The MobilityDB documentation describes all of MobilityDB’s operations.
Now we step back, and show the creation of the
- The first step above is to create the MobilityDB extension in the database. In Postgres, the
CASCADEoption results in executing the same statement on all the dependencies. In the query above—because PostGIS is a dependency of MobilityDB—
CASCADEwill also create the PostGIS extension, if it has not yet been created.
- The second query above creates the
busTriptable with two attributes
(tripID int, trip tgeompoint).
tgeompointis the MobilityDB type to represent a movement trajectory. The
tgeompointattribute is constructed from a temporally sorted array of instants, each of which is a pair of a spatial point and a timestamp. This construction is expressed in the query above by the nesting of
tgeompointinst -> array_agg -> tgeompointseq.
- The last
SELECTquery above shows that the
busTriptable contains two tuples, corresponding to the two trips. Every trip has the format
[point1@time1, point2@time2, ...].
Bigger than an elephant: how to query movement trajectories at scale, when a single Postgres node won’t do
As we now have two working solutions for measuring the billboard visibility: one in PostGIS and another one in MobilityDB, the next natural move is to apply these solutions to a big database of all bus trips in Brussels in the last year, and all billboards in Brussels. This amounts to roughly 5 million bus trips (roughly 5 billion GPS points) and a few thousand billboards. This size goes beyond what a single Postgres node can handle. Hence, we need to distribute the Postgres database.
This is a job for Citus, the extension to Postgres that transforms Postgres into a distributed database. Efficiently distributing the complex PostGIS query with many CTEs is a challenge we'll leave to the Citus engineering team.
What I want to discuss here is the distribution of the MobilityDB query. Citus does not know the types and operations of MobilityDB. So the distribution is limited by what Citus can do in general for custom types and functions. My colleague, Mohamed Bakli, has done this assessment and published it in a paper titled “Distributed moving object data management in MobilityDB” in the ACM BigSpatial workshop (preprint) and in a demo paper titled “Distributed Mobility Data Management in MobilityDB” in the IEEE MDM conference (preprint).
The papers presented a solution to distribute MobilityDB using Citus. All the nodes in the Citus database cluster had PostgreSQL, PostGIS, MobilityDB, and Citus installed. The goal was to assess to what extent the spatiotemporal functions in MobilityDB can be distributed.
To do this assessment, the BerlinMOD benchmark (a tool for comparing moving object databases) was used. BerlinMOD consists of a trajectory data generator, and 17 benchmark queries that assess the features of a moving object database system. It was possible to execute 13 of the 17 BerlinMOD benchmark queries on a MobilityDB database cluster that is managed by Citus, without special customization.
See also the illuminating blog post about using custom types with Citus & Postgres, by Nils Dijk.
Back to our MobilityDB billboard visibility query, our mission is to calculate the billboard visibility for all billboards and all common transport vehicles in Brussels for an entire year.
We had set up a Citus database cluster, and created the MobilityDB extension in all its nodes. Then we used the Citus `create_distributed_table` function to distribute the
busTrip table across all the worker nodes in the Citus database cluster. Next we made the billboard table a Citus reference table, and copied the reference table to all the worker nodes.
Here is the resulting distributed query plan:
The Citus distributed query executor parallelizes the query over all workers in the Citus cluster. Every node also has the MobilityDB extension, which means we can use MobilityDB functions such as
dwithin in the query and in the indexes. Here for example, we see that the SP-GiST index on the Citus worker is used to efficiently evaluate the
WHERE dwithin(...) clause.
With this, we come to the end of this post. To sum up, this post has two main takeaways:
If you’re ever looking to analyze movement trajectories to understand the spatiotemporal interaction of things across space and time, you now have a few new (open source!) options in your Postgres and PostGIS toolbox:
- MobilityDB can help you to manage and analyze geospatial (e.g. GPS, radar) movement trajectories in PostgreSQL.
- MobilityDB + Citus open source work together out of the box, so you can analyze geospatial movement trajectories at scale, too. Just add the two Postgres extensions (along with PostGIS) into your Postgres database, and you are ready to manage big geospatial trajectory datasets.
- Curious about the source of this data? The trajectory is for line 71 in Brussels, when it goes in front of my university campus ULB Solbosch. The public transport company in Brussels publishes an open API, where all the trajectories of their vehicles can be probed https://opendata.stib-mivb.be. The billboard location was invented by me, and the background map comes from OpenStreetMap. ↩
- It remains to compute the visibility duration, i.e., the difference in seconds between the two timestamps, which can be done by another CTE and window functions. Not to further complicate the query, we skip this detail here. ↩