Quadgrid Generation Using Recursive SQL Function

For something a little different, here is a PostGIS recursive SQL quadgrid function which has been in my toolbox for some time now. The inspiration came in 2010 when reading “Open Source GIS: A Grass GIS Approach” by Markus Neteler and Helena Mitasova (3rd edition, p.244). The quadgrid function works by recursively subdividing tiles into smaller tiles (quadcells) up to a maximum depth if say the number of intersecting points (or some other feature criteria) exceeds a certain threshold.

Quadgrids have many applications, but I’ve found them useful for mapping urban phenomena (such as population density), both in 2D and 3D. Since large cities tend to be packed with more people, the result is more quadcells packed into a given land area. Quadgrids are also computationally efficient, as a finer grid cell resolution is only used in the more populous areas. Below is an example of how I’ve used a population density quadgrid to create a 3D city skyline of Sydney. The taller the quadcells, the greater the number of people per square meter – no different to the actual built form of our cities where, if land is scarce, we build skyward. The recursive quadgrid function is available on github. It requires PostgreSQL9.3 or above.

The function is called in the same way as any other pl/pgsql function. For instance:

CREATE TABLE quadgrid AS
SELECT depth::integer, the_geom::geometry(Polygon, 3577) as wkb_geometry, cell_value::float8
FROM DE_RegularQuadGrid((SELECT wkb_geometry FROM abs_aus11_tiles_32k WHERE tid IN (17864)), ‘tutorials.abs_mb11_points’,’wkb_geometry’, 10, 1000);

The arguments of the quadgrid function are: the parent tile geometry, the name of the points feature table along with its geometry column name, the maximum depth of recursion, and the threshold number of points per cell above which a cell will sub-divide.

The function uses two helper functions, namely DE_MakeRegularQuadCells() and DE_MakeSquare(). They work in tandem by taking a square geometry and subdividing it into four child geometries.
https://github.com/dimensionaledge/cf_public/tree/master/shapes

I’ve also just refactored the quadgrid function to incorporate a new, highly efficient LATERAL ‘Point in Polygon’ code pattern which avoids the use of GROUP BY when counting points in polygons. The result is a 75% reduction in query times compared to the conventional CROSS JOIN and GROUP BY approach.

SELECT r.the_geom, r.pcount FROM
(SELECT DE_MakeRegularQuadCells(wkb_geometry) as the_geom FROM abs_aus11_tiles_32k WHERE tid IN (17864)) l,
LATERAL
(SELECT l.the_geom, count(*) as pcount FROM tutorials.abs_mb11_points WHERE ST_Intersects(l.the_geom, wkb_geometry) AND l.the_geom && wkb_geometry) r;

Quadgrid run times depend on the number of underlying point features, and the depth of recursion. In the above GIF animation, there are 1.86 million points alone that intersect with the parent tile, making it one of the most populous areas in Australia. For this exercise, the time to ‘quadgrid’ this one tile took 19.9 seconds to a depth of 10. Less populated tiles tend to take only fractions of a second.

1. ---------------------------------------------------------------------------
2. -- Code Desciption:
3. ---------------------------------------------------------------------------
4. -- PostgreSQL/PostGIS custom function for generating quadcells recursively from a given starting geometry and intersecting reference table, to a maximum number of iteration levels or threshold value per cell
5. -- Dependencies: DE_MakeSquare(), DE_MakeRegularQuadCells()
6. -- Developed by: mark[a]dimensionaledge[dot]com
7. -- Licence: GNU GPL version 3.0
8. ---------------------------------------------------------------------------
9.
10. DROP FUNCTION IF EXISTS DE_RegularQuadGrid(geometry, text, text, integer, double precision);
11. CREATE OR REPLACE FUNCTION DE_RegularQuadGrid(parent_geom geometry, reference_table text, reference_geom_col text, max_depth integer, threshold_value double precision)
12. RETURNS TABLE (depth integer, the_geom GEOMETRY, cell_value double precision)
13. AS \$\$
14. DECLARE
15. reference_geom_type text;
16.
17. BEGIN
18. EXECUTE 'SELECT GeometryType('|| reference_geom_col ||') FROM '|| reference_table ||' LIMIT 1' INTO reference_geom_type ;
19.
20. IF reference_geom_type NOT IN ('POINT')
21. THEN
22. RAISE EXCEPTION 'Reference table is not a valid geometry type';
23. ELSE
24. END IF;
25.
26. RETURN QUERY EXECUTE
27. 'WITH RECURSIVE quadcells (depth, the_geom, cell_value) AS (
28. --SEED THE PARENT GEOMETRY AND CELL VALUE
29. SELECT 1, l.the_geom, r.pcount
30. FROM (SELECT ST_GeomFromEWKT(ST_AsEWKT('|| quote_literal(CAST(parent_geom as text)) ||')) as the_geom) l,
31. LATERAL
32. (SELECT count(*) as pcount, l.the_geom FROM '|| reference_table ||' WHERE ST_Intersects(l.the_geom, '|| reference_geom_col ||') AND l.the_geom && '|| reference_geom_col ||') r
33. --RECURSIVE PART
34. UNION ALL
35. SELECT t.depth, t.the_geom, t.pcount
36. FROM
37. --TERMINAL CONDITION SUBQUERY LOOPS UNTIL THE CONDITIONS ARE NO LONGER MET - NOTE THE RECURSIVE ELEMENT CAN ONLY BE EXPLICITYLY REFERRED TO ONCE, HENCE THE USE OF CTE
38. (
39. WITH a AS (SELECT * FROM quadcells WHERE the_geom IS NOT NULL AND depth < '|| max_depth ||' AND cell_value > '|| threshold_value ||'),
40. b AS (SELECT max(depth) as previous FROM a),
41. c AS (SELECT a.* FROM a,b WHERE a.depth = b.previous),
42. d AS (SELECT r.the_geom, r.pcount FROM (SELECT DE_MakeRegularQuadCells(the_geom) as the_geom FROM c) l, LATERAL (SELECT count(*) as pcount, l.the_geom FROM '|| reference_table ||' WHERE ST_Intersects(l.the_geom, '|| reference_geom_col ||') AND l.the_geom && '|| reference_geom_col ||') r)
43. SELECT b.previous+1 as depth, d.the_geom, d.pcount FROM b, d
44. ) t
45. )
46. SELECT depth, the_geom, cell_value::float8 FROM quadcells WHERE ST_IsEmpty(the_geom)=false AND (cell_value <= '|| threshold_value ||' OR (cell_value > '|| threshold_value ||' AND depth = '|| max_depth||'))' ;
47.
48. END;
49. \$\$ LANGUAGE 'plpgsql' VOLATILE;
50.
1. ---------------------------------------------------------------------------
2. -- Code Desciption:
3. ---------------------------------------------------------------------------
4. -- PostgreSQL/PostGIS custom function for subdividing a square polygon into four child polygons
5. -- Dependencies: DE_MakeSquare()
6. -- Developed by: mark[at]dimensionaledge[dot]com
7. -- Licence: GNU GPL version 3.0
8. ---------------------------------------------------------------------------
9. -- Usage Example:
10. ---------------------------------------------------------------------------
11. -- SELECT DE_MakeRegularQuadCells(DE_MakeSquare(ST_MakePoint(0,0),1));
12. ---------------------------------------------------------------------------
13.
14. CREATE OR REPLACE FUNCTION DE_MakeRegularQuadCells(parent GEOMETRY)
15. RETURNS SETOF GEOMETRY
16. AS \$\$
17. DECLARE
18. halfside float8;
19. i INTEGER DEFAULT 1;
20. srid INTEGER;
21. centerpoint GEOMETRY;
22. centersquare GEOMETRY;
24.
25. BEGIN
26. srid := ST_SRID(parent);
27. centerpoint := ST_Centroid(parent);
28. halfside := abs(ST_Xmax(parent) - ST_Xmin(parent))/2;
29. centersquare := ST_ExteriorRing(DE_MakeSquare(centerpoint, halfside));
30.
31. WHILE i < 5 LOOP
32. quadcell := DE_MakeSquare(ST_PointN(centersquare, i), halfside);
33. RETURN NEXT quadcell;
34. i := i + 1;
35. END LOOP;
36.
37. RETURN;
38. END
39. \$\$ LANGUAGE 'plpgsql' IMMUTABLE;
40.
DE_MakeSquare.sql
1. ---------------------------------------------------------------------------
2. -- Code Desciption:
3. ---------------------------------------------------------------------------
4. -- PostgreSQL/PostGIS custom function for generating a square polygon of a specified size
5. -- Dependencies: nil
6. -- Developed by: mark[at]dimensionaledge[dot]com
7. -- Licence: GNU GPL version 3.0
8. ---------------------------------------------------------------------------
9. -- Usage Example:
10. ---------------------------------------------------------------------------
11. -- SELECT DE_MakeSquare(ST_MakePoint(0,0),1);
12. ---------------------------------------------------------------------------
13.
14. CREATE OR REPLACE FUNCTION DE_MakeSquare(centerpoint GEOMETRY, side FLOAT8)
15. RETURNS GEOMETRY
16. AS \$\$
17. SELECT ST_SetSRID(ST_MakePolygon(ST_MakeLine(
18. ARRAY[
19. st_makepoint(ST_X(centerpoint)-0.5*side, ST_Y(centerpoint)+0.5*side),
20. st_makepoint(ST_X(centerpoint)+0.5*side, ST_Y(centerpoint)+0.5*side),
21. st_makepoint(ST_X(centerpoint)+0.5*side, ST_Y(centerpoint)-0.5*side),
22. st_makepoint(ST_X(centerpoint)-0.5*side, ST_Y(centerpoint)-0.5*side),
23. st_makepoint(ST_X(centerpoint)-0.5*side, ST_Y(centerpoint)+0.5*side)
24. ]
25. )),ST_SRID(centerpoint));
26. \$\$ LANGUAGE 'sql' IMMUTABLE STRICT;
27.

Bezier Curves – A More Flexible Alternative to Great Circles

This tutorial introduces a flexible method for generating flight paths in PostGIS using a custom bezier curve function. Unlike Great Circles, bezier curves provide the user the ability to set the amount of ‘flex’ or ‘bend’ exhibited by the curve, the number of vertices (useful for timeseries animations), and a break value at a predefined meridian to ensure the visual integrity of the curve is maintained when plotting in your favourite projection. I also have a 3D version of the bezier curve function which I’ll share at a later date.

The full code for this tutorial, including the QGIS technique for centering a world map in the Pacific Ocean, is available on github here. I won’t repeat everything which is already documented in the github tutorial code and custom bezier curve function, except make mention of the five arguments used by the bezier curve function itself.

DE_BezierCurve2D(origin geometry, destination geometry, theta numeric, num_points integer, breakval numeric)

1. origin geometry
2. destination geometry
3. theta, which is the maximum bend in the East-West plane
4. the number of Linestring points or vertices
5. the break value which splits the Linestring into a Multilinestring at a specified meridian

Example: DE_BezierCurve2D(o_geom, d_geom, 25, 1000, 30)

Origin and destination geometries should be self-explanatory.

Theta defines the maximum degree of ‘flex’ or ‘bend’ in the East-West plane, to which a SIN function is applied to produce the effect of straighter lines in the North-South plane. If you prefer that all curves exhibit the same degree of bend irrespective of the azimuth between their origins and destinations, then modify the bezier function by replacing SIN(ST_Azimuth(ST_Point(o_x, o_y), ST_Point(d_x, d_y)) with the value ‘1’. Or if you want to stipulate a minimum bend in the North-South plane, then enclose the associated formula within a PostgreSQL ‘greatest()’ function.

The number of Linestring points or vertices is a really handy feature for animating flight paths where you need to synchronise movements along flight paths using a clock. To do this, set the number of points equal to the flight duration between each origin destination pair. So a 60 minute flight gets 60 points or vertices. A 120 minute flight gets 120 points or vertices. Then dump the flight path vertices into a points table which you can then index to a flight timetable.

The last argument is the break value, which is the meridian at which the flight path will be broken into two Linestrings (thus becoming a Multilinestring). The break value should equal the PROJ4 ‘+lon_0’ value associated with the CRS which you intend to view the data in, plus or minus 180 degrees. An example is given in the github tutorial file.

At some point I’ll release my 3D version of this function which is very cool for making 3D flight path animations.

Enjoy!

From Days to Minutes – Geoprocessing of Alberta Land Use Data

When writing my first blog post “Eating the Elephant In Small Bites”, a question on the PostGIS mailing list caught my eye because the challenge posed was not dissimilar to something I once encountered when working with Australian land use data. This time the dataset related to land cover in Alberta, Canada, and the PostGIS query using ST_Interection() was reportedly taking many, many days to process one particular class of land cover. Fortunately the Alberta dataset is freely available, so I thought to test my PostGIS vector tiling and Map Reduce code pattern – and with great effect! This post describes how I addressed the Alberta land class problem, building upon the concepts and practices introduced in my first blog post to deliver query times of just minutes. As with the “Eating the Elephant in Small Bites” tutorial, our machine setup for this exercise is a CentOS6.5 Linux server instance running on AWS, with 16vCPUs and 30GB of memory.

The full code for this solution can be downloaded from github here

Data Understanding

Upon downloading and ingesting the Alberta land cover dataset into PostGIS, the problem becomes self-evident. The land class ‘lc_class 34’ comprises the least number of geometries, but with an average of 119 rings per polygon.

SELECT lc_class, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points, round(SUM(ST_NRings(wkb_geometry))::numeric/SUM(ST_NumGeometries(wkb_geometry))::numeric,3) as rings_per_geom FROM lancover_polygons_2010 GROUP BY lc_class ORDER BY 1;

lc_class | num_geoms | num_points | rings_per_geom
----------+-----------+------------+----------------
20 |     83793 |    1306686 |          1.029
31 |      1262 |      33366 |          1.115
32 |      5563 |     291666 |          1.560
33 |     12231 |     198385 |          1.023
34 |       366 |    2646625 |        119.046
50 |    196681 |    4750590 |          1.165
110 |    154816 |    3137196 |          1.078
120 |     83069 |    2833293 |          1.282
210 |    150550 |    6260788 |          1.522
220 |    197666 |    4793592 |          1.150
230 |    112034 |    2419651 |          1.045
(11 rows)

Visually, this is how ‘lc_class 34’ appears in QGIS with green fill.  Upon closer inspection, we see ‘lc_class 34’ has a trellis-like structure where each contiguous block of white space (i.e. the “negative space”) is a polygon inner ring. And therein lies the problem. Like with multipolygons, complex polygons with inner rings are stored as arrays, meaning queries are often slowed by the nature of array operations, plus indexes are less effective than if each array element is dumped and stored as its own row. But luckily for us, there are standard PostGIS functions we can use to address this.

Solution

The solution to the Alberta ‘lc_class 34’ problem I will describe sequentially, starting with how to prepare the data by dumping the multipolygons and polyon rings, followed by a description of the fn_worker SQL code pattern. Finally we parallelise the code across all available cores.

Data preparation involves a two step “dumping” process, starting with ST_Dump() of the multipolygons, followed by ST_DumpRings() of the polygons dumped by step 1. Note that we record the path number of each polygon ring that is dumped. Exterior rings by default have a path value = 0, whilst interior rings have a path value > 0 signifying the interior ring number.

1. # Create PostGIS tables (with indexes) for the dumped polygons and their rings
2. SQL=\$(cat<<EOF
3. -------------------------
4. ------- SQL BLOCK -------
5. -------------------------
6. DROP TABLE IF EXISTS landcover_dumped_34;
7. CREATE TABLE landcover_dumped_34 (
8. fid serial,
9. ogc_fid integer,
10. wkb_geometry geometry(Polygon, 3400),
11. lc_class integer,
12. mod_ty text,
13. shape_length float8,
14. shape_area float8
15. );
16.
17. INSERT INTO landcover_dumped_34
18. SELECT
19. NEXTVAL('landcover_dumped_34_fid_seq'),
20. ogc_fid,
21. (ST_Dump(wkb_geometry)).geom,
22. lc_class,
23. mod_ty,
24. shape_length,
25. shape_area
26. FROM lancover_polygons_2010 WHERE lc_class = 34;
27.
28. DROP TABLE IF EXISTS landcover_dumped_34_rings;
29. CREATE TABLE landcover_dumped_34_rings (
30. fid serial,
31. ogc_fid integer,
32. path integer,
33. wkb_geometry geometry(Polygon, 3400)
34. );
35.
36. INSERT INTO landcover_dumped_34_rings
37. WITH s AS (SELECT ogc_fid, (ST_DumpRings(wkb_geometry)).path as path, ST_Buffer(ST_SnapToGrid((ST_DumpRings(wkb_geometry)).geom,0.1)) as the_geom FROM  landcover_dumped_34)  --SNAP and BUFFER TO MAKEVALID
38. SELECT
39. NEXTVAL('landcover_dumped_34_rings_fid_seq'),
40. ogc_fid,
41. path,
42. the_geom
43. FROM s;
44.
45. CREATE INDEX landcover_dumped_34_rings_wkb_geometry_idx ON landcover_dumped_34_rings USING GIST(wkb_geometry);
46. ANALYZE landcover_dumped_34_rings;
47.
48. -------------------------
49. EOF
50. )
51. echo "\$SQL"  # comment to suppress printing
52. # execute SQL STATEMENT or comment # to skip
53. psql -U \$username -d \$dbname -c "\$SQL"  ### alternatively, comment out line with single '#' to skip this step
# Create PostGIS tables (with indexes) for the dumped polygons and their rings
SQL=\$(cat<<EOF
-------------------------
------- SQL BLOCK -------
-------------------------
DROP TABLE IF EXISTS landcover_dumped_34;
CREATE TABLE landcover_dumped_34 (
fid serial,
ogc_fid integer,
wkb_geometry geometry(Polygon, 3400),
lc_class integer,
mod_ty text,
shape_length float8,
shape_area float8
);

INSERT INTO landcover_dumped_34
SELECT
NEXTVAL('landcover_dumped_34_fid_seq'),
ogc_fid,
(ST_Dump(wkb_geometry)).geom,
lc_class,
mod_ty,
shape_length,
shape_area
FROM lancover_polygons_2010 WHERE lc_class = 34;

DROP TABLE IF EXISTS landcover_dumped_34_rings;
CREATE TABLE landcover_dumped_34_rings (
fid serial,
ogc_fid integer,
path integer,
wkb_geometry geometry(Polygon, 3400)
);

INSERT INTO landcover_dumped_34_rings
WITH s AS (SELECT ogc_fid, (ST_DumpRings(wkb_geometry)).path as path, ST_Buffer(ST_SnapToGrid((ST_DumpRings(wkb_geometry)).geom,0.1)) as the_geom FROM  landcover_dumped_34)  --SNAP and BUFFER TO MAKEVALID
SELECT
NEXTVAL('landcover_dumped_34_rings_fid_seq'),
ogc_fid,
path,
the_geom
FROM s;

CREATE INDEX landcover_dumped_34_rings_wkb_geometry_idx ON landcover_dumped_34_rings USING GIST(wkb_geometry);
ANALYZE landcover_dumped_34_rings;

-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip
psql -U \$username -d \$dbname -c "\$SQL"  ### alternatively, comment out line with single '#' to skip this step

Next we generate a grid layout to serve as the basis for our tiles. For this exercise, we will use a 2km x 2km tile size as we wish to keep our "chopped" up geometries fairly small to enable fast queries of the 'lc_class 34' tiled geometries at a later date.

1. #######################################
2. #########  GRID CREATION  ##########
3. #######################################
4. # bash function for creating a regular vector grid in PostGIS
5. fn_generategrid() {
6. # get sql custom functions
7. wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/lattices/DE_RegularGrid.sql -O DE_RegularGrid.sql
8. wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/shapes/DE_MakeSquare.sql -O DE_MakeSquare.sql
9. # load sql custom functions
10. for i in *.sql; do
11. psql -U \$username -d \$dbname -f \$i
12. done
13.
14. SQL=\$(cat<<EOF
15. -------------------------
16. ------- SQL BLOCK -------
17. -------------------------
18. DROP TABLE IF EXISTS regular_grid_2k;
19. CREATE TABLE regular_grid_2k AS (
20. WITH s AS (SELECT DE_RegularGrid(ST_Envelope(ST_Collect(wkb_geometry)),2000) as wkb_geometry FROM abmiw2wlcv_48tiles)
21. SELECT row_number() over() as tid, wkb_geometry::geometry(Polygon, 3400) FROM s);
22.
23. -------------------------
24. EOF
25. )
26. echo "\$SQL"  # comment to suppress printing
27. # execute SQL STATEMENT or comment # to skip
28. psql -U \$username -d \$dbname -c "\$SQL"
29. }
30. # end of bash function
31.
32. # call the bash function or comment # to skip
33. fn_generategrid
34.
35. #######################################
#######################################
#########  GRID CREATION  ##########
#######################################
# bash function for creating a regular vector grid in PostGIS
fn_generategrid() {
# get sql custom functions
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/lattices/DE_RegularGrid.sql -O DE_RegularGrid.sql
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/shapes/DE_MakeSquare.sql -O DE_MakeSquare.sql
# load sql custom functions
for i in *.sql; do
psql -U \$username -d \$dbname -f \$i
done

SQL=\$(cat<<EOF
-------------------------
------- SQL BLOCK -------
-------------------------
DROP TABLE IF EXISTS regular_grid_2k;
CREATE TABLE regular_grid_2k AS (
WITH s AS (SELECT DE_RegularGrid(ST_Envelope(ST_Collect(wkb_geometry)),2000) as wkb_geometry FROM abmiw2wlcv_48tiles)
SELECT row_number() over() as tid, wkb_geometry::geometry(Polygon, 3400) FROM s);

-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip
psql -U \$username -d \$dbname -c "\$SQL"
}
# end of bash function

# call the bash function or comment # to skip
fn_generategrid

#######################################

We also create a vector tile table to hold the outputs of our worker function. Note that the worker function does the tiling of the exterior rings separately to the interior rings. The unions of each are taken before applying ST_Difference to remove the "negative space" of the interior rings from each tile.

1. #######################################
2. #####  DEFINE WORKER FUNCTION  ######
3. #######################################
4. # define the worker function to be executed across all cores
5. fn_worker (){
6. source \$dbsettings
7. SQL=\$(cat<<EOF
8. -------------------------
9. ----- SQL STATEMENT -----
10. -------------------------
11. INSERT INTO vector_tiles
12. WITH
13. f0 AS (
14.         SELECT
15.         tid,
16.         wkb_geometry as the_geom
17.         FROM regular_grid_2k
18.         WHERE tid >= \$1 AND tid < \$2
19.         ),
20. f1_p0 AS (
21.         SELECT
22.         f0.tid,
23.         CASE WHEN ST_Within(f0.the_geom,rt.wkb_geometry) THEN f0.the_geom
24.         ELSE ST_Intersection(f0.the_geom,rt.wkb_geometry) END as the_geom
25.         FROM f0, \$3 as rt
26.         WHERE ST_Intersects(f0.the_geom, rt.wkb_geometry) AND f0.the_geom && rt.wkb_geometry AND rt.path = 0
27.         ),
28. f1_p0u AS (
29.         SELECT
30.         tid,
31.         ST_Union(the_geom) as the_geom
32.         FROM f1_p0
33.         GROUP BY tid
34.         ),
35. f1_p1 AS (
36.         SELECT
37.         f0.tid,
38.         CASE WHEN ST_Within(f0.the_geom,rt.wkb_geometry) THEN f0.the_geom
39.         ELSE ST_Intersection(f0.the_geom,rt.wkb_geometry)
40.         END as the_geom
41.         FROM f0, \$3 as rt
42.         WHERE ST_Intersects(f0.the_geom, rt.wkb_geometry) AND f0.the_geom && rt.wkb_geometry AND rt.path > 0
43.         ),
44. f1_p1u AS (
45.         SELECT
46.         tid,
47.         ST_Union(the_geom) as the_geom
48.         FROM f1_p1
49.         GROUP BY tid
50.         ),
51. f2 AS (
52.         SELECT
53.         f1_p0u.tid,
54.         CASE WHEN  f1_p1u.tid IS NULL THEN f1_p0u.the_geom
55.         WHEN ST_IsEmpty(ST_Difference(f1_p0u.the_geom,f1_p1u.the_geom)) THEN NULL
56.         ELSE ST_Difference(f1_p0u.the_geom,f1_p1u.the_geom)
57.         END as the_geom
58.         FROM f1_p0u LEFT JOIN  f1_p1u
59.         ON f1_p0u.tid = f1_p1u.tid
60.         )
61. ------------------------
62. ---- result ----
63. ------------------------
64. SELECT
65. NEXTVAL('vector_tiles_fid_seq'),
66. tid,
67. ST_Multi(the_geom),
68. 1
69. FROM f2
70. WHERE the_geom IS NOT NULL;
71. -------------------------
72. EOF
73. )
74. echo "\$SQL"  # comment to suppress printing
75. # execute SQL STATEMENT
76. psql -U \$username -d \$dbname -c "\$SQL"
77. }
78. # end of worker function
79.
80. # make worker function visible to GNU Parallel across all cores
81. export -f fn_worker
82.
83. #######################################
#######################################
#####  DEFINE WORKER FUNCTION  ######
#######################################
# define the worker function to be executed across all cores
fn_worker (){
source \$dbsettings
SQL=\$(cat<<EOF
-------------------------
----- SQL STATEMENT -----
-------------------------
INSERT INTO vector_tiles
WITH
f0 AS (
SELECT
tid,
wkb_geometry as the_geom
FROM regular_grid_2k
WHERE tid >= \$1 AND tid < \$2
),
f1_p0 AS (
SELECT
f0.tid,
CASE WHEN ST_Within(f0.the_geom,rt.wkb_geometry) THEN f0.the_geom
ELSE ST_Intersection(f0.the_geom,rt.wkb_geometry) END as the_geom
FROM f0, \$3 as rt
WHERE ST_Intersects(f0.the_geom, rt.wkb_geometry) AND f0.the_geom && rt.wkb_geometry AND rt.path = 0
),
f1_p0u AS (
SELECT
tid,
ST_Union(the_geom) as the_geom
FROM f1_p0
GROUP BY tid
),
f1_p1 AS (
SELECT
f0.tid,
CASE WHEN ST_Within(f0.the_geom,rt.wkb_geometry) THEN f0.the_geom
ELSE ST_Intersection(f0.the_geom,rt.wkb_geometry)
END as the_geom
FROM f0, \$3 as rt
WHERE ST_Intersects(f0.the_geom, rt.wkb_geometry) AND f0.the_geom && rt.wkb_geometry AND rt.path > 0
),
f1_p1u AS (
SELECT
tid,
ST_Union(the_geom) as the_geom
FROM f1_p1
GROUP BY tid
),
f2 AS (
SELECT
f1_p0u.tid,
CASE WHEN  f1_p1u.tid IS NULL THEN f1_p0u.the_geom
WHEN ST_IsEmpty(ST_Difference(f1_p0u.the_geom,f1_p1u.the_geom)) THEN NULL
ELSE ST_Difference(f1_p0u.the_geom,f1_p1u.the_geom)
END as the_geom
FROM f1_p0u LEFT JOIN  f1_p1u
ON f1_p0u.tid = f1_p1u.tid
)
------------------------
---- result ----
------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
1
FROM f2
WHERE the_geom IS NOT NULL;
-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT
psql -U \$username -d \$dbname -c "\$SQL"
}
# end of worker function

# make worker function visible to GNU Parallel across all cores
export -f fn_worker

#######################################

Next we create a job list which we proceed to execute in parallel. And voilà!

1. #######################################
2. ##########  CREATE JOB LIST  ###########
3. #######################################
4. # create job list to feed GNU Parallel.
5. SQL=\$(cat<<EOF
6. -------------------------
7. ------- SQL BLOCK -------
8. -------------------------
9. -- create joblist where block size = 1000 tiles (i.e. tiles processed in batches of 1000)
10. COPY (SELECT i as lower, i+1000 as upper FROM generate_series(1,250000,1000) i) TO STDOUT WITH CSV;
11. -------------------------
12. EOF
13. )
14. echo "\$SQL"  # comment to suppress printing
15. # execute SQL STATEMENT
16. psql -U \$username -d \$dbname -c "\$SQL" > joblist.csv
17.
18. #######################################
19.
20. #######################################
21. ##########  EXECUTE JOBS  ###########
22. #######################################
23. cat joblist.csv | parallel --colsep ',' fn_worker {1} {2} landcover_dumped_34_rings
24. wait
25. #######################################
#######################################
##########  CREATE JOB LIST  ###########
#######################################
# create job list to feed GNU Parallel.
SQL=\$(cat<<EOF
-------------------------
------- SQL BLOCK -------
-------------------------
-- create joblist where block size = 1000 tiles (i.e. tiles processed in batches of 1000)
COPY (SELECT i as lower, i+1000 as upper FROM generate_series(1,250000,1000) i) TO STDOUT WITH CSV;
-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT
psql -U \$username -d \$dbname -c "\$SQL" > joblist.csv

#######################################

#######################################
##########  EXECUTE JOBS  ###########
#######################################
cat joblist.csv | parallel --colsep ',' fn_worker {1} {2} landcover_dumped_34_rings
wait
#######################################

Results of Query Times

The results show the impact of a larger tile size on query run times, and the resultant complexity of the tiled geometries produced (measured by the rings_per_geom). Whilst quadrupling the area of the tile size (from 2x2 to 4x4) reduces the run time from 779 seconds to 282 seconds, the rings_per_geom increases - which is understandable given the trellis-like structure of the source data. Both scenarios offer a vast improvement on the many, many days it reportedly took to process the data using a standard ST_Intersection query. As to the acceptability of the tile-size trade-off, I think it really depends on the criticality of query times with respect to the downstream processes that will ultimately consume the tiled geometries. It thus becomes a matter of judgement.

2k x 2k tiles
TOTAL SCRIPT TIME: 779 (batch size =1000 tiles)
num_geoms | num_points | rings_per_geom
-----------+------------+----------------
144799 |    3471051 |          1.055

4k x 4k tiles
TOTAL SCRIPT TIME: 282 (batch size =250 tiles)
num_geoms | num_points | rings_per_geom
-----------+------------+----------------
57775 |    2966397 |          1.263 Conclusions

The Alberta land cover problem highlights a few of the challenges that analysts typically encounter when working with large, complex datasets. The solution presented here is a practical example of how vector tiling and Map Reduce concepts can deliver real geo-processing efficiency improvements. However the real business value of reducing query times - from days to minutes - is to accelerate the 'business time to insight' and to amplify the speed or volume of iterations the analyst can consider as part of the geospatial 'value discovery' process. As this exercise attests, the business impact that Spatial IT can make over traditional desktop GIS approaches in the context of ever-shortening decision windows is nothing short of 'game changing'. Welcome to the emerging world of Big Data.

Eating the Elephant in Small Bites – An Introduction to Vector Tiling and Map Reduce in PostGIS

“Stop trying to eat the elephant with one bite” This tutorial introduces two concepts which I’ve found invaluable when working with “large” datasets in PostGIS.  The first concept is that of vector tiling, a method whereby large geometries are chunked down into smaller geometries using a regular grid or tile structure to accelerate the processing times of computationally intensive queries.  The second concept is Map Reduce, a scalable programming paradigm for the processing and analysis of large datasets with parallel, distributed algorithms on multi-core platforms.

In this tutorial, we apply these techniques to variations of a dataset containing large multipolygons. We compare and contrast different scenarios for producing the tiles in order to measure the impact of different strategies on query times. Our findings show query times can be substantially reduced, in this case by a factor of 7, by first “dumping” each member of the polygon collection into its own row. A further 7-fold reduction in query times can be obtained by parallelising the query in batches of 500 tiles across all available cores.

We believe that the on-going development of parallel processing techniques for PostgreSQL/PostGIS can only serve to enhance the power of these tools in their dual capacity as a database backend and as a stand-alone geo-processing engine.

Our machine setup for this tutorial is a CentOS6.5 Linux instance running on AWS, with 16vCPUs and 30GB of memory. For Windows users and Linux users who prefer not to script in bash, the parallel execution component of this tutorial can be achieved using Python.

The full code for this tutorial can be downloaded from github here

Data Modelling Objectives

Vector tiles are a recent phenomena often applied within the context of web mapping. Much like raster tiles, vector tiles are simply vector representations of geographic data whereby vector features are clipped to the boundary of each tile. This makes the downstream processing or consumption of these features by a web mapping application, or as with the focus of this tutorial, a PostGIS query more efficient.

We will demonstrate a process for generating vector tiles in PostGIS using a dataset representing the Australian continent – but with a twist. We also require vector features representing marine waters. Put succinctly, our data modelling objectives are to create:

• unclipped tile geometries
• tiled features representing land (i.e. clipped to the coast)
• tiled features representing marine waters (i.e. the difference not represented by land) Data Preparation

The source dataset is a shapefile containing the SA2 statistical boundaries of Australia.  The shapefile is freely available for download from: here. The steps for downloading and ingesting the data into PostgreSQL are described below. Note that our database connection settings are stored as variables in a shell script called “current.sh” which is sourced at the top of the code block. I have functionalised parts of the code to allow one-off steps to be easily skipped (e.g. loading the data) in the event the script is being run multiple times. You will also see that within the SQL code block, we create three variations of the source table to enable us to measure the incremental performance gain by “dumping” the multipolygon elements pre-tiling.

1. #!/bin/bash
2. # A tutorial that introduces the concepts of vector tiling and Map Reduce in PostGIS
3. # Written by: mark[at]dimensionaledge[dot]com
4. # Licence: GNU GPL version 3.0
5. # Start timer
6. START_T1=\$(date +%s)
7.
8. # DB Connection Settings
9. dbsettings="/mnt/data/common/repos/cf_private/settings/current.sh"
10. export dbsettings
11. source \$dbsettings
12.
13. #######################################
14. ######### DATA PREPARATION ##########
15. #######################################
16. # bash function for downloading and ingesting required data into PostgreSQL
17. fn_getdata() {
18. wget "http://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&1270055001_sa2_2011_aust_shape.zip&1270.0.55.001&Data%20Cubes&7130A5514535C5FCCA257801000D3FBD&0&July%202011&23.12.2010&Latest" -O sa211.zip
19. unzip sa211.zip
20. # Load dataset into PostgreSQL
21. ogr2ogr -f "PostgreSQL" PG:"host=\$host user=\$username password=\$password dbname=\$dbname" SA2_2011_AUST.shp -nln  abs_sa211_multi -s_srs EPSG:4283 -t_srs EPSG:3577 -a_srs EPSG:3577 -nlt MULTIPOLYGON -overwrite
22. }
23. # end of bash function
24. # call the bash function or comment # to skip
25. fn_getdata
26.
27. # Create three additional PostGIS tables (with indexes) representing the union and dumped constituents of all SA2 geometries
28. SQL=\$(cat<<EOF
29. -------------------------
30. ------- SQL BLOCK -------
31. -------------------------
32. DROP TABLE IF EXISTS abs_sa211_dumped;
33. CREATE TABLE abs_sa211_dumped AS
34. SELECT row_number() over () as ogc_fid, sa2_main11::text as poly_id, (ST_Dump(wkb_geometry)).geom::geometry(Polygon, 3577) as wkb_geometry FROM abs_sa211_multi;
35. CREATE INDEX abs_sa211_dumped_geom_idx ON abs_sa211_dumped USING GIST(wkb_geometry);
36.
37. DROP TABLE IF EXISTS abs_aus11_multi;
38. CREATE TABLE abs_aus11_multi AS
39. SELECT 1 as ogc_fid, ST_Multi(ST_Union(wkb_geometry))::geometry(Multipolygon, 3577) as wkb_geometry FROM abs_sa211;
40. CREATE INDEX abs_aus11_multi_geom_idx ON abs_aus11_multi USING GIST(wkb_geometry);
41.
42. DROP TABLE IF EXISTS abs_aus11_dumped;
43. CREATE TABLE abs_aus11_dumped AS
44. SELECT row_number() over () as ogc_fid, (ST_Dump(wkb_geometry)).geom::geometry(Polygon, 3577) as wkb_geometry FROM abs_aus11_multi;
45. CREATE INDEX abs_aus11_dumped_geom_idx ON abs_aus11_dumped USING GIST(wkb_geometry);
46. -------------------------
47. EOF
48. )
49. echo "\$SQL"  # comment to suppress printing
50. # execute SQL STATEMENT or comment # to skip
51. #psql -U \$username -d \$dbname -c "\$SQL"  ### alternatively, comment out line with single '#' to skip this step
52.
53. #######################################
#!/bin/bash
# A tutorial that introduces the concepts of vector tiling and Map Reduce in PostGIS
# Written by: mark[at]dimensionaledge[dot]com
# Licence: GNU GPL version 3.0
# Start timer
START_T1=\$(date +%s)

# DB Connection Settings
dbsettings="/mnt/data/common/repos/cf_private/settings/current.sh"
export dbsettings
source \$dbsettings

#######################################
######### DATA PREPARATION ##########
#######################################
# bash function for downloading and ingesting required data into PostgreSQL
fn_getdata() {
wget "http://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&1270055001_sa2_2011_aust_shape.zip&1270.0.55.001&Data%20Cubes&7130A5514535C5FCCA257801000D3FBD&0&July%202011&23.12.2010&Latest" -O sa211.zip
unzip sa211.zip
# Load dataset into PostgreSQL
ogr2ogr -f "PostgreSQL" PG:"host=\$host user=\$username password=\$password dbname=\$dbname" SA2_2011_AUST.shp -nln  abs_sa211_multi -s_srs EPSG:4283 -t_srs EPSG:3577 -a_srs EPSG:3577 -nlt MULTIPOLYGON -overwrite
}
# end of bash function
# call the bash function or comment # to skip
fn_getdata

# Create three additional PostGIS tables (with indexes) representing the union and dumped constituents of all SA2 geometries
SQL=\$(cat<<EOF
-------------------------
------- SQL BLOCK -------
-------------------------
DROP TABLE IF EXISTS abs_sa211_dumped;
CREATE TABLE abs_sa211_dumped AS
SELECT row_number() over () as ogc_fid, sa2_main11::text as poly_id, (ST_Dump(wkb_geometry)).geom::geometry(Polygon, 3577) as wkb_geometry FROM abs_sa211_multi;
CREATE INDEX abs_sa211_dumped_geom_idx ON abs_sa211_dumped USING GIST(wkb_geometry);

DROP TABLE IF EXISTS abs_aus11_multi;
CREATE TABLE abs_aus11_multi AS
SELECT 1 as ogc_fid, ST_Multi(ST_Union(wkb_geometry))::geometry(Multipolygon, 3577) as wkb_geometry FROM abs_sa211;
CREATE INDEX abs_aus11_multi_geom_idx ON abs_aus11_multi USING GIST(wkb_geometry);

DROP TABLE IF EXISTS abs_aus11_dumped;
CREATE TABLE abs_aus11_dumped AS
SELECT row_number() over () as ogc_fid, (ST_Dump(wkb_geometry)).geom::geometry(Polygon, 3577) as wkb_geometry FROM abs_aus11_multi;
CREATE INDEX abs_aus11_dumped_geom_idx ON abs_aus11_dumped USING GIST(wkb_geometry);
-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip
#psql -U \$username -d \$dbname -c "\$SQL"  ### alternatively, comment out line with single '#' to skip this step

####################################### If we examine the underlying structure of our source tables, we find that each table looks fundamentally different in terms of the number of rows, geometries and points. The extreme case is the abs_aus11_multi table, which comprises one (1) row representing the entire Australian continent. The relevance of querying large multipolygons as well as other key differences will become clearer when we compare the query times for producing our vector tiles.

--abs_sa211_multi
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_sa211_multi;
num_rows | num_geoms | num_points
----------+-----------+------------
2214 |      8917 |    4164582

--abs_sa211_dumped
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_sa211_dumped;
num_rows | num_geoms | num_points
----------+-----------+------------
8917 |      8917 |    4164582

--abs_aus11_multi
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_multi;
num_rows | num_geoms | num_points
----------+-----------+------------
1 |      6718 |    1622598

--abs_aus11_dumped
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_dumped;
num_rows | num_geoms | num_points
----------+-----------+------------
6718 |      6718 |    1622598

Grid Creation

Next we create a 32km x 32km regular grid or lattice structure that becomes the stencil for each tile. There are many ways to create a lattice structure in PostGIS, but since we work with grids on an almost daily basis, we utilise two custom SQL functions that can also be freely downloaded from our public Github repository. We create a table "regular_grid_32k" to hold the geometries returned by the custom function DE_RegularGrid(), together with a tile_id (which we refer to as a "tid").

1. #######################################
2. #########  GRID CREATION  ##########
3. #######################################
4. # bash function for creating a regular vector grid in PostGIS
5. fn_generategrid() {
6. # get sql custom functions
7. wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/lattices/DE_RegularGrid.sql -O DE_RegularGrid.sql
8. wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/shapes/DE_MakeSquare.sql -O DE_MakeSquare.sql
9. # load sql custom functions
10. for i in *.sql; do
11. psql -U \$username -d \$dbname -f \$i
12. done
13. SQL=\$(cat<<EOF
14. -------------------------
15. ------- SQL BLOCK -------
16. -------------------------
17. DROP TABLE IF EXISTS regular_grid_32k;
18. CREATE TABLE regular_grid_32k AS (
19. WITH s AS (SELECT DE_RegularGrid(ST_Envelope(wkb_geometry),32000) as wkb_geometry FROM abs_aus11_multi)
20. SELECT row_number() over() as tid, wkb_geometry::geometry(Polygon, 3577) FROM s);
21. -------------------------
22. EOF
23. )
24. echo "\$SQL"  # comment to suppress printing
25. # execute SQL STATEMENT or comment # to skip
26. psql -U \$username -d \$dbname -c "\$SQL"
27. }
28. # end of bash function
29.
30. # call the bash function or comment # to skip
31. #fn_generategrid
32.
33. #######################################
#######################################
#########  GRID CREATION  ##########
#######################################
# bash function for creating a regular vector grid in PostGIS
fn_generategrid() {
# get sql custom functions
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/lattices/DE_RegularGrid.sql -O DE_RegularGrid.sql
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/shapes/DE_MakeSquare.sql -O DE_MakeSquare.sql
# load sql custom functions
for i in *.sql; do
psql -U \$username -d \$dbname -f \$i
done
SQL=\$(cat<<EOF
-------------------------
------- SQL BLOCK -------
-------------------------
DROP TABLE IF EXISTS regular_grid_32k;
CREATE TABLE regular_grid_32k AS (
WITH s AS (SELECT DE_RegularGrid(ST_Envelope(wkb_geometry),32000) as wkb_geometry FROM abs_aus11_multi)
SELECT row_number() over() as tid, wkb_geometry::geometry(Polygon, 3577) FROM s);
-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip
psql -U \$username -d \$dbname -c "\$SQL"
}
# end of bash function

# call the bash function or comment # to skip
#fn_generategrid

####################################### Tile Processing and Poor Man's Map Reduce

We now create an output table for the tile features to be produced in accordance with our modelling objectives. Each tile feature generated will be written to a single PostGIS table referenced by a unique identifier, a common tile_id, plus a feature category, 'fcat' which in this instance can take on a value of 1,2,3 to denote the feature type.

1. #######################################
2. #########  TILE OUTPUT TABLE  #########
3. #######################################
4. SQL=\$(cat<<EOF
5. -------------------------
6. ------- SQL BLOCK -------
7. -------------------------
8. DROP TABLE IF EXISTS vector_tiles;
9. CREATE TABLE vector_tiles (
10. fid serial,
11. tid integer,
12. wkb_geometry geometry(Multipolygon, 3577),
13. fcat integer
14. );
15. -------------------------
16. EOF
17. )
18. echo "\$SQL"  # comment to suppress printing
19. # execute SQL STATEMENT or comment # to skip
20. psql -U \$username -d \$dbname -c "\$SQL"
21.
22. #######################################
#######################################
#########  TILE OUTPUT TABLE  #########
#######################################
SQL=\$(cat<<EOF
-------------------------
------- SQL BLOCK -------
-------------------------
DROP TABLE IF EXISTS vector_tiles;
CREATE TABLE vector_tiles (
fid serial,
tid integer,
wkb_geometry geometry(Multipolygon, 3577),
fcat integer
);
-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip
psql -U \$username -d \$dbname -c "\$SQL"

#######################################

To populate the table, we must define a bash function "fn_worker" which contains a versatile SQL query pattern (a Poor Man's Map Reduce algorithm) that can be executed either as a single process, or in parallel without any need to modify code. We've written the SQL query using Common Table Expressions to enable subqueries to be used multiple times, as well as to assist overall readability.

Three bash variable substitutions are made at script run time: \$1 and \$2 define the limits of the tile blocks to be processed, whilst \$3 is the name of the source table from which we will generate the fcat1 features (recall we wish to compare how the underlying structure of the four source tables affect computation times).

Again we use 'psql' to execute the SQL statement. Note that the worker function is not limited to the execution of a single SQL statement. Any command line statement can be included in the function, which means varied and sophisticated geo-processing pipelines can be incorporated, combining tools such as GDAL, grass, R.

1. #######################################
2. #####  DEFINE WORKER FUNCTION  ######
3. #######################################
4. # define the worker function to be executed across all cores
5. fn_worker (){
6. source \$dbsettings
7. SQL=\$(cat<<EOF
8. -------------------------
9. ----- SQL STATEMENT -----
10. -------------------------
11. INSERT INTO vector_tiles
12. WITH
13. fcat0 AS (
14.         SELECT
15.         tid,
16.         wkb_geometry as the_geom
17.         FROM regular_grid_32k
18.         WHERE tid >= \$1 AND tid < \$2
19.         ),
20. fcat1 AS (
21.         SELECT
22.         fcat0.tid,
23.         CASE WHEN ST_Within(fcat0.the_geom,rt.wkb_geometry) THEN fcat0.the_geom
24.         ELSE ST_Intersection(fcat0.the_geom,rt.wkb_geometry) END as the_geom
25.         FROM fcat0, \$3 as rt
26.         WHERE ST_Intersects(fcat0.the_geom, rt.wkb_geometry) AND fcat0.the_geom && rt.wkb_geometry
27.         ),
28. fcat1u AS (
29.         SELECT
30.         tid,
31.         ST_Union(the_geom) as the_geom
32.         FROM fcat1
33.         GROUP BY tid
34.         ),
35. fcat2 AS (
36.         SELECT
37.         fcat0.tid,
38.         CASE WHEN ST_IsEmpty(ST_Difference(fcat0.the_geom,fcat1u.the_geom)) THEN NULL
39.         ELSE ST_Difference(fcat0.the_geom,fcat1u.the_geom) END as the_geom
40.         FROM fcat0, fcat1u
41.         WHERE fcat0.tid = fcat1u.tid
42.         )
43. ------------------------
44. ---- unclipped tile ----
45. ------------------------
46. SELECT
47. NEXTVAL('vector_tiles_fid_seq'),
48. tid,
49. ST_Multi(the_geom),
50. 0
51. FROM fcat0
52. WHERE the_geom IS NOT NULL
53. -------------------------
54. UNION ALL
55. -------------------------
56. ----  land features  ----
57. -------------------------
58. SELECT
59. NEXTVAL('vector_tiles_fid_seq'),
60. tid,
61. ST_Multi(the_geom),
62. 1
63. FROM fcat1u
64. WHERE the_geom IS NOT NULL
65. -------------------------
66. UNION ALL
67. -------------------------
68. ---- water features  ----
69. -------------------------
70. SELECT
71. NEXTVAL('vector_tiles_fid_seq'),
72. tid,
73. ST_Multi(the_geom),
74. 2
75. FROM fcat2
76. WHERE the_geom IS NOT NULL;
77. -------------------------
78. EOF
79. )
80. echo "\$SQL"  # comment to suppress printing
81. # execute SQL STATEMENT
82. psql -U \$username -d \$dbname -c "\$SQL"
83. }
84. # end of worker function
85.
86. # make worker function visible to GNU Parallel across all cores
87. export -f fn_worker
88.
89. #######################################
#######################################
#####  DEFINE WORKER FUNCTION  ######
#######################################
# define the worker function to be executed across all cores
fn_worker (){
source \$dbsettings
SQL=\$(cat<<EOF
-------------------------
----- SQL STATEMENT -----
-------------------------
INSERT INTO vector_tiles
WITH
fcat0 AS (
SELECT
tid,
wkb_geometry as the_geom
FROM regular_grid_32k
WHERE tid >= \$1 AND tid < \$2
),
fcat1 AS (
SELECT
fcat0.tid,
CASE WHEN ST_Within(fcat0.the_geom,rt.wkb_geometry) THEN fcat0.the_geom
ELSE ST_Intersection(fcat0.the_geom,rt.wkb_geometry) END as the_geom
FROM fcat0, \$3 as rt
WHERE ST_Intersects(fcat0.the_geom, rt.wkb_geometry) AND fcat0.the_geom && rt.wkb_geometry
),
fcat1u AS (
SELECT
tid,
ST_Union(the_geom) as the_geom
FROM fcat1
GROUP BY tid
),
fcat2 AS (
SELECT
fcat0.tid,
CASE WHEN ST_IsEmpty(ST_Difference(fcat0.the_geom,fcat1u.the_geom)) THEN NULL
ELSE ST_Difference(fcat0.the_geom,fcat1u.the_geom) END as the_geom
FROM fcat0, fcat1u
WHERE fcat0.tid = fcat1u.tid
)
------------------------
---- unclipped tile ----
------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
0
FROM fcat0
WHERE the_geom IS NOT NULL
-------------------------
UNION ALL
-------------------------
----  land features  ----
-------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
1
FROM fcat1u
WHERE the_geom IS NOT NULL
-------------------------
UNION ALL
-------------------------
---- water features  ----
-------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
2
FROM fcat2
WHERE the_geom IS NOT NULL;
-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT
psql -U \$username -d \$dbname -c "\$SQL"
}
# end of worker function

# make worker function visible to GNU Parallel across all cores
export -f fn_worker

#######################################

We now invoke GNU Parallel, a shell tool for executing jobs in parallel (or sequentially if needed). I prefer GNU Parallel over Python because of the pure simplicity by which one can apply Map Reduce concepts from the command line or bash.

To launch GNU Parallel, we must first construct a job list. My preferred method is to use a CSV file which can be generated dynamically using psql. A CSV file also provides us with an easy way to feed multiple arguments into each job (which the worker function will substitute accordingly).

In our case, we use the generate_series function in PostgreSQL to define how the tiles are to be chunked and processed. The SQL code block shows 3 variants, so be sure to leave only one line uncommented.

1. #######################################
2. ##########  CREATE JOB LIST  ###########
3. #######################################
4. # create job list to feed GNU Parallel.  The examples below show different ways of chunking the work.
5. SQL=\$(cat<<EOF
6. -------------------------
7. ------- SQL BLOCK -------
8. -------------------------
9. -- generate a single job comprising 24321 tiles (i.e. entire table processed on a single core)
10. COPY (SELECT i as lower, i+24321 as upper FROM generate_series(1,24321,24321) i) TO STDOUT WITH CSV;
11.
12. -- generate multiple jobs, each comprising 1 tile (i.e. 24321 jobs in total)
13. --COPY (SELECT i as lower, i+1 as upper FROM generate_series(1,24321,1) i) TO STDOUT WITH CSV;
14.
15. -- generate multiple jobs each comprising 100 tiles (i.e. tiles processed in batches of 100)
16. -- COPY (SELECT i as lower, i+100 as upper FROM generate_series(1,24321,100) i) TO STDOUT WITH CSV;
17. -------------------------
18. EOF
19. )
20. echo "\$SQL"  # comment to suppress printing
21. # execute SQL STATEMENT
22. psql -U \$username -d \$dbname -c "\$SQL" > joblist.csv
23.
24. #######################################
#######################################
##########  CREATE JOB LIST  ###########
#######################################
# create job list to feed GNU Parallel.  The examples below show different ways of chunking the work.
SQL=\$(cat<<EOF
-------------------------
------- SQL BLOCK -------
-------------------------
-- generate a single job comprising 24321 tiles (i.e. entire table processed on a single core)
COPY (SELECT i as lower, i+24321 as upper FROM generate_series(1,24321,24321) i) TO STDOUT WITH CSV;

-- generate multiple jobs, each comprising 1 tile (i.e. 24321 jobs in total)
--COPY (SELECT i as lower, i+1 as upper FROM generate_series(1,24321,1) i) TO STDOUT WITH CSV;

-- generate multiple jobs each comprising 100 tiles (i.e. tiles processed in batches of 100)
-- COPY (SELECT i as lower, i+100 as upper FROM generate_series(1,24321,100) i) TO STDOUT WITH CSV;
-------------------------
EOF
)
echo "\$SQL"  # comment to suppress printing
# execute SQL STATEMENT
psql -U \$username -d \$dbname -c "\$SQL" > joblist.csv

#######################################

The final step is to execute the joblist. The syntax for GNU Parallel is fairly straight forward. The left hand side of the "|" is the name of the joblist file which feeds into the "parallel" command. --colsep ',' tells GNU Parallel each column of the joblist CSV file is delimited by a comma. Next we instruct GNU parallel to call the worker function (in this case fn_worker). {n} are the joblist CSV column numbers that inform the worker function what values to substitute for \$1,\$2 when launching a process. Finally, "abs_sa211_dumped" is the name of the source table that the worker function will substitute for \$3. And that is it.

1. #######################################
2. ##########  EXECUTE JOBS  ###########
3. #######################################
4. cat joblist.csv | parallel --colsep ',' fn_worker {1} {2} abs_sa211_multi
5. wait
6. #######################################
7.
8. # stop timer
9. END_T1=\$(date +%s)
10. TOTAL_DIFF=\$(( \$END_T1 - \$START_T1 ))
11. echo "TOTAL SCRIPT TIME: \$TOTAL_DIFF"
12.
13. # end of script
14. #######################################
#######################################
##########  EXECUTE JOBS  ###########
#######################################
cat joblist.csv | parallel --colsep ',' fn_worker {1} {2} abs_sa211_multi
wait
#######################################

# stop timer
END_T1=\$(date +%s)
TOTAL_DIFF=\$(( \$END_T1 - \$START_T1 ))
echo "TOTAL SCRIPT TIME: \$TOTAL_DIFF"

# end of script
#######################################

Results of Query Times We can draw several insights from our results.

1. The time to vector tile the two abs_sa211 tables are orders of magnitude faster than the abs_aus11 tables. The reason for this is that abs_sa211 tables consist of smaller polygon geometries than the abs_aus11 tables, and that the ST_Intersection and union of small features is much faster than ST_Intersection of big features. The "eating the elephant in small bites" principle. The very fast "single process" query times for abs_sa211 tables also suggest there is little advantage to be gained by parallelising the tiling process for abs_sa211 with GNU Parallel.
2. Regardless of whether you use a single process or GNU parallel, the dumping of multipolygon elements using ST_Dump is good practice that can produce significant performance gains. When tiling the abs_aus11 tables, the "dumping" step alone gave in excess of a 7x performance improvement. Multipolygon elements are stored as arrays, and are represented by one large bounding box. Hence indexes on large multipolygons are less effective than if each polygon element is dumped and stored as its own row.
3. A single query process can at times be faster than a GNU Parallel process that treats each tile as a separate job (i.e. batch size = 1). However this may not always hold true, as with abs_aus11_dumped.
4. Tile batch size is an important pipeline tuning parameter. Rather than treat each tile as a separate job, it is faster to process the tiles in batches. For this tutorial, 500 tiles per batch produced the fastest overall query time. However this may not hold true for every use case. When processing big, computationally complex queries and data pipelines (such as producing Open Street Map routable vector tiles), we find the processing of individual tiles as separate jobs to be much faster and easier than in batch.
5. Another tuning parameter which we haven't mentioned is the tile size itself. This tutorial assumed a 32km x 32km tile size. Generally speaking, smaller tile sizes will take longer to process than larger tile sizes, in which GNU Parallel and multiple cores become advantageous.
Conclusions

PostGIS and the postgis community provides us with an amazing tool and support base that we'd be at a loss without. When learning PostGIS, we tend to invest in great books and incrementally read up on the spatial functions relevant to each use case. But in the emerging era of big data, a topic I believe deserves more attention is that of query "scalability". Often queries that perform "okay" on small datasets, or "pilots" do not always scale well on big datasets - in which case the query patterns themselves can act as "pipeline bottlenecks" in technical speak, or "project showstoppers" in business speak.

This tutorial introduces several techniques for efficiently working with big datasets in PostGIS that easily lend themselves to "scaling out" in a multicore platform environment. However understanding the performance of different query designs and patterns is a critical pre-cursor to the solution scaling phase where often small and "acceptable" inefficiencies can take on ugly new complexions.

Personally I find "Eating the Elephant in Small Bites" an apt and memorable adage that quintessentially captures the essence of vector tiling and query parallelisation in PostGIS. There are many additional factors and techniques to consider when applying these methods in practice (such as substituting the use of a regular grid with a quadgrid function that sees the vector tiles recursively subdivide in size to a given depth or feature count). But I took my own advice, and decided not to "try to eat the elephant in one bite". The topic is simply too broad to do the discussion justice in one post. I nevertheless hope that this tutorial will stimulate thinking whilst raising awareness amongst postgis community members of the possibilities offered by vector tiling and Map Reduce concepts.

In my next post, I will apply the techniques introduced in this tutorial to efficiently vector tile an Alberta Land Use dataset that is beset with large, highly complex polygons - a query run time that can be reduced from many days to several minutes.

Syndicate to my PostGIS feed

http://www.dimensionaledge.com/tag/postgis/feed