

Goodbye Dimensional Edge, Hello Cartovera!

This is my last blog post.

The Cartovera project has been brewing for 12 months, and it's now time to lift the veil.
Check out www.cartovera.com

Please contact me via the cartovera website.

 

Eating the Elephant in Small Bites – An Introduction to Vector Tiling and Map Reduce in PostGIS

“Stop trying to eat the elephant with one bite”


This tutorial introduces two concepts which I’ve found invaluable when working with “large” datasets in PostGIS.  The first concept is that of vector tiling, a method whereby large geometries are chunked down into smaller geometries using a regular grid or tile structure to accelerate the processing times of computationally intensive queries.  The second concept is Map Reduce, a scalable programming paradigm for the processing and analysis of large datasets with parallel, distributed algorithms on multi-core platforms.

In this tutorial, we apply these techniques to variations of a dataset containing large multipolygons. We compare and contrast different scenarios for producing the tiles in order to measure the impact of different strategies on query times. Our findings show query times can be substantially reduced, in this case by a factor of 7, by first “dumping” each member of the polygon collection into its own row. A further 7-fold reduction in query times can be obtained by parallelising the query in batches of 500 tiles across all available cores.

We believe that the on-going development of parallel processing techniques for PostgreSQL/PostGIS can only serve to enhance the power of these tools in their dual capacity as a database backend and as a stand-alone geo-processing engine.

Our machine setup for this tutorial is a CentOS 6.5 Linux instance running on AWS, with 16 vCPUs and 30 GB of memory. For Windows users and Linux users who prefer not to script in bash, the parallel execution component of this tutorial can be achieved using Python.

The full code for this tutorial can be downloaded from GitHub here.

Data Modelling Objectives

Vector tiles are a recent phenomenon most often applied in the context of web mapping. Much like raster tiles, vector tiles are simply vector representations of geographic data whereby vector features are clipped to the boundary of each tile. This makes the downstream processing or consumption of these features more efficient, whether by a web mapping application or, as is the focus of this tutorial, by a PostGIS query.
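To make the clipping idea concrete, here is a minimal SQL sketch of the core operation. The table and column names (my_features, my_tiles) are hypothetical and used only for illustration; the tutorial's actual tables are built below.

-- Clip every feature to each tile it overlaps. ST_Intersects filters the candidate
-- feature/tile pairs first, so ST_Intersection only runs on pairs that actually overlap.
SELECT t.tile_id,
       f.feature_id,
       ST_Intersection(f.geom, t.geom) AS clipped_geom
FROM my_features f
JOIN my_tiles t ON ST_Intersects(f.geom, t.geom);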

We will demonstrate a process for generating vector tiles in PostGIS using a dataset representing the Australian continent – but with a twist. We also require vector features representing marine waters. Put succinctly, our data modelling objectives are to create:

  • unclipped tile geometries
  • tiled features representing land (i.e. clipped to the coast)
  • tiled features representing marine waters (i.e. the difference not represented by land)
Data Preparation

The source dataset is a shapefile containing the SA2 statistical boundaries of Australia.  The shapefile is freely available for download from: here. The steps for downloading and ingesting the data into PostgreSQL are described below. Note that our database connection settings are stored as variables in a shell script called "current.sh" which is sourced at the top of the code block. I have functionalised parts of the code so that one-off steps (e.g. loading the data) can easily be skipped if the script is run multiple times. You will also see that within the SQL code block we create three variations of the source table, enabling us to measure the incremental performance gained by "dumping" the multipolygon elements before tiling.

#!/bin/bash
# A tutorial that introduces the concepts of vector tiling and Map Reduce in PostGIS
# Written by: mark[at]dimensionaledge[dot]com
# Licence: GNU GPL version 3.0
# Start timer
START_T1=$(date +%s)

# DB Connection Settings
dbsettings="/mnt/data/common/repos/cf_private/settings/current.sh"
export dbsettings
source $dbsettings

#######################################
######### DATA PREPARATION ##########
#######################################
# bash function for downloading and ingesting required data into PostgreSQL
fn_getdata() {
wget "http://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&1270055001_sa2_2011_aust_shape.zip&1270.0.55.001&Data%20Cubes&7130A5514535C5FCCA257801000D3FBD&0&July%202011&23.12.2010&Latest" -O sa211.zip
unzip sa211.zip
# Load dataset into PostgreSQL
ogr2ogr -f "PostgreSQL" PG:"host=$host user=$username password=$password dbname=$dbname" SA2_2011_AUST.shp -nln  abs_sa211_multi -s_srs EPSG:4283 -t_srs EPSG:3577 -a_srs EPSG:3577 -nlt MULTIPOLYGON -overwrite
}
# end of bash function
# call the bash function or comment # to skip
fn_getdata 

# Create three additional PostGIS tables (with indexes) representing the union and dumped constituents of all SA2 geometries
SQL=$(cat<<EOF 	 	
-------------------------	 	 
------- SQL BLOCK -------	 	 
-------------------------	 	 
DROP TABLE IF EXISTS abs_sa211_dumped;	 	 
CREATE TABLE abs_sa211_dumped AS	 	 
SELECT row_number() over () as ogc_fid, sa2_main11::text as poly_id, (ST_Dump(wkb_geometry)).geom::geometry(Polygon, 3577) as wkb_geometry FROM abs_sa211_multi;	 	 
CREATE INDEX abs_sa211_dumped_geom_idx ON abs_sa211_dumped USING GIST(wkb_geometry);	 	 

DROP TABLE IF EXISTS abs_aus11_multi;	 	 
CREATE TABLE abs_aus11_multi AS	 	 
SELECT 1 as ogc_fid, ST_Multi(ST_Union(wkb_geometry))::geometry(Multipolygon, 3577) as wkb_geometry FROM abs_sa211_multi;
CREATE INDEX abs_aus11_multi_geom_idx ON abs_aus11_multi USING GIST(wkb_geometry);	 	 

DROP TABLE IF EXISTS abs_aus11_dumped;	 	 
CREATE TABLE abs_aus11_dumped AS	 	 
SELECT row_number() over () as ogc_fid, (ST_Dump(wkb_geometry)).geom::geometry(Polygon, 3577) as wkb_geometry FROM abs_aus11_multi;	 	 
CREATE INDEX abs_aus11_dumped_geom_idx ON abs_aus11_dumped USING GIST(wkb_geometry);	 	 
-------------------------	 	 
EOF
)
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip		 	 
#psql -U $username -d $dbname -c "$SQL"  ### alternatively, comment out line with single '#' to skip this step

#######################################


If we examine the underlying structure of our source tables, we find that each table looks fundamentally different in terms of the number of rows, geometries and points. The extreme case is the abs_aus11_multi table, which comprises one (1) row representing the entire Australian continent. The relevance of querying large multipolygons as well as other key differences will become clearer when we compare the query times for producing our vector tiles.

--abs_sa211_multi
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_sa211_multi;
 num_rows | num_geoms | num_points 
----------+-----------+------------
     2214 |      8917 |    4164582

--abs_sa211_dumped
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_sa211_dumped;
 num_rows | num_geoms | num_points 
----------+-----------+------------
     8917 |      8917 |    4164582

--abs_aus11_multi
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_multi;
 num_rows | num_geoms | num_points 
----------+-----------+------------
        1 |      6718 |    1622598

--abs_aus11_dumped
SELECT COUNT(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_dumped;
 num_rows | num_geoms | num_points 
----------+-----------+------------
     6718 |      6718 |    1622598

 

Grid Creation

Next we create a 32km x 32km regular grid or lattice structure that becomes the stencil for each tile. There are many ways to create a lattice structure in PostGIS, but since we work with grids on an almost daily basis, we utilise two custom SQL functions that can also be freely downloaded from our public GitHub repository. We create a table "regular_grid_32k" to hold the geometries returned by the custom function DE_RegularGrid(), together with a tile_id (which we refer to as a "tid").

#######################################
#########  GRID CREATION  ##########
#######################################
# bash function for creating a regular vector grid in PostGIS
fn_generategrid() {
# get sql custom functions
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/lattices/DE_RegularGrid.sql -O DE_RegularGrid.sql
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/shapes/DE_MakeSquare.sql -O DE_MakeSquare.sql
# load sql custom functions
for i in *.sql; do
psql -U $username -d $dbname -f $i
done
SQL=$(cat<<EOF 	 	
-------------------------	 	 
------- SQL BLOCK -------	 	 
-------------------------
DROP TABLE IF EXISTS regular_grid_32k;
CREATE TABLE regular_grid_32k AS (
WITH s AS (SELECT DE_RegularGrid(ST_Envelope(wkb_geometry),32000) as wkb_geometry FROM abs_aus11_multi)
SELECT row_number() over() as tid, wkb_geometry::geometry(Polygon, 3577) FROM s);
-------------------------	 	 
EOF
)
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip	 	
psql -U $username -d $dbname -c "$SQL"
}
# end of bash function

# call the bash function or comment # to skip
#fn_generategrid 

#######################################


Tile Processing and Poor Man's Map Reduce

We now create an output table for the tile features to be produced in accordance with our modelling objectives. Each tile feature generated will be written to a single PostGIS table, referenced by a unique identifier and a common tile_id, plus a feature category 'fcat', which in this instance takes a value of 0, 1 or 2 to denote the feature type (unclipped tile, land or water respectively).

#######################################
#########  TILE OUTPUT TABLE  #########
#######################################
SQL=$(cat<<EOF 	 	
-------------------------	 	 
------- SQL BLOCK -------	 	 
-------------------------
DROP TABLE IF EXISTS vector_tiles;
CREATE TABLE vector_tiles (
fid serial,
tid integer,
wkb_geometry geometry(Multipolygon, 3577),
fcat integer
);
------------------------- 
EOF
)
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip		  	  	
psql -U $username -d $dbname -c "$SQL"

#######################################

To populate the table, we define a bash function "fn_worker" containing a versatile SQL query pattern (a Poor Man's Map Reduce algorithm) that can be executed either as a single process or in parallel, without any need to modify code. We've written the SQL query using Common Table Expressions so that subqueries can be reused multiple times, and to assist overall readability.

Three bash variable substitutions are made at script run time: $1 and $2 define the limits of the tile blocks to be processed, whilst $3 is the name of the source table from which we will generate the fcat1 features (recall we wish to compare how the underlying structure of the four source tables affects computation times).

Again we use 'psql' to execute the SQL statement. Note that the worker function is not limited to the execution of a single SQL statement. Any command line statement can be included in the function, which means varied and sophisticated geo-processing pipelines can be incorporated, combining tools such as GDAL, GRASS and R.
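As a purely illustrative example (not part of the timed tutorial), the worker function defined below could append an export step after its psql call, so that each batch of tiles is also written out to disk:

# Hypothetical extension inside fn_worker: export the batch of tiles just created
# to a GeoJSON file named after the batch limits ($1 and $2).
ogr2ogr -f GeoJSON "tiles_${1}_${2}.geojson" \
  PG:"host=$host user=$username password=$password dbname=$dbname" \
  -sql "SELECT * FROM vector_tiles WHERE tid >= $1 AND tid < $2"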

#######################################
#####  DEFINE WORKER FUNCTION  ######
#######################################
# define the worker function to be executed across all cores
fn_worker (){
source $dbsettings
SQL=$(cat<<EOF
-------------------------
----- SQL STATEMENT -----
-------------------------
INSERT INTO vector_tiles
WITH
fcat0 AS (
		SELECT
		tid,
		wkb_geometry as the_geom
		FROM regular_grid_32k
		WHERE tid >= $1 AND tid < $2
		),
fcat1 AS (
		SELECT
		fcat0.tid,
		CASE WHEN ST_Within(fcat0.the_geom,rt.wkb_geometry) THEN fcat0.the_geom
		ELSE ST_Intersection(fcat0.the_geom,rt.wkb_geometry) END as the_geom
		FROM fcat0, $3 as rt
		WHERE ST_Intersects(fcat0.the_geom, rt.wkb_geometry) AND fcat0.the_geom && rt.wkb_geometry
		),
fcat1u AS (
		SELECT
		tid,
		ST_Union(the_geom) as the_geom
		FROM fcat1
		GROUP BY tid
		),
fcat2 AS (
		SELECT
		fcat0.tid,
		CASE WHEN ST_IsEmpty(ST_Difference(fcat0.the_geom,fcat1u.the_geom)) THEN NULL
		ELSE ST_Difference(fcat0.the_geom,fcat1u.the_geom) END as the_geom
		FROM fcat0, fcat1u
		WHERE fcat0.tid = fcat1u.tid
		)
------------------------
---- unclipped tile ----
------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
0
FROM fcat0
WHERE the_geom IS NOT NULL
-------------------------
UNION ALL
-------------------------
----  land features  ----
-------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
1
FROM fcat1u
WHERE the_geom IS NOT NULL
-------------------------
UNION ALL
-------------------------
---- water features  ----
-------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
2
FROM fcat2
WHERE the_geom IS NOT NULL;
-------------------------
EOF
)
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT
psql -U $username -d $dbname -c "$SQL"
}
# end of worker function

# make worker function visible to GNU Parallel across all cores
export -f fn_worker

#######################################

We now invoke GNU Parallel, a shell tool for executing jobs in parallel (or sequentially if needed). I prefer GNU Parallel over Python because of the sheer simplicity with which one can apply Map Reduce concepts from the command line or bash.
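If you haven't used GNU Parallel before, the basic pattern is easy to try in isolation. A toy example (not part of the tutorial script):

# Each input line becomes one job; jobs run concurrently across the available cores.
seq 1 4 | parallel 'echo "started job {}"; sleep 2; echo "finished job {}"'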

To launch GNU Parallel, we must first construct a job list. My preferred method is to use a CSV file which can be generated dynamically using psql. A CSV file also provides us with an easy way to feed multiple arguments into each job (which the worker function will substitute accordingly).

In our case, we use the generate_series function in PostgreSQL to define how the tiles are to be chunked and processed. The SQL code block shows 3 variants, so be sure to leave only one line uncommented.

#######################################
##########  CREATE JOB LIST  ###########
#######################################
# create job list to feed GNU Parallel.  The examples below show different ways of chunking the work.
SQL=$(cat<<EOF 	 	
-------------------------	 	 
------- SQL BLOCK -------	 	 
-------------------------
-- generate a single job comprising 24321 tiles (i.e. entire table processed on a single core)
COPY (SELECT i as lower, i+24321 as upper FROM generate_series(1,24321,24321) i) TO STDOUT WITH CSV;

-- generate multiple jobs, each comprising 1 tile (i.e. 24321 jobs in total)
--COPY (SELECT i as lower, i+1 as upper FROM generate_series(1,24321,1) i) TO STDOUT WITH CSV;

-- generate multiple jobs each comprising 100 tiles (i.e. tiles processed in batches of 100)
-- COPY (SELECT i as lower, i+100 as upper FROM generate_series(1,24321,100) i) TO STDOUT WITH CSV; 	 
-------------------------
EOF
)	 	 
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT 	
psql -U $username -d $dbname -c "$SQL" > joblist.csv

#######################################

The final step is to execute the job list. The syntax for GNU Parallel is fairly straightforward. On the left-hand side of the "|", the joblist file is streamed into the "parallel" command. --colsep ',' tells GNU Parallel that each column of the joblist CSV file is delimited by a comma. Next we instruct GNU Parallel to call the worker function (in this case fn_worker). The {n} placeholders are the joblist CSV column numbers whose values are substituted for $1 and $2 when each process is launched. Finally, "abs_sa211_multi" is the name of the source table that the worker function will substitute for $3 (swap this for any of the other three source tables to compare timings). And that is it.

#######################################
##########  EXECUTE JOBS  ###########
#######################################
cat joblist.csv | parallel --colsep ',' fn_worker {1} {2} abs_sa211_multi
wait
#######################################

# stop timer
END_T1=$(date +%s)
TOTAL_DIFF=$(( $END_T1 - $START_T1 ))
echo "TOTAL SCRIPT TIME: $TOTAL_DIFF"

# end of script
#######################################

 

Results of Query Times

 
 
[Figure: query times for each source table, comparing a single process against GNU Parallel at various batch sizes]

We can draw several insights from our results.

  1. The times to vector tile the two abs_sa211 tables are orders of magnitude faster than for the abs_aus11 tables. The reason is that the abs_sa211 tables consist of smaller polygon geometries than the abs_aus11 tables, and the ST_Intersection and union of small features is much faster than the ST_Intersection of big features: the "eating the elephant in small bites" principle. The very fast "single process" query times for the abs_sa211 tables also suggest there is little advantage to be gained by parallelising the tiling process for abs_sa211 with GNU Parallel.
  2. Regardless of whether you use a single process or GNU Parallel, dumping multipolygon elements using ST_Dump is good practice that can produce significant performance gains. When tiling the abs_aus11 tables, the "dumping" step alone gave in excess of a 7x performance improvement. Multipolygon elements are stored as arrays and are represented in the spatial index by one large bounding box, so indexes on large multipolygons are less effective than when each polygon element is dumped and stored as its own row (see the sketch after this list).
  3. A single query process can at times be faster than a GNU Parallel run that treats each tile as a separate job (i.e. batch size = 1). However this does not always hold true, as the abs_aus11_dumped results show.
  4. Tile batch size is an important pipeline tuning parameter. Rather than treating each tile as a separate job, it is faster to process the tiles in batches. For this tutorial, 500 tiles per batch produced the fastest overall query time, but this may not hold true for every use case. When processing big, computationally complex queries and data pipelines (such as producing OpenStreetMap routable vector tiles), we find processing individual tiles as separate jobs to be much faster and easier than batching.
  5. Another tuning parameter which we haven't mentioned is the tile size itself. This tutorial assumed a 32km x 32km tile size. Generally speaking, smaller tile sizes will take longer to process than larger tile sizes, in which case GNU Parallel and multiple cores become advantageous.
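To see the bounding box effect described in insight 2 for yourself, a quick illustrative check is to compare the boxes the spatial index works with in the multi and dumped tables:

-- One row whose bounding box spans the entire continent:
SELECT ogc_fid, Box2D(wkb_geometry) FROM abs_aus11_multi;
-- Many rows, each with a much tighter bounding box:
SELECT ogc_fid, Box2D(wkb_geometry) FROM abs_aus11_dumped LIMIT 5;
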
Conclusions

PostGIS and the PostGIS community provide us with an amazing tool and support base that we'd be at a loss without. When learning PostGIS, we tend to invest in great books and incrementally read up on the spatial functions relevant to each use case. But in the emerging era of big data, a topic I believe deserves more attention is query "scalability". Queries that perform "okay" on small datasets or "pilots" do not always scale well on big datasets, in which case the query patterns themselves can become "pipeline bottlenecks" in technical speak, or "project showstoppers" in business speak.

This tutorial introduces several techniques for efficiently working with big datasets in PostGIS that lend themselves easily to "scaling out" in a multicore platform environment. However, understanding the performance of different query designs and patterns is a critical precursor to the scaling phase, where small and "acceptable" inefficiencies can take on ugly new complexions.

Personally I find "Eating the Elephant in Small Bites" an apt and memorable adage that captures the essence of vector tiling and query parallelisation in PostGIS. There are many additional factors and techniques to consider when applying these methods in practice (such as substituting the regular grid with a quadgrid function that recursively subdivides the vector tiles to a given depth or feature count). But I took my own advice and decided not to "try to eat the elephant in one bite". The topic is simply too broad to do the discussion justice in one post. I nevertheless hope that this tutorial will stimulate thinking whilst raising awareness amongst PostGIS community members of the possibilities offered by vector tiling and Map Reduce concepts.

In my next post, I will apply the techniques introduced in this tutorial to efficiently vector tile an Alberta Land Use dataset beset with large, highly complex polygons, reducing the query run time from many days to several minutes.