From Days to Minutes – Geoprocessing of Alberta Land Use Data

When writing my first blog post, “Eating the Elephant in Small Bites”, a question on the PostGIS mailing list caught my eye because the challenge posed was not dissimilar to one I had encountered when working with Australian land use data. This time the dataset related to land cover in Alberta, Canada, and a PostGIS query using ST_Intersection() was reportedly taking many days to process one particular class of land cover. Fortunately the Alberta dataset is freely available, so I decided to test my PostGIS vector tiling and Map Reduce code pattern against it – and to great effect! This post describes how I addressed the Alberta land class problem, building upon the concepts and practices introduced in my first blog post to deliver query times of just minutes. As with the “Eating the Elephant in Small Bites” tutorial, the machine setup for this exercise is a CentOS 6.5 Linux server instance running on AWS, with 16 vCPUs and 30GB of memory.

The full code for this solution can be downloaded from GitHub here.

Data Understanding

Upon downloading and ingesting the Alberta land cover dataset into PostGIS, the problem becomes self-evident. Land class ‘lc_class 34’ comprises the fewest geometries of any class, yet averages 119 rings per polygon.

SELECT
lc_class,
SUM(ST_NumGeometries(wkb_geometry)) as num_geoms,
SUM(ST_NPoints(wkb_geometry)) as num_points,
round(SUM(ST_NRings(wkb_geometry))::numeric/SUM(ST_NumGeometries(wkb_geometry))::numeric,3) as rings_per_geom
FROM lancover_polygons_2010
GROUP BY lc_class
ORDER BY 1;

 lc_class | num_geoms | num_points | rings_per_geom 
----------+-----------+------------+----------------
       20 |     83793 |    1306686 |          1.029
       31 |      1262 |      33366 |          1.115
       32 |      5563 |     291666 |          1.560
       33 |     12231 |     198385 |          1.023
       34 |       366 |    2646625 |        119.046
       50 |    196681 |    4750590 |          1.165
      110 |    154816 |    3137196 |          1.078
      120 |     83069 |    2833293 |          1.282
      210 |    150550 |    6260788 |          1.522
      220 |    197666 |    4793592 |          1.150
      230 |    112034 |    2419651 |          1.045
(11 rows)

Visually, this is how ‘lc_class 34’ appears in QGIS with green fill.

[Figure: lc_class34 – ‘lc_class 34’ across the full extent]

[Figure: lc_class34_zoom – zoomed-in detail]

Upon closer inspection, we see that ‘lc_class 34’ has a trellis-like structure in which each contiguous block of white space (i.e. the “negative space”) is a polygon inner ring. And therein lies the problem. As with multipolygons, complex polygons with inner rings are stored as arrays, so queries are often slowed by the nature of array operations, and indexes are less effective than if each array element is dumped and stored as its own row. Luckily for us, there are standard PostGIS functions we can use to address this.
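
As a quick diagnostic (a sketch against the same source table; output not shown), we can rank the ‘lc_class 34’ features by ring count to see just how ring-heavy the worst offenders are:

-- sketch: the most ring-heavy 'lc_class 34' features
SELECT
ogc_fid,
ST_NumGeometries(wkb_geometry) as num_polys,
ST_NRings(wkb_geometry) as num_rings
FROM lancover_polygons_2010
WHERE lc_class = 34
ORDER BY num_rings DESC
LIMIT 5;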

Solution

I will describe the solution to the Alberta ‘lc_class 34’ problem sequentially, starting with how to prepare the data by dumping the multipolygons and polygon rings, followed by a description of the fn_worker SQL code pattern. Finally, we parallelise the code across all available cores.

Data preparation involves a two-step “dumping” process, starting with ST_Dump() of the multipolygons, followed by ST_DumpRings() of the polygons dumped in step 1. Note that we record the path number of each polygon ring that is dumped: exterior rings have a path value of 0, whilst interior rings have a path value greater than 0 signifying the interior ring number.
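
To illustrate the path semantics, here is a minimal sketch using a toy polygon with one hole:

-- a square with one hole yields two rings:
-- path {0} for the exterior, path {1} for the hole
SELECT (d).path[1] as ring_path, ST_AsText((d).geom) as ring_geom
FROM (
SELECT ST_DumpRings('POLYGON((0 0,10 0,10 10,0 10,0 0),(2 2,4 2,4 4,2 4,2 2))'::geometry) as d
) s;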

# Create PostGIS tables (with indexes) for the dumped polygons and their rings
SQL=$(cat<<EOF 	 	
-------------------------	 	 
------- SQL BLOCK -------	 	 
-------------------------	 	 
DROP TABLE IF EXISTS landcover_dumped_34;	 	 
CREATE TABLE landcover_dumped_34 (
fid serial,
ogc_fid integer,
wkb_geometry geometry(Polygon, 3400),
lc_class integer,
mod_ty text,
shape_length float8,
shape_area float8
);

INSERT INTO landcover_dumped_34
SELECT
NEXTVAL('landcover_dumped_34_fid_seq'),
ogc_fid,
(ST_Dump(wkb_geometry)).geom,
lc_class,
mod_ty,
shape_length,
shape_area
FROM lancover_polygons_2010 WHERE lc_class = 34;

DROP TABLE IF EXISTS landcover_dumped_34_rings;	 	 
CREATE TABLE landcover_dumped_34_rings (
fid serial,
ogc_fid integer,
path integer,
wkb_geometry geometry(Polygon, 3400)
);

INSERT INTO landcover_dumped_34_rings
WITH s AS (SELECT ogc_fid, (ST_DumpRings(wkb_geometry)).path as path, ST_Buffer(ST_SnapToGrid((ST_DumpRings(wkb_geometry)).geom, 0.1), 0) as the_geom FROM landcover_dumped_34)  -- SNAP to grid, then zero-width BUFFER to make valid
SELECT
NEXTVAL('landcover_dumped_34_rings_fid_seq'),
ogc_fid,
path[1],
the_geom
FROM s;

CREATE INDEX landcover_dumped_34_rings_wkb_geometry_idx ON landcover_dumped_34_rings USING GIST(wkb_geometry);
ANALYZE landcover_dumped_34_rings;

-------------------------	 	 
EOF
)
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip		 	 
psql -U $username -d $dbname -c "$SQL"  ### alternatively, comment out line with single '#' to skip this step

Next we generate a grid layout to serve as the basis for our tiles. For this exercise we use a 2km x 2km tile size, as we wish to keep our “chopped-up” geometries fairly small to enable fast queries of the tiled ‘lc_class 34’ geometries at a later date.

#######################################
#########  GRID CREATION  ##########
#######################################
# bash function for creating a regular vector grid in PostGIS
fn_generategrid() {
# get sql custom functions
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/lattices/DE_RegularGrid.sql -O DE_RegularGrid.sql
wget https://raw.githubusercontent.com/dimensionaledge/cf_public/master/shapes/DE_MakeSquare.sql -O DE_MakeSquare.sql
# load sql custom functions
for i in *.sql; do
psql -U $username -d $dbname -f $i
done

SQL=$(cat<<EOF 	 	
-------------------------	 	 
------- SQL BLOCK -------	 	 
-------------------------
DROP TABLE IF EXISTS regular_grid_2k;
CREATE TABLE regular_grid_2k AS (
WITH s AS (SELECT DE_RegularGrid(ST_Envelope(ST_Collect(wkb_geometry)),2000) as wkb_geometry FROM abmiw2wlcv_48tiles)
SELECT row_number() over() as tid, wkb_geometry::geometry(Polygon, 3400) FROM s);

-------------------------	 	 
EOF
)
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT or comment # to skip	 	
psql -U $username -d $dbname -c "$SQL"
}
# end of bash function

# call the bash function or comment # to skip
fn_generategrid 

#######################################
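
For readers without the Dimensional Edge custom functions, an equivalent grid can be sketched in plain PostGIS with generate_series() and ST_MakeEnvelope(). This is an assumed equivalent only; DE_RegularGrid's actual internals may differ.

-- plain-PostGIS sketch of a 2000m regular grid over the dataset extent
WITH bounds AS (
SELECT ST_Envelope(ST_Collect(wkb_geometry)) as geom FROM abmiw2wlcv_48tiles
)
SELECT
row_number() over() as tid,
ST_SetSRID(ST_MakeEnvelope(x, y, x+2000, y+2000), 3400) as wkb_geometry
FROM bounds,
generate_series(floor(ST_XMin(geom))::int, ceil(ST_XMax(geom))::int, 2000) as x,
generate_series(floor(ST_YMin(geom))::int, ceil(ST_YMax(geom))::int, 2000) as y;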

We also create a vector tile table to hold the outputs of our worker function. Note that the worker function tiles the exterior rings separately from the interior rings. The union of each is taken per tile before applying ST_Difference to remove the “negative space” of the interior rings from each tile.
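
The vector tile table can be created up front. A minimal sketch consistent with the worker's INSERT statement below (the wkb_geometry and val column names are assumptions inferred from that INSERT):

-- sketch of the vector_tiles output table, inferred from fn_worker's INSERT
DROP TABLE IF EXISTS vector_tiles;
CREATE TABLE vector_tiles (
fid serial,
tid integer,
wkb_geometry geometry(MultiPolygon, 3400),
val integer  -- the constant '1' written by fn_worker
);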

#######################################
#####  DEFINE WORKER FUNCTION  ######
#######################################
# define the worker function to be executed across all cores
fn_worker (){
source $dbsettings
SQL=$(cat<<EOF
-------------------------
----- SQL STATEMENT -----
-------------------------
INSERT INTO vector_tiles
WITH
-- f0: the batch of grid tiles assigned to this job
f0 AS (
		SELECT
		tid,
		wkb_geometry as the_geom
		FROM regular_grid_2k
		WHERE tid >= $1 AND tid < $2
		),
-- f1_p0: clip exterior rings (path = 0) to each tile
f1_p0 AS (
		SELECT
		f0.tid,
		CASE WHEN ST_Within(f0.the_geom,rt.wkb_geometry) THEN f0.the_geom
		ELSE ST_Intersection(f0.the_geom,rt.wkb_geometry) END as the_geom
		FROM f0, $3 as rt
		WHERE ST_Intersects(f0.the_geom, rt.wkb_geometry) AND f0.the_geom && rt.wkb_geometry AND rt.path = 0
		),
-- f1_p0u: per-tile union of the clipped exterior rings
f1_p0u AS (
		SELECT
		tid,
		ST_Union(the_geom) as the_geom
		FROM f1_p0
		GROUP BY tid
		),
-- f1_p1: clip interior rings (path > 0) to each tile
f1_p1 AS (
		SELECT
		f0.tid,
		CASE WHEN ST_Within(f0.the_geom,rt.wkb_geometry) THEN f0.the_geom
		ELSE ST_Intersection(f0.the_geom,rt.wkb_geometry)
		END as the_geom
		FROM f0, $3 as rt
		WHERE ST_Intersects(f0.the_geom, rt.wkb_geometry) AND f0.the_geom && rt.wkb_geometry AND rt.path > 0
		),
-- f1_p1u: per-tile union of the clipped interior rings
f1_p1u AS (
		SELECT
		tid,
		ST_Union(the_geom) as the_geom
		FROM f1_p1
		GROUP BY tid
		),
-- f2: subtract the interior-ring union from the exterior-ring union
f2 AS (
		SELECT
		f1_p0u.tid,
		CASE WHEN  f1_p1u.tid IS NULL THEN f1_p0u.the_geom
		WHEN ST_IsEmpty(ST_Difference(f1_p0u.the_geom,f1_p1u.the_geom)) THEN NULL
		ELSE ST_Difference(f1_p0u.the_geom,f1_p1u.the_geom)
		END as the_geom
		FROM f1_p0u LEFT JOIN  f1_p1u
		ON f1_p0u.tid = f1_p1u.tid
		)
------------------------
---- result ----
------------------------
SELECT
NEXTVAL('vector_tiles_fid_seq'),
tid,
ST_Multi(the_geom),
1
FROM f2
WHERE the_geom IS NOT NULL;
-------------------------
EOF
)
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT
psql -U $username -d $dbname -c "$SQL"
}
# end of worker function

# make worker function visible to GNU Parallel across all cores
export -f fn_worker

#######################################
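
For testing, the worker function can be invoked directly for a single batch before handing the full job list to GNU Parallel:

# process a single batch (tiles 1 to 1000) against the rings table
fn_worker 1 1001 landcover_dumped_34_rings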

Next we create a job list, which we proceed to execute in parallel. And voilà!

#######################################
##########  CREATE JOB LIST  ###########
#######################################
# create job list to feed GNU Parallel.
SQL=$(cat<<EOF 	 	
-------------------------	 	 
------- SQL BLOCK -------	 	 
-------------------------
-- create joblist where block size = 1000 tiles (i.e. tiles processed in batches of 1000)
COPY (SELECT i as lower, i+1000 as upper FROM generate_series(1,250000,1000) i) TO STDOUT WITH CSV; 	 
-------------------------
EOF
)	 	 
echo "$SQL"  # comment to suppress printing
# execute SQL STATEMENT 	
psql -U $username -d $dbname -c "$SQL" > joblist.csv

#######################################

#######################################
##########  EXECUTE JOBS  ###########
#######################################
cat joblist.csv | parallel --colsep ',' fn_worker {1} {2} landcover_dumped_34_rings
wait
#######################################
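
For reference, the first few rows of joblist.csv contain the lower and upper tile id bounds of each batch:

1,1001
1001,2001
2001,3001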

Results of Query Times

The results show the impact of a larger tile size on query run times, and on the complexity of the resulting tiled geometries (measured by rings_per_geom). Whilst quadrupling the tile area (from 2km x 2km to 4km x 4km) reduces the run time from 779 seconds to 282 seconds, the rings_per_geom increases – which is understandable given the trellis-like structure of the source data. Both scenarios offer a vast improvement on the many days it reportedly took to process the data using a standard ST_Intersection() query. As to the acceptability of the tile-size trade-off, it really depends on how critical query times are for the downstream processes that will ultimately consume the tiled geometries. It is thus a matter of judgement.
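
The complexity figures below can be reproduced with the same measurement query used in the Data Understanding section, applied to the tiled output of each run (a sketch, assuming the vector_tiles table described above):

SELECT
SUM(ST_NumGeometries(wkb_geometry)) as num_geoms,
SUM(ST_NPoints(wkb_geometry)) as num_points,
round(SUM(ST_NRings(wkb_geometry))::numeric/SUM(ST_NumGeometries(wkb_geometry))::numeric,3) as rings_per_geom
FROM vector_tiles;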

2k x 2k tiles
TOTAL SCRIPT TIME: 779 seconds (batch size = 1000 tiles)
 num_geoms | num_points | rings_per_geom   
-----------+------------+----------------
    144799 |    3471051 |          1.055

4k x 4k tiles
TOTAL SCRIPT TIME: 282 seconds (batch size = 250 tiles)
 num_geoms | num_points | rings_per_geom   
-----------+------------+----------------
     57775 |    2966397 |          1.263

[Figure: lc_class34_tiled – the tiled ‘lc_class 34’ geometries]

Conclusions

The Alberta land cover problem highlights a few of the challenges that analysts typically encounter when working with large, complex datasets. The solution presented here is a practical example of how vector tiling and Map Reduce concepts can deliver real geo-processing efficiency improvements. However, the real business value of reducing query times from days to minutes is to accelerate the 'time to insight' and to amplify the speed and volume of iterations the analyst can work through as part of the geospatial 'value discovery' process. As this exercise attests, in the context of ever-shortening decision windows, the impact that Spatial IT can make over traditional desktop GIS approaches is nothing short of 'game changing'. Welcome to the emerging world of Big Data.