pramsey/pgsql-parallel-postgis.md

## pgsql-parallel-postgis.md

      
    Raw
  

              pgsql-parallel-postgis.md
            
          
    Parallel PostGIS

PostgreSQL Parallel Sequence Scan was committed into the git master a couple weeks ago, and initial tests have been promising.

Robert Haas: Parallel Sequential Scan is Committed!
Amit Kapila: Parallel Sequential Scans in play

So far, the parallel functionality is only very limited, just sequence scans, but that's still enough to test out against basic PostGIS functions.
Setup


Compiled PostgreSQL from the master mirror at GitHub
Compiled the PostGIS from SVN trunk
Loaded my favourite collection of BC voting areas, then duplicated them 100 times to create a table of 800K polygons

pramsey=# select count(*) from vabig ;
 count  
--------
 816700
(1 row)

Time: 434.291 ms

Testing

ST_Area()

In order to avoid problems with functions that have side effects, the parallel query support requires that functions that can be parallelized be declared PARALLEL SAFE. So I have to re-declare some PostGIS functions.
CREATE OR REPLACE FUNCTION ST_Area(geometry)
    RETURNS FLOAT8
    AS '$libdir/postgis-2.3','area'
    LANGUAGE 'c' IMMUTABLE STRICT PARALLEL SAFE;

Then, in order to get parallel behaviour, the max_parallel_degree variable has to be set to a non-zero value. The variable controls roughly how many CPUs will be working on a given statement, and defaults to zero, disabling parallel query support.
set max_parallel_degree = 0;

Now we can run test queries, increasing the parallelism each time. Here's the baseline query:
pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vabig where st_area(geom) > 4000000000;
 count 
-------
  5500
(1 row)

Time: 2316.670 ms

I increased the max_parallel_degree and re-ran the query each time:


max_parallel_degree
time


0
2316.670 ms


1
1467.550 ms


2
1316.406 ms


3
1230.150 ms


4
1242.345 ms


5
1261.117 ms


Testing on a dual-core laptop, the first surprise is that the first max_parallel_degree halves the time. Perhaps the number of workers is in addition to the parent process, so at that point both CPUs are hot. As further workers are added, the overhead of setting up the parallel process seems to swamp any gains from adding help, though things do keep getting a little better.
ST_Buffer()

Once again, in order to get parallel behaviour, re-defining the function as "safe" is the first thing to do.
CREATE OR REPLACE FUNCTION ST_Buffer(geometry,float8)
    RETURNS geometry
    AS '$libdir/postgis-2.3','buffer'
    LANGUAGE 'c' IMMUTABLE STRICT
    COST 1000 PARALLEL SAFE;
This time the test query uses buffer and area:
pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vasm where st_area(st_buffer(geom,50)) > 4000000000;
 count 
-------
   550
(1 row)

Time: 13328.808 ms

The full test still shows good performance improvements:


max_parallel_degree
time


0
13328.808 ms


1
8146.826 ms


2
7441.192 ms


3
7206.604 ms


4
7170.295 ms


5
7087.367 ms


This is a much heavier calculation than the area, but still shows basically the same progression, with a little surprising extra boost as more concurrency over the number of cores are added. So, using PostGIS functions in filters on sequence scans, we get pretty good gains, using all cores.
Quirks

The developers have been clear that this work is mostly about putting the framework in place, and that the real gains won't happen until other enhancements have been added, and that's easy to demonstrate.
With max_parallel_degree set to four:
pramsey=# select sum(st_area(geom)) from vabig ;
       sum       
-----------------
 106999981737570
(1 row)

Time: 7919.843 ms


pramsey=# select count(gid) from vabig where st_area(geom) > 0;
 count  
--------
 816700
(1 row)

Time: 1859.687 ms

The amount of work being done is basically the same: calculate the area of every polygon in the table. But it's only being parallelized when the work is part of the filter, not when it's part of the result.
Eventually, operations like parallel aggregates and parallel joins will become available and things will really cook, but for now it's very exciting to see our first parallel queries running in PostgreSQL!
max_parallel_degree	time
0	2316.670 ms
1	1467.550 ms
2	1316.406 ms
3	1230.150 ms
4	1242.345 ms
5	1261.117 ms
max_parallel_degree	time
0	13328.808 ms
1	8146.826 ms
2	7441.192 ms
3	7206.604 ms
4	7170.295 ms
5	7087.367 ms