PostgreSQL parallel sequential scan was committed to the git master a couple of weeks ago, and initial tests have been promising. So far the parallel functionality is very limited, just sequential scans, but that's still enough to test against basic PostGIS functions.
- Compiled PostgreSQL from the master mirror at GitHub
- Compiled the PostGIS from SVN trunk
- Loaded my favourite collection of BC voting areas, then duplicated them 100 times to create a table of 800K polygons
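The duplication step can be sketched roughly as follows, assuming the original load went into a table called `va` with a `geom` column (the table and column names here are illustrative, not the exact ones used):

```sql
-- Cross-join against a 100-row series to make 100 copies of each
-- polygon, generating a fresh id as we go.
CREATE TABLE vabig AS
  SELECT row_number() OVER () AS gid, geom
  FROM va
  CROSS JOIN generate_series(1, 100);
```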
```
pramsey=# select count(*) from vabig;
 count
--------
 816700
(1 row)

Time: 434.291 ms
```
In order to avoid problems with functions that have side effects, parallel query support requires that any function it parallelizes be declared PARALLEL SAFE. So I have to re-declare some PostGIS functions.
```sql
CREATE OR REPLACE FUNCTION ST_Area(geometry)
  RETURNS FLOAT8
  AS '$libdir/postgis-2.3', 'area'
  LANGUAGE 'c' IMMUTABLE STRICT
  PARALLEL SAFE;
```
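To confirm the re-declaration took, the parallel marking can be checked in the system catalog: `pg_proc` now carries a `proparallel` column, where `s` means safe, `r` restricted, and `u` unsafe. Something like:

```sql
-- Check the parallel-safety flag on the redefined function.
SELECT proname, proparallel
FROM pg_proc
WHERE proname = 'st_area';
```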
Then, in order to get parallel behaviour, the max_parallel_degree variable has to be set to a non-zero value. The variable controls roughly how many CPUs will be working on a given statement, and defaults to zero, disabling parallel query support.

```sql
SET max_parallel_degree = 2;
```
Now we can run test queries, increasing the parallelism each time. Here's the baseline query:
```
pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vabig where st_area(geom) > 4000000000;
 count
-------
  5500
(1 row)

Time: 2316.670 ms
```
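One way to confirm that a non-zero setting actually yields a parallel plan is to look at EXPLAIN output; with workers enabled, the plan should include a Gather node over a Parallel Seq Scan (the exact plan shape will vary with version, settings, and table statistics):

```sql
SET max_parallel_degree = 2;
EXPLAIN SELECT count(gid) FROM vabig WHERE st_area(geom) > 4000000000;
-- Look for a plan along these lines:
--   Aggregate
--     ->  Gather
--           ->  Parallel Seq Scan on vabig
--                 Filter: (st_area(geom) > '4000000000')
```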
I increased max_parallel_degree and re-ran the query each time:
Testing on a dual-core laptop, the first surprise is that the first increment of max_parallel_degree halves the run time. Perhaps the number of workers is in addition to the parent process, so at that point both CPUs are busy. As further workers are added, the overhead of setting up the parallel processes seems to swamp the gains from the extra help, though things do keep getting a little better.
Once again, in order to get parallel behaviour, re-defining the function as "safe" is the first thing to do.
```sql
CREATE OR REPLACE FUNCTION ST_Buffer(geometry, float8)
  RETURNS geometry
  AS '$libdir/postgis-2.3', 'buffer'
  LANGUAGE 'c' IMMUTABLE STRICT COST 1000
  PARALLEL SAFE;
```
This time the test query uses buffer and area:
```
pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vasm
          where st_area(st_buffer(geom,50)) > 4000000000;
 count
-------
   550
(1 row)

Time: 13328.808 ms
```
The full test still shows good performance improvements:
This is a much heavier calculation than the plain area, but it shows basically the same progression, with a slightly surprising extra boost as concurrency rises above the number of cores. So, using PostGIS functions in filters on sequential scans, we get pretty good gains, using all the cores.
The developers have been clear that this work is mostly about putting the framework in place, and that the real gains won't happen until other enhancements have been added, and that's easy to demonstrate.
With max_parallel_degree set to four:
```
pramsey=# select sum(st_area(geom)) from vabig;
       sum
-----------------
 106999981737570
(1 row)

Time: 7919.843 ms

pramsey=# select count(gid) from vabig where st_area(geom) > 0;
 count
--------
 816700
(1 row)

Time: 1859.687 ms
```
The amount of work being done is basically the same: calculate the area of every polygon in the table. But it's only being parallelized when the work is part of the filter, not when it's part of the result.
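EXPLAIN makes the difference visible: the filtered count gets a Gather over a parallel scan, while at this stage the aggregate in the target list is planned as a plain sequential scan (again, exact plan shape may vary):

```sql
EXPLAIN SELECT sum(st_area(geom)) FROM vabig;
-- Expect an Aggregate over a plain Seq Scan, with no Gather node:
-- the st_area() call in the target list is not yet parallelized,
-- since parallel aggregation has not landed.
```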
Eventually, operations like parallel aggregates and parallel joins will become available and things will really cook, but for now it's very exciting to see our first parallel queries running in PostgreSQL!