PostgreSQL parallel sequential scan was committed to git master a couple of weeks ago, and initial tests have been promising.
So far the parallel functionality is quite limited, just sequential scans, but that's still enough to test against basic PostGIS functions.
- Compiled PostgreSQL from the master mirror at GitHub
- Compiled PostGIS from SVN trunk
- Loaded my favourite collection of BC voting areas, then duplicated them 100 times to create a table of 800K polygons
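The post doesn't show the duplication step, but it can be done with a single cross join. This is a sketch; the source table name `va` and its columns are assumptions:

```sql
-- Sketch: assumes the source table is named "va" with a geometry
-- column "geom". Cross-joining against generate_series copies
-- every row 100 times; row_number() regenerates a unique gid.
CREATE TABLE vabig AS
SELECT row_number() OVER () AS gid, geom
FROM va, generate_series(1, 100);
```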
pramsey=# select count(*) from vabig ;
count
--------
816700
(1 row)
Time: 434.291 ms
In order to avoid problems with functions that have side effects, parallel query support requires that functions be declared PARALLEL SAFE before they can be parallelized. So I had to re-declare some PostGIS functions.
CREATE OR REPLACE FUNCTION ST_Area(geometry)
RETURNS FLOAT8
AS '$libdir/postgis-2.3','area'
LANGUAGE 'c' IMMUTABLE STRICT PARALLEL SAFE;
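On builds that accept the PARALLEL clause on ALTER FUNCTION, the flag can also be flipped without repeating the whole definition:

```sql
ALTER FUNCTION ST_Area(geometry) PARALLEL SAFE;
```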
Then, in order to get parallel behaviour, the max_parallel_degree variable has to be set to a non-zero value. The variable controls roughly how many CPUs will work on a given statement, and defaults to zero, which disables parallel query support.
set max_parallel_degree = 2;
Now we can run test queries, increasing the parallelism each time. Here's the baseline query:
pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vabig where st_area(geom) > 4000000000;
count
-------
5500
(1 row)
Time: 2316.670 ms
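With parallelism enabled, EXPLAIN confirms the scan itself is being split across workers. The plan shape below is a sketch; node names, worker counts and costs will vary with the build:

```sql
SET max_parallel_degree = 2;
EXPLAIN SELECT count(gid) FROM vabig WHERE st_area(geom) > 4000000000;
-- A parallelized plan shows a Gather node feeding the aggregate,
-- roughly like:
--   Aggregate
--     ->  Gather
--           Number of Workers: 2
--           ->  Parallel Seq Scan on vabig
--                 Filter: (st_area(geom) > '4000000000'::double precision)
```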
I increased max_parallel_degree and re-ran the query each time:
| max_parallel_degree | time |
|---------------------|------|
| 0 | 2316.670 ms |
| 1 | 1467.550 ms |
| 2 | 1316.406 ms |
| 3 | 1230.150 ms |
| 4 | 1242.345 ms |
| 5 | 1261.117 ms |
Testing on a dual-core laptop, the first surprise is that the first increment of max_parallel_degree nearly halves the run time. Perhaps the worker count is in addition to the parent process, so at that point both CPUs are busy. As further workers are added, the overhead of setting up the parallel processes seems to swamp any gains from the extra help, though things do keep getting a little better.
Next, a heavier test: buffering each polygon before measuring its area. Once again, in order to get parallel behaviour, re-declaring the function as PARALLEL SAFE is the first thing to do.
CREATE OR REPLACE FUNCTION ST_Buffer(geometry,float8)
RETURNS geometry
AS '$libdir/postgis-2.3','buffer'
LANGUAGE 'c' IMMUTABLE STRICT
COST 1000 PARALLEL SAFE;
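A quick way to verify the re-declarations took effect is to check the system catalog, where the parallel-safety marking is stored in pg_proc.proparallel:

```sql
SELECT proname, proparallel
FROM pg_proc
WHERE proname IN ('st_area', 'st_buffer');
-- proparallel is 's' (safe), 'r' (restricted) or 'u' (unsafe)
```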
This time the test query uses buffer and area:
pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vasm where st_area(st_buffer(geom,50)) > 4000000000;
count
-------
550
(1 row)
Time: 13328.808 ms
The full test still shows good performance improvements:
| max_parallel_degree | time |
|---------------------|------|
| 0 | 13328.808 ms |
| 1 | 8146.826 ms |
| 2 | 7441.192 ms |
| 3 | 7206.604 ms |
| 4 | 7170.295 ms |
| 5 | 7087.367 ms |
This is a much heavier calculation than the area alone, but it shows basically the same progression, with a small and surprising extra boost as concurrency is pushed beyond the number of physical cores. So, when PostGIS functions appear in filters on sequential scans, we get pretty good gains and make use of all the cores.
The developers have been clear that this work is mostly about putting the framework in place, and that the real gains will come as further enhancements are added. That's easy to demonstrate.
With max_parallel_degree set to four:
pramsey=# select sum(st_area(geom)) from vabig ;
sum
-----------------
106999981737570
(1 row)
Time: 7919.843 ms
pramsey=# select count(gid) from vabig where st_area(geom) > 0;
count
--------
816700
(1 row)
Time: 1859.687 ms
The amount of work being done is basically the same: calculate the area of every polygon in the table. But it's only being parallelized when the work is part of the filter, not when it's part of the result.
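Comparing the two query plans makes the difference visible. The plan shapes described in the comments are sketches of what to expect, not captured output:

```sql
EXPLAIN SELECT sum(st_area(geom)) FROM vabig;
-- Plain Seq Scan under the Aggregate: every st_area() call runs
-- in the single parent process.

EXPLAIN SELECT count(gid) FROM vabig WHERE st_area(geom) > 0;
-- Gather over a Parallel Seq Scan: the st_area() calls in the
-- filter are spread across the workers.
```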
Eventually, operations like parallel aggregates and parallel joins will become available and things will really cook, but for now it's very exciting to see our first parallel queries running in PostgreSQL!