Skip to content

Instantly share code, notes, and snippets.

@pramsey
Last active December 6, 2015 21:04
Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save pramsey/84e7a39d83cccae692f8 to your computer and use it in GitHub Desktop.
Save pramsey/84e7a39d83cccae692f8 to your computer and use it in GitHub Desktop.
Parallel Sequence Scan and PostGIS

Parallel PostGIS

PostgreSQL Parallel Sequence Scan was committed into the git master a couple weeks ago, and initial tests have been promising.

So far, the parallel functionality is only very limited, just sequence scans, but that's still enough to test out against basic PostGIS functions.

Setup

  • Compiled PostgreSQL from the master mirror at GitHub
  • Compiled the PostGIS from SVN trunk
  • Loaded my favourite collection of BC voting areas, then duplicated them 100 times to create a table of 800K polygons
pramsey=# select count(*) from vabig ;
 count  
--------
 816700
(1 row)

Time: 434.291 ms

Testing

ST_Area()

In order to avoid problems with functions that have side effects, the parallel query support requires that functions that can be parallelized be declared PARALLEL SAFE. So I have to re-declare some PostGIS functions.

CREATE OR REPLACE FUNCTION ST_Area(geometry)
    RETURNS FLOAT8
    AS '$libdir/postgis-2.3','area'
    LANGUAGE 'c' IMMUTABLE STRICT PARALLEL SAFE;

Then, in order to get parallel behaviour, the max_parallel_degree variable has to be set to a non-zero value. The variable controls roughly how many CPUs will be working on a given statement, and defaults to zero, disabling parallel query support.

set max_parallel_degree = 0;

Now we can run test queries, increasing the parallelism each time. Here's the baseline query:

pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vabig where st_area(geom) > 4000000000;
 count 
-------
  5500
(1 row)

Time: 2316.670 ms

I increased the max_parallel_degree and re-ran the query each time:

max_parallel_degree time
0 2316.670 ms
1 1467.550 ms
2 1316.406 ms
3 1230.150 ms
4 1242.345 ms
5 1261.117 ms

Testing on a dual-core laptop, the first surprise is that the first max_parallel_degree halves the time. Perhaps the number of workers is in addition to the parent process, so at that point both CPUs are hot. As further workers are added, the overhead of setting up the parallel process seems to swamp any gains from adding help, though things do keep getting a little better.

ST_Buffer()

Once again, in order to get parallel behaviour, re-defining the function as "safe" is the first thing to do.

CREATE OR REPLACE FUNCTION ST_Buffer(geometry,float8)
    RETURNS geometry
    AS '$libdir/postgis-2.3','buffer'
    LANGUAGE 'c' IMMUTABLE STRICT
    COST 1000 PARALLEL SAFE;

This time the test query uses buffer and area:

pramsey=# set max_parallel_degree = 0;
pramsey=# select count(gid) from vasm where st_area(st_buffer(geom,50)) > 4000000000;
 count 
-------
   550
(1 row)

Time: 13328.808 ms

The full test still shows good performance improvements:

max_parallel_degree time
0 13328.808 ms
1 8146.826 ms
2 7441.192 ms
3 7206.604 ms
4 7170.295 ms
5 7087.367 ms

This is a much heavier calculation than the area, but still shows basically the same progression, with a little surprising extra boost as more concurrency over the number of cores are added. So, using PostGIS functions in filters on sequence scans, we get pretty good gains, using all cores.

Quirks

The developers have been clear that this work is mostly about putting the framework in place, and that the real gains won't happen until other enhancements have been added, and that's easy to demonstrate.

With max_parallel_degree set to four:

pramsey=# select sum(st_area(geom)) from vabig ;
       sum       
-----------------
 106999981737570
(1 row)

Time: 7919.843 ms


pramsey=# select count(gid) from vabig where st_area(geom) > 0;
 count  
--------
 816700
(1 row)

Time: 1859.687 ms

The amount of work being done is basically the same: calculate the area of every polygon in the table. But it's only being parallelized when the work is part of the filter, not when it's part of the result.

Eventually, operations like parallel aggregates and parallel joins will become available and things will really cook, but for now it's very exciting to see our first parallel queries running in PostgreSQL!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment