Skip to content

Instantly share code, notes, and snippets.

@jgoizueta
Last active December 22, 2015 15:41
Show Gist options
  • Save jgoizueta/f423be9d4084a484e520 to your computer and use it in GitHub Desktop.
Save jgoizueta/f423be9d4084a484e520 to your computer and use it in GitHub Desktop.
Overviews test data (points)

The data is a subset of the table of cities used by the Geocoder API.

First import the base table: https://drive.google.com/file/d/0B6sNSm2Xp0hnWUY1X0FqNUZxOGc/view?usp=sharing

Then create some overviews. We'll assume the detail of the base table is good for zoom level 9, so the base table will be used for all levels greater than or equal to 9. A simplistic approach will be used: for each level less than 9 a fourth of the points will be randomly sampled to maintain the same point density on screen. We'll create overviews only for 1 zoom level out of 2. (heads up: the overview tables owner should be the database user)

CREATE TABLE cities_subset_ov8 AS
  SELECT * FROM cities_subset WHERE random()<0.25;
CREATE TABLE cities_subset_ov6 AS
  SELECT * FROM cities_subset_ov8 WHERE random()<0.25*0.25;
CREATE TABLE cities_subset_ov4 AS
  SELECT * FROM cities_subset_ov6 WHERE random()<0.25*0.25;
CREATE TABLE cities_subset_ov2 AS
  SELECT * FROM cities_subset_ov4 WHERE random()<0.25*0.25;

We will use:

  • cities_subset for zoom level >= 9
  • cities_subset_ov8 for 9 > zoom level >= 7
  • cities_subset_ov6 for 7 > zoom level >= 5
  • cities_subset_ov4 for 5 > zoom level >= 3
  • cities_subset_ov2 for 3 > zoom level

Now we can create a view to combine all overviews:

CREATE VIEW cities_subset_ov AS
  SELECT * ,9 AS min_zoom FROM cities_subset
  UNION ALL
  SELECT *, 7 AS min_zoom FROM cities_subset_ov8
  UNION ALL
  SELECT *, 5 AS min_zoom FROM cities_subset_ov6
  UNION ALL
  SELECT *, 3 AS min_zoom FROM cities_subset_ov4
  UNION ALL
  SELECT *, 0 AS min_zoom FROM cities_subset_ov2;

Now we need some metadata about the overviews. We'll add a table ov_tables to the user database. We'll do something very simple at the moment and not bother about schemas, etc. (TODO: should we store some info in the metadata database?)

CREATE TABLE ov_tables(base_table text, overview_table text, min_zoom integer);
INSERT INTO ov_tables(base_table, overview_table, min_zoom)
  VALUES ('cities_subset', 'cities_subset', 9);
INSERT INTO ov_tables(base_table, overview_table, min_zoom)
  VALUES ('cities_subset', 'cities_subset_ov8', 7);
INSERT INTO ov_tables(base_table, overview_table, min_zoom)
  VALUES ('cities_subset', 'cities_subset_ov6', 5);
INSERT INTO ov_tables(base_table, overview_table, min_zoom)
  VALUES ('cities_subset', 'cities_subset_ov4', 3);
INSERT INTO ov_tables(base_table, overview_table, min_zoom)
  VALUES ('cities_subset', 'cities_subset_ov2', 0);
@jgoizueta
Copy link
Author

Using materialized views instead of tables will have any advantage?

CREATE MATERIALIZED VIEW cities_subset_ov8 AS
  SELECT * FROM cities_subset WHERE random()<0.25;
CREATE MATERIALIZED VIEW cities_subset_ov6 AS
  SELECT * FROM cities_subset_ov8 WHERE random()<0.25*0.25;
CREATE MATERIALIZED VIEW cities_subset_ov4 AS
  SELECT * FROM cities_subset_ov6 WHERE random()<0.25*0.25;
CREATE MATERIALIZED VIEW cities_subset_ov2 AS
  SELECT * FROM cities_subset_ov4 WHERE random()<0.25*0.25;

@jgoizueta
Copy link
Author

For raster overviews, the metadata is available in a view raster_overviews which obtains the information about which tables are overviews and its reduction factor from constraints that are specific to overviews.

@jgoizueta
Copy link
Author

Clustering

Experiment to generate clustered overlays using kmeans function
(note: #183)

Compile & install extension in local environment:

git clone https://github.com/umitanuki/kmeans-postgresql.git
cd kmeans-postgress
export USE_PGXS=1
make
sudo make install

Install it in a local user database:

psql -f /usr/share/postgresql/9.3/extension/kmeans.sql  -U postgres cartodb_dev_user_d34426d9-dfe6-434c-8d47-323689547836_db 

Now an overview can be generated with:

CREATE TABLE cities_subset_ovcl8 AS
  WITH num_base_points_select AS (
    SELECT count(*) AS num_base_points FROM cities_subset
  )
  SELECT
    kmeans AS cartodb_id,
    count(*) AS clustered_points,
    -- Should add aggregated columns here...
    ST_Centroid(ST_Collect(the_geom)) AS the_geom,
    ST_Centroid(ST_Collect(the_geom_webmercator)) AS the_geom_webmercator
    FROM (
      SELECT
        kmeans(
          ARRAY[ST_X(the_geom), ST_Y(the_geom)],
          (num_base_points/4)::integer) OVER (),
         -- Should add columns to be aggregated here...
          the_geom, the_geom_webmercator
      FROM cities_subset, num_base_points_select
    ) AS ksub
  GROUP BY kmeans
  ORDER BY kmeans;

But this is just too slow: for the base cities_subset table of 62136 points we have:

   5 buckets: Time:  1051.680 ms
  10 buckets: Time:  1399.452 ms
 100 buckets: Time: 10108.003 ms
1000 buckets: Time: 50754.084 ms

@pnorman
Copy link

pnorman commented Dec 14, 2015

I recommend using apt-get install pgxnclient postgresql-server-dev-9.3 and pgxn install kmeans then using CREATE EXTENSION kmeans

@jgoizueta
Copy link
Author

Gridded clustering

This is the full dataset (some 62136 points) al zoom level 2:

unclustered

Now, for comparison here's the data reduced to 100 clusters using kmeans as explained above:

kmeans

Using the same technique as in the Editor cluster wizard we can cluster using a grid. This doesn't have the efficiency problems of kmeans, but its rather ugly:

gridded

WITH meta AS (
  SELECT
    greatest(!pixel_width!,!pixel_height!) as psz,
    ext,
    ST_XMin(ext) xmin, ST_YMin(ext) ymin,
    1 as res
  FROM (
    SELECT !bbox! as ext
  ) a
),
filtered_table AS (
  SELECT t.* FROM (
    select * from cities_subset
  ) t,
  meta m
  WHERE t.the_geom_webmercator && m.ext
)
SELECT
  ST_SnapToGrid(f.the_geom_webmercator, 0, 0, m.psz * m.res, m.psz * m.res) the_geom_webmercator,
  count(*) as points_count
  FROM filtered_table f, meta m
  GROUP BY
    ST_SnapToGrid(f.the_geom_webmercator, 0, 0, m.psz * m.res, m.psz * m.res),
    m.xmin, m.ymin

We can use the centroid of each cluster instead of the grid center points to obtain a more natural look, (for data of homogeneous density
such as this one the enhancement may not be too noticeable):

gridded_centroids

WITH meta AS (
  SELECT
    greatest(!pixel_width!,!pixel_height!) as psz,
    ext,
    ST_XMin(ext) xmin, ST_YMin(ext) ymin,
    1 as res
  FROM (
    SELECT !bbox! as ext
  ) a
),
filtered_table AS (
  SELECT t.* FROM (
    select * from cities_subset
  ) t,
  meta m
  WHERE t.the_geom_webmercator && m.ext
),
clusters AS (
  SELECT
    first_value(f.cartodb_id) OVER (
      PARTITION BY
        ST_SnapToGrid(f.the_geom_webmercator, 0, 0, m.psz * m.res, m.psz * m.res)
    ) AS cartodb_id,
    the_geom_webmercator
    FROM filtered_table f, meta m
)
SELECT
  cartodb_id,
  ST_Centroid(ST_Collect(clusters.the_geom_webmercator)) AS the_geom_webmercator  
FROM clusters 
GROUP BY cartodb_id

@jgoizueta
Copy link
Author

Failed crazy idea: choose a random point in each grid cell instead of the cluster's centroid to achieve much greater performance:

random

WITH meta AS (
  SELECT
    greatest(!pixel_width!,!pixel_height!) as psz,
    ext,
    1 as res
  FROM (
    SELECT !bbox! as ext
  ) a
),
filtered_table AS (
  SELECT t.* FROM (
    select * from cities_subset
  ) t,
  meta m
  WHERE t.the_geom_webmercator && m.ext
),
clusters AS (
  SELECT
    Floor(ST_X(f.the_geom_webmercator)/(m.psz * m.res))::int AS gx,
    Floor(ST_Y(f.the_geom_webmercator)/(m.psz * m.res))::int AS gy,
    row_number() OVER () AS cartodb_id
  FROM filtered_table f, meta m
  GROUP BY gx, gy
)
SELECT
  cartodb_id, 
  ST_SetSRID(
    ST_MakePoint(
      (gx+Random())*m.psz * m.res,
      (gy+Random())*m.psz * m.res
    ), 3857
  ) AS the_geom_webmercator
FROM clusters, meta m;

The randomisation makes less obvious the gridding, but since the point locations has no relation to the original data distribution
the result is much worse than using the actual centroids...

@jgoizueta
Copy link
Author

A way to achieve efficiency by computing a single Group By can be to average coordinates instead of computing cluster centroids:

average

WITH meta AS (
  SELECT
    greatest(!pixel_width!,!pixel_height!) as psz,
    ext,
    1 as res
  FROM (
    SELECT !bbox! as ext
  ) a
),
filtered_table AS (
  SELECT t.* FROM (
    select * from cities_subset
  ) t,
  meta m
  WHERE t.the_geom_webmercator && m.ext
),
clusters AS (
  SELECT
    count(*) AS n,
    SUM(ST_X(f.the_geom_webmercator)) AS sx,
    SUM(ST_Y(f.the_geom_webmercator)) AS sy,
    Floor(ST_X(f.the_geom_webmercator)/(m.psz * m.res))::int AS gx,
    Floor(ST_Y(f.the_geom_webmercator)/(m.psz * m.res))::int AS gy,
    row_number() OVER () AS cartodb_id
  FROM filtered_table f, meta m
  GROUP BY gx, gy
)
SELECT
  cartodb_id, 
  ST_SetSRID(ST_MakePoint(sx/n, sy/n), 3857) AS the_geom_webmercator
FROM clusters, meta m;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment