Skip to content

Instantly share code, notes, and snippets.

@drob
Last active August 29, 2015 13:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save drob/9180685 to your computer and use it in GitHub Desktop.
Save drob/9180685 to your computer and use it in GitHub Desktop.
-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id, is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_2(events HSTORE[]) RETURNS HSTORE[] AS $$
SELECT array_agg(event)
FROM (
-- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
SELECT event
FROM (
-- Rank elements with the same event_id by position in the array, descending.
SELECT event, row_number AS index, rank()
OVER (PARTITION BY (event -> 'event_id')::BIGINT ORDER BY row_number DESC)
FROM (
-- Use unnest instead of generate_subscripts to turn an array into a set.
SELECT event, row_number()
OVER (ORDER BY event -> 'time')
FROM unnest(events) AS event
) unnested_data
) deduped_events
WHERE rank = 1
ORDER BY index ASC
) to_agg;
$$ LANGUAGE SQL IMMUTABLE;
@dbenhur
Copy link

dbenhur commented Apr 6, 2015

I believe you can lose a layer of subselect by using first_value over the window frame instead of rank() and a subsequent filter. https://gist.github.com/dbenhur/951b7bae4e07d7998268

@drob
Copy link
Author

drob commented Apr 15, 2015

Played around with that, but it actually made the function 2x slower for test arrays of 1M elements! (Replied here as well: https://news.ycombinator.com/item?id=9328688)

I'm going to see if I can figure out why this is. This is definitely surprising.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment