Instantly share code, notes, and snippets.

Embed
What would you like to do?
DISTINCT ON like functionality for Redshift

distinct column -> For each row returned, return only the unique members of a set. Think of it as for each row in a projection, concatenate all the column values and return only the strings that are unique.

test_db=# SELECT DISTINCT parent_id, child_id, id FROM test.foo_table ORDER BY parent_id, child_id, id LIMIT 10;
parent_id | child_id | id
-----------+------------+-----------------------------
1000040 | 103 | 1000040|2645405726|0001|103
1000040 | 103 | 1000040|2650805748|0002|103
1000040 | 103 | 1000040|2653406206|0001|103
1000040 | 108 | 1000040|2645405726|0001|108
1000040 | 108 | 1000040|2653406206|0001|108
1000040 | 113 | 1000040|2645405726|0001|113
1000040 | 113 | 1000040|2653406206|0001|113
1000040 | 117 | 1000040|2645405726|0001|117
1000040 | 117 | 1000040|2653406206|0001|117
1000040 | 118 | 1000040|2645405726|0001|118
(10 rows)

Each row is unique. However, notice that are multiple rows for pairs of parent_id and child_id.

PostgreSQL has a nice enhancement to distinct that allows us to easily get only one row of a group of values in a particular column.

distinct on (column) -> Do a distinct, but only give me one record in a set of rows with the same field value in the named column.

test_db=# SELECT DISTINCT ON (parent_id) parent_id, child_id, id FROM test.foo_table ORDER BY parent_id, child_id, id limit 10;
parent_id | child_id | id
-----------+------------+-----------------------------
1000040 | 103 | 1000040|2645405726|0001|103
1000046 | 103 | 1000046|2664405890|0001|103
100008 | 103 | 100008|2601400960|0001|103
1000168 | 103 | 1000168|2461006072|0001|103
1000212 | 103 | 1000212|2405206458|0001|103
1000216 | 103 | 1000216|2642205628|0001|103
1000524 | 103 | 1000524|2459806672|0001|103
1000526 | 103 | 1000526|2458206280|0001|103
1000528 | 103 | 1000528|2422005896|0001|103
1000562 | 103 | 1000562|2808805598|0001|103
(10 rows)

Sadly, Redshift and other popular DBMSs do not have this enhancement.

Convert using rank():

SELECT 
        DISTINCT ON (parent_id) parent_id, 
        child_id, id 
FROM test.foo_table 
ORDER BY parent_id, child_id, id 
LIMIT 10;

to

SELECT *
FROM 
(SELECT parent_id, 
        child_id, 
        id, 
        rank() OVER (PARTITION BY parent_id ORDER BY child_id, id) AS parent_id_ranked
 FROM test.foo_table ORDER BY parent_id, child_id, id LIMIT 10
) AS ranked
WHERE ranked.parent_id_ranked = 1;
@heyalexchoi

This comment has been minimized.

Show comment
Hide comment
@heyalexchoi

heyalexchoi Dec 22, 2015

good god thank you

heyalexchoi commented Dec 22, 2015

good god thank you

@austinjalexander

This comment has been minimized.

Show comment
Hide comment
@austinjalexander

austinjalexander Jan 29, 2016

Also check out ROW_NUMBER Window Function (example).

austinjalexander commented Jan 29, 2016

Also check out ROW_NUMBER Window Function (example).

@filitchp

This comment has been minimized.

Show comment
Hide comment
@filitchp

filitchp Dec 2, 2016

thank you, this is amazing!

filitchp commented Dec 2, 2016

thank you, this is amazing!

@benoittgt

This comment has been minimized.

Show comment
Hide comment
@benoittgt

benoittgt Mar 2, 2017

I prefer to use ROW_NUMBER() rather than RANK() because is deterministic. See : http://dba.stackexchange.com/a/165945/100437

If columns in PARTITION BY and ORDER have the same value they will have the same rank (1) and will not be marked as duplicated.

I would prefer :

SELECT *
FROM
  (SELECT parent_id,
          child_id,
          id,
          ROW_NUMBER() OVER (PARTITION BY parent_id
                             ORDER BY child_id, id) AS parent_id_ranked
   FROM test.foo_table
   ORDER BY parent_id,
            child_id,
            id
   LIMIT 10) AS ranked
WHERE ranked.parent_id_ranked = 1;

benoittgt commented Mar 2, 2017

I prefer to use ROW_NUMBER() rather than RANK() because is deterministic. See : http://dba.stackexchange.com/a/165945/100437

If columns in PARTITION BY and ORDER have the same value they will have the same rank (1) and will not be marked as duplicated.

I would prefer :

SELECT *
FROM
  (SELECT parent_id,
          child_id,
          id,
          ROW_NUMBER() OVER (PARTITION BY parent_id
                             ORDER BY child_id, id) AS parent_id_ranked
   FROM test.foo_table
   ORDER BY parent_id,
            child_id,
            id
   LIMIT 10) AS ranked
WHERE ranked.parent_id_ranked = 1;
@wushugene

This comment has been minimized.

Show comment
Hide comment
@wushugene

wushugene Aug 17, 2017

@benoittgt that's the exact scenario I encountered. thanks!

wushugene commented Aug 17, 2017

@benoittgt that's the exact scenario I encountered. thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment