Skip to content

Instantly share code, notes, and snippets.

@andykram
Last active August 29, 2015 13:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save andykram/9201077 to your computer and use it in GitHub Desktop.
Save andykram/9201077 to your computer and use it in GitHub Desktop.

Assuming:

  • table1 and table2 both have a date partition with a column name of d
  • table2 references table1 by the column t1_id
  • table1 and table2 are stored in HDFS as RCFile

The following correctly produces NULL (empty) values:

WITH
    t1 AS (SELECT t1.id FROM table1 t1 WHERE t1.d='2014-02-24'),
    t2 AS (SELECT t2.t1_id, t2.id FROM table2 t2 WHERE t2.d='2014-02-24')
SELECT
    t1.id,
    t2.t1_id,
    t2.id
FROM t1
LEFT OUTER JOIN t2 ON t1.id = t2.t1_id

However, the following incorrectly produces no NULL (empty) values where there are no table2 with table2.t1_id = table1.id:

SELECT
    t1.id,
    t2.t1_id,
    t2.id
FROM t1
LEFT OUTER JOIN t2 ON t1.id = t2.t1_id
WHERE
    t1.d = '2014-02-24' AND
    t2.d = '2014-02-24'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment