Skip to content

Instantly share code, notes, and snippets.

@sanp
Last active September 27, 2017 07:45
Show Gist options
  • Save sanp/9ca620290812680723a998c5879d8c3e to your computer and use it in GitHub Desktop.
Save sanp/9ca620290812680723a998c5879d8c3e to your computer and use it in GitHub Desktop.

Joins using Where clause vs on clause

Hive and postgres handle where vs on clauses differently. Postgres' query engine is smarter: where and on clause joins will be handled the same. In Hive, where clause is more efficient than on clause.

Stats:

Hive:

On clause: In stage 1, pulls in ~400MM records; takes ~13 minutes to execute

Where clause: In stage 1, pulls in ~60MM records; takes ~5 minutes to execute

Postgres:

Both queries have the same query plan (see below), and take the same amount of time to execute.

Hive

Where Clause

explain
select
  w.auction_id
  , w.date_wid
  , e.event_id
  , e.type
  , e.sub_type
  , e.winning_price
  , w.data_cost
  , w.win_type
  , w.advertiser_id
  , w.revenue
from auctions_won w
left join auction_events e on (w.auction_id = e.auction_id)
where w.date_wid = 20170604
;
STAGE DEPENDENCIES:
2	  Stage-4 is a root stage , consists of Stage-1
3	  Stage-1
4	  Stage-0 depends on stages: Stage-1
5	
6	STAGE PLANS:
7	  Stage: Stage-4
8	    Conditional Operator
9	
10	  Stage: Stage-1
11	    Map Reduce
12	      Map Operator Tree:
13	          TableScan
14	            alias: w
15	            filterExpr: (date_wid = 20170604) (type: boolean)
16	            Statistics: Num rows: 62861427 Data size: 31430713565 Basic stats: COMPLETE Column stats: NONE
17	            Reduce Output Operator
18	              key expressions: auction_id (type: string)
19	              sort order: +
20	              Map-reduce partition columns: auction_id (type: string)
21	              Statistics: Num rows: 62861427 Data size: 31430713565 Basic stats: COMPLETE Column stats: NONE
22	              value expressions: data_cost (type: string), revenue (type: string), win_type (type: string), advertiser_id (type: string)
23	          TableScan
24	            alias: e
25	            Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
26	            Reduce Output Operator
27	              key expressions: auction_id (type: string)
28	              sort order: +
29	              Map-reduce partition columns: auction_id (type: string)
30	              Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
31	              value expressions: event_id (type: string), type (type: string), sub_type (type: string), winning_price (type: string)
32	      Reduce Operator Tree:
33	        Join Operator
34	          condition map:
35	               Left Outer Join0 to 1
36	          keys:
37	            0 auction_id (type: string)
38	            1 auction_id (type: string)
39	          outputColumnNames: _col0, _col2, _col4, _col9, _col59, _col141, _col143, _col144, _col151
40	          Statistics: Num rows: 69147571 Data size: 34573785670 Basic stats: COMPLETE Column stats: NONE
41	          Select Operator
42	            expressions: _col0 (type: string), 20170604 (type: bigint), _col141 (type: string), _col143 (type: string), _col144 (type: string), _col151 (type: string), _col2 (type: string), _col9 (type: string), _col59 (type: string), _col4 (type: string)
43	            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
44	            Statistics: Num rows: 69147571 Data size: 34573785670 Basic stats: COMPLETE Column stats: NONE
45	            File Output Operator
46	              compressed: false
47	              Statistics: Num rows: 69147571 Data size: 34573785670 Basic stats: COMPLETE Column stats: NONE
48	              table:
49	                  input format: org.apache.hadoop.mapred.TextInputFormat
50	                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
51	                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
52	
53	  Stage: Stage-0
54	    Fetch Operator
55	      limit: -1
56	      Processor Tree:
57	        ListSink
58	

On clause

explain
select
  w.auction_id
  , w.date_wid
  , e.event_id
  , e.type
  , e.sub_type
  , e.winning_price
  , w.data_cost
  , w.win_type
  , w.advertiser_id
  , w.revenue
from auctions_won w
left join auction_events e on (
  w.auction_id = e.auction_id 
  and w.date_wid = 20170604)
STAGE DEPENDENCIES:
2	  Stage-4 is a root stage , consists of Stage-1
3	  Stage-1
4	  Stage-0 depends on stages: Stage-1
5	
6	STAGE PLANS:
7	  Stage: Stage-4
8	    Conditional Operator
9	
10	  Stage: Stage-1
11	    Map Reduce
12	      Map Operator Tree:
13	          TableScan
14	            alias: w
15	            Statistics: Num rows: 488498286 Data size: 244249144664 Basic stats: COMPLETE Column stats: PARTIAL
16	            Reduce Output Operator
17	              key expressions: auction_id (type: string)
18	              sort order: +
19	              Map-reduce partition columns: auction_id (type: string)
20	              Statistics: Num rows: 488498286 Data size: 244249144664 Basic stats: COMPLETE Column stats: PARTIAL
21	              value expressions: data_cost (type: string), revenue (type: string), win_type (type: string), advertiser_id (type: string), date_wid (type: bigint)
22	          TableScan
23	            alias: e
24	            Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
25	            Reduce Output Operator
26	              key expressions: auction_id (type: string)
27	              sort order: +
28	              Map-reduce partition columns: auction_id (type: string)
29	              Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
30	              value expressions: event_id (type: string), type (type: string), sub_type (type: string), winning_price (type: string)
31	      Reduce Operator Tree:
32	        Join Operator
33	          condition map:
34	               Left Outer Join0 to 1
35	          filter predicates:
36	            0 {(VALUE._col131 = 20170604)}
37	            1 
38	          keys:
39	            0 auction_id (type: string)
40	            1 auction_id (type: string)
41	          outputColumnNames: _col0, _col2, _col4, _col9, _col59, _col132, _col141, _col143, _col144, _col151
42	          Statistics: Num rows: 537348126 Data size: 268674064953 Basic stats: COMPLETE Column stats: NONE
43	          Select Operator
44	            expressions: _col0 (type: string), _col132 (type: bigint), _col141 (type: string), _col143 (type: string), _col144 (type: string), _col151 (type: string), _col2 (type: string), _col9 (type: string), _col59 (type: string), _col4 (type: string)
45	            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
46	            Statistics: Num rows: 537348126 Data size: 268674064953 Basic stats: COMPLETE Column stats: NONE
47	            File Output Operator
48	              compressed: false
49	              Statistics: Num rows: 537348126 Data size: 268674064953 Basic stats: COMPLETE Column stats: NONE
50	              table:
51	                  input format: org.apache.hadoop.mapred.TextInputFormat
52	                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
53	                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
54	
55	  Stage: Stage-0
56	    Fetch Operator
57	      limit: -1
58	      Processor Tree:
59	        ListSink

Postgres

Where Clause

dm_production=> explain select
dm_production->   w.auction_id
dm_production->   , e.event_id
dm_production->   , e.type
dm_production-> from won_1491319843_part0 w
dm_production-> left join events_20170404 e on (w.auction_id = e.auction_id)
dm_production-> where e.type = 'WIN_EVENT'
dm_production-> ;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Hash Join  (cost=295492.13..2504549.42 rows=5303291 width=60)
   Hash Cond: (e.auction_id = w.auction_id)
   ->  Seq Scan on events_20170404 e  (cost=0.00..2014161.04 rows=5303291 width=60)
         Filter: (type = 'WIN_EVENT'::text)
   ->  Hash  (cost=272756.28..272756.28 rows=1175828 width=25)
         ->  Seq Scan on won_1491319843_part0 w  (cost=0.00..272756.28 rows=1175828 width=25)
(6 rows)

On Clause

dm_production=> explain
dm_production-> select
dm_production->   w.auction_id
dm_production->   , e.event_id
dm_production->   , e.type
dm_production-> from won_1491319843_part0 w
dm_production-> left join events_20170404 e on (
dm_production(>   w.auction_id = e.auction_id
dm_production(>   and e.type = 'WIN_EVENT')
dm_production-> ;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Hash Right Join  (cost=295492.13..2504549.42 rows=5303291 width=60)
   Hash Cond: (e.auction_id = w.auction_id)
   ->  Seq Scan on events_20170404 e  (cost=0.00..2014161.04 rows=5303291 width=60)
         Filter: (type = 'WIN_EVENT'::text)
   ->  Hash  (cost=272756.28..272756.28 rows=1175828 width=25)
         ->  Seq Scan on won_1491319843_part0 w  (cost=0.00..272756.28 rows=1175828 width=25)
(6 rows)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment