Skip to content

Instantly share code, notes, and snippets.

@jamesrajendran
Created June 17, 2017 13:15
Show Gist options
  • Save jamesrajendran/3718f2a177dfc608c5dab9140f716415 to your computer and use it in GitHub Desktop.
Save jamesrajendran/3718f2a177dfc608c5dab9140f716415 to your computer and use it in GitHub Desktop.
hive tuning hints mapjoin bucketmapjoin - partition-bucket design
1.MapJoin:
small tables can be loaded in memory and joined with bigger tables.
1. use hint /*+ MAPJOIN(table_name) */
2. 'better' option - let hive do automatically by setting these properties:
hive.auto.convert.join - true
hive.mapjoin.smalltable.filesize = <> default is 25MB
2.Partition Design
Low cardinality column -eg, regiou, year
3.Bucket Design:
For high cardinality columns like itineraryIds, and joining tables with the similar Ids.
both tables be bucketed on the same column.
The number of buckets in both tables be strict multiple of each other.
1.set hive.enforce.bucketing=true
2.set hive.optimize.bucketmapjoin=true
3.use hint - /*+ MAPJOIN(a,b) */
4.BucketSampling:
Due to hashing, simiar ids will be grouped in the similar bucket ids.
use tablesample clause and pass same bucket number for both tables - only those two buckets will be sampled and they will have related rows as well.
select ... from <table1> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>)
, <table2> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>)
where ...
5.BlockSampling:
For single table analysis - no buckets
can sample with % of bytes, number of bytes, number of rows
eg: TABLESAMPLE(1 PERCENT), TABLESAMPLE(10 ROWS), TABLESAMPLE(100M)
6. Hive parallel Executeion:
Hive genrally breaks down the work into multiple MRs and run them sequentially - set hive.exec.parallel=false
But some of these MRs can run independently, they will run in parallel - set hive.exec.parallel = true
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment