Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@vatshat
Forked from pierdom/hive_optimization.md
Created May 28, 2020 12:29
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save vatshat/045d44d3c25689d7f4d0797b549c36b8 to your computer and use it in GitHub Desktop.
Save vatshat/045d44d3c25689d7f4d0797b549c36b8 to your computer and use it in GitHub Desktop.
[Hive performance tuning]. Source: http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ #bigdata #hive #sysadmin

Five ways to tune Hive performance

1. Use Tez

set hive.execution.engine=tez;

2. Store tables as ORC

CREATE TABLE ORC_TABLE (
  customerID int, 
  name string, 
  age int, 
  address string
) STORED AS ORC tblproperties (“orc.compress" = “SNAPPY”);

3. Use Vectorization

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

4. Cost-based Query Optimization

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

Then collect table statistics:

analyze table tweets compute statistics;
analyze table tweets compute statistics for columns;

5. Avoid JOIN when possible

Example. A query like this:

SELECT clicks.* 
FROM clicks inner join (
    select sessionID, max(timestamp) as max_ts 
    from clicks
    group by sessionID
  ) latest ON clicks.sessionID = latest.sessionID and
clicks.timestamp = latest.max_ts;

could be re-written like this:

SELECT * 
FROM (
  SELECT *, 
        RANK() over (partition by sessionID,
                        order by timestamp desc) as rank
  FROM clicks
) ranked_clicks
WHERE ranked_clicks.rank=1;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment