@gokulsan
Last active May 26, 2019 18:39
Apache Impala as the Open Source SQL Engine
Impala and Big Data Ecosystem
When it comes to SQL-on-Hadoop, a handful of frameworks are available. Hive and Impala are the most widely used for building a data warehouse on Hadoop.
Impala Table Partitioning
Partitioning physically divides a table's data by the distinct values of one or more frequently queried columns. Queries that filter on those columns can skip a large percentage of the table's data, which reduces I/O and speeds up overall performance.
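The pruning effect described above can be sketched with a toy model. This is not Impala code: the helper names (`build_partitions`, `scan`) and the flight rows are hypothetical, and each dictionary bucket merely stands in for one partition directory.

```python
from collections import defaultdict


def build_partitions(rows, key):
    """Group rows by the value of the partition column, one bucket per value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def scan(partitions, wanted=None):
    """Return (matching_rows, rows_read).

    When `wanted` is set, buckets for other partition values are skipped
    entirely, which is the effect partition pruning has on I/O.
    """
    rows_read = 0
    result = []
    for value, bucket in partitions.items():
        if wanted is not None and value != wanted:
            continue  # pruned: this bucket is never opened
        rows_read += len(bucket)
        result.extend(bucket)
    return result, rows_read


# Hypothetical flight data: 3 years, 4 rows per year.
rows = [{"year": y, "flight": f"F{y}-{i}"} for y in (2017, 2018, 2019) for i in range(4)]
by_year = build_partitions(rows, "year")

_, full = scan(by_year)                       # no filter on the partition column
matched, pruned = scan(by_year, wanted=2019)  # filter on the partition column
print(full, pruned)  # 12 4
```

In Impala itself the equivalent layout would come from a `PARTITIONED BY` clause in `CREATE TABLE`, with the filter expressed as a `WHERE` condition on the partition column.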
Impala Schema Design
Joins are an important part of SQL queries. Avoid correlated subqueries and inline views. Create temporary tables and use inner joins wherever possible. Generate statistics at both the table and the column level (for example, with COMPUTE STATS).
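As a rough illustration of the rewrite suggested above, the sketch below runs a correlated subquery and its "intermediate table plus inner join" equivalent and checks that they return the same rows. It uses Python's built-in SQLite as a stand-in engine, not Impala, so it only demonstrates the shape of the rewrite; the `flights` and `carrier_avg` tables are hypothetical, and SQLite's `TEMP TABLE` stands in for whatever intermediate table you would create in Impala.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flights (carrier TEXT, delay INTEGER);
    INSERT INTO flights VALUES ('AA', 10), ('AA', 30), ('UA', 5), ('UA', 45);
""")

# Correlated form: the subquery depends on the outer row (f.carrier).
correlated = conn.execute("""
    SELECT f.carrier, f.delay
    FROM flights f
    WHERE f.delay > (SELECT AVG(g.delay) FROM flights g WHERE g.carrier = f.carrier)
    ORDER BY f.carrier
""").fetchall()

# Rewritten form: compute the per-carrier average once, then inner join.
conn.execute("""
    CREATE TEMP TABLE carrier_avg AS
    SELECT carrier, AVG(delay) AS avg_delay FROM flights GROUP BY carrier
""")
joined = conn.execute("""
    SELECT f.carrier, f.delay
    FROM flights f
    INNER JOIN carrier_avg a ON a.carrier = f.carrier
    WHERE f.delay > a.avg_delay
    ORDER BY f.carrier
""").fetchall()

print(correlated == joined, joined)  # True [('AA', 30), ('UA', 45)]
```

In Impala the intermediate table would more likely be built with `CREATE TABLE ... AS SELECT` and followed by `COMPUTE STATS`, so the planner has row counts for the join.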
Impala File Format Selection
Typically, for large volumes of data, the Parquet file format performs best because it combines a columnar storage layout, large I/O request sizes, and compression and encoding.
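To see why a columnar layout helps analytic queries, here is a toy comparison of how many values each layout touches when summing a single column. It is only a model of the idea, not Parquet's actual on-disk format, and the table contents and function names are made up for illustration.

```python
# Hypothetical table of 1000 rows: (id, name, amount).
rows = [(i, f"name{i}", i * 1.5) for i in range(1000)]

# Row layout: each record is stored together.
row_store = rows

# Columnar layout: each column is stored contiguously.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def sum_amount_row(store):
    """Sum `amount` from the row layout; every field of every record is read."""
    touched = 0
    total = 0.0
    for record in store:
        touched += len(record)  # the whole record comes along
        total += record[2]
    return total, touched


def sum_amount_columnar(store):
    """Sum `amount` from the columnar layout; only that column is read."""
    column = store["amount"]
    return sum(column), len(column)


row_total, row_touched = sum_amount_row(row_store)
col_total, col_touched = sum_amount_columnar(column_store)
print(row_touched, col_touched)  # 3000 1000
```

Both layouts give the same total, but the columnar one reads a third of the values here; the gap widens as a table gains columns the query does not need.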
Impala Metadata Load Techniques
Impala caches metadata for speed. The catalog daemon (catalogd) loads this metadata from persistent stores such as the Hive Metastore, the HDFS NameNode, and Sentry. The metadata is then compressed and sent to the statestore, which broadcasts it to the dedicated coordinators. A system this complex is prone to bottlenecks, which makes it imperative to monitor the key relationships among Impala's components.
Impala Metadata Load Antipatterns
Computing incremental stats on wide partitioned tables (tables with a large number of columns)
Large number of partitions, files, or blocks
Constant and frequent REFRESH of large tables
Indiscriminate use of INVALIDATE METADATA commands
High number of concurrent DDL operations
Catalog or Statestore service restarts
High number of coordinator nodes (more than 10% of the nodes on a cluster of 150 or more nodes)
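One way to steer clear of the REFRESH and INVALIDATE METADATA antipatterns in the list above is to always issue the narrowest statement that covers the change. The sketch below is a hypothetical helper, not part of Impala: the function names and the `change` labels are assumptions made for illustration. The statements it emits follow Impala's `REFRESH table`, `REFRESH table PARTITION (...)`, and `INVALIDATE METADATA table` forms; partition-level REFRESH is only available in newer Impala releases, so check your version.

```python
def format_partition(spec):
    """Render {'year': 2019, 'month': 5} as PARTITION (year=2019, month=5).

    String values are single-quoted; no escaping is attempted in this sketch.
    """
    parts = []
    for column, value in spec.items():
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        parts.append(f"{column}={literal}")
    return "PARTITION (" + ", ".join(parts) + ")"


def metadata_statement(table, change, partition=None):
    """Pick the narrowest metadata statement for a change made outside Impala.

    `change` is a hypothetical label:
      'new_files'              -> data files added to an existing table
      'created_outside_impala' -> table created or altered through Hive
    """
    if change == "new_files" and partition:
        return f"REFRESH {table} {format_partition(partition)};"
    if change == "new_files":
        return f"REFRESH {table};"
    if change == "created_outside_impala":
        return f"INVALIDATE METADATA {table};"
    raise ValueError(f"unknown change type: {change}")


print(metadata_statement("sales", "new_files", {"year": 2019, "month": 5}))
# REFRESH sales PARTITION (year=2019, month=5);
```

Always naming the table in INVALIDATE METADATA, rather than issuing it for the whole catalog, keeps the reload limited to the one table that actually changed.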
Impala Case Study
https://www.dezyre.com/hadoop-tutorial/impala-case-study-flight-data-analysis
Competitor Landscape
Presto -
https://www.tutorialspoint.com/apache_presto/apache_presto_overview.htm
https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
Google Spanner - https://en.wikipedia.org/wiki/Spanner_(database)
Google Dremel - https://en.wikipedia.org/wiki/Dremel_(software)
Apache Drill - https://en.wikipedia.org/wiki/Apache_Drill
Apache Hive - https://en.wikipedia.org/wiki/Apache_Hive
Impala Assessment Metrics
https://www.marketscreener.com/news/Cloudera-Assessment-of-Apache-Impala-Performance-using-Cloudera-Manager-Metrics-ndash-Part-1-of--27746017/
Impala FAQ
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_faq.html
Impala Scalability
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_scalability.html
Impala Performance Benchmarking
https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_benchmarking.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_testing.html
Impala Presentations
https://cwiki.apache.org/confluence/display/IMPALA/Impala+Presentations%2C+Papers+and+Blog+Posts
Impala Resource Management
https://cwiki.apache.org/confluence/display/IMPALA/Resource+Management+Best+Practices+in+Impala
Impala Best Practices
https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
http://dwgeek.com/cloudera-impala-performance-tuning-best-practices.html/
http://hadooptutorial.info/impala-best-practices/