gokulsan/Impala_Quick_Reference.txt

## Impala_Quick_Reference.txt
Impala and Big Data Ecosystem

When it comes to SQL-on-Hadoop, there are a handful of frameworks available on the market. Hive and Impala are the most widely used for building a data warehouse on the Hadoop framework.

Impala Table Partitioning

Partitioning physically divides an Impala table's data based on the distinct values of one or more frequently queried columns. This lets queries skip reading a large percentage of the data in a table, which reduces I/O and speeds up overall performance.
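The effect of partition pruning can be illustrated with a small Python sketch. This is not Impala code: the sample rows, the choice of year as the partition key, and the function names are assumptions made purely for illustration. The point is that a query filtering on the partition key touches only the rows stored under that key.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, month, value). Names and data are illustrative only.
ROWS = [
    (2019, 1, 10.0),
    (2019, 2, 12.5),
    (2020, 1, 7.0),
    (2020, 1, 3.0),
    (2020, 2, 9.5),
]


def partition_by_year(rows):
    """Group rows by the partition key, mimicking one storage directory per key value."""
    partitions = defaultdict(list)
    for year, month, value in rows:
        partitions[year].append((month, value))
    return dict(partitions)


def sum_for_year(partitions, year):
    """Scan only the matching partition and return (total, rows_scanned)."""
    rows = partitions.get(year, [])
    total = sum(value for _month, value in rows)
    return total, len(rows)


PARTITIONS = partition_by_year(ROWS)
TOTAL_2020, SCANNED_2020 = sum_for_year(PARTITIONS, 2020)
print(TOTAL_2020, SCANNED_2020, len(ROWS))
```

Here the filter on year 2020 scans three of the five rows. On a real table the skipped partitions are never read from disk at all.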

Impala Schema Design

Joins are an important aspect of SQL queries. Avoid correlated subqueries and inline tables. Create temporary tables and use inner joins wherever possible. Generate statistics at both the table and column level (for example, with COMPUTE STATS).

Impala File Format Selection

Typically, for large volumes of data, the Parquet file format performs best because of its combination of columnar storage layout, large I/O request size, and compression and encoding.
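A toy Python sketch can show why a columnar layout combined with encoding helps. This is not the Parquet format itself: the sample rows and helper names are assumptions for illustration. It shows that an aggregate over one column reads only that column, and that a column with repeated values shrinks under a simple run-length encoding.

```python
from itertools import groupby

# Hypothetical rows: (country, year, amount). Data is illustrative only.
ROWS = [
    ("US", 2020, 5),
    ("US", 2020, 7),
    ("US", 2021, 1),
    ("DE", 2021, 4),
]


def to_columns(rows):
    """Transpose row tuples into one list per column (a columnar layout)."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


COUNTRY, YEAR, AMOUNT = to_columns(ROWS)
TOTAL_AMOUNT = sum(AMOUNT)  # only the AMOUNT column is read
ENCODED_COUNTRY = run_length_encode(COUNTRY)
print(TOTAL_AMOUNT, ENCODED_COUNTRY)
```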

Impala Metadata Load Techniques

Impala caches metadata for speed. The caching mechanism requires CatalogD to load metadata from persistent stores such as the Hive Metastore, the NameNode, and Sentry. The metadata is then compressed and sent to the Statestore, which broadcasts it to the dedicated coordinators. Such a complex system is prone to bottlenecks, which makes it imperative to monitor the key relationships among Impala's components.

Impala Metadata Load Antipatterns

Computing incremental stats on wide partitioned tables (tables with a large number of columns)
Large number of partitions/files/blocks[2]
Constant and frequent REFRESH of large tables
Indiscriminate use of INVALIDATE METADATA commands
High number of concurrent DDL operations[3]
Catalog or Statestore service restarts
High number of coordinator nodes (> 10% of nodes on a cluster with >= 150 nodes)
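
The coordinator rule of thumb in the last item can be written as a small check. This is a minimal Python sketch under one reading of that item: on clusters of 150 or more nodes, more than 10% of nodes acting as coordinators is flagged. The function name and parameters are hypothetical and are not part of Impala or Cloudera Manager.

```python
def too_many_coordinators(coordinators, total_nodes,
                          min_cluster_size=150, max_percent=10):
    """Return True when the coordinator count exceeds the rule of thumb.

    Assumption: the rule applies only to clusters with at least
    min_cluster_size nodes. Integer arithmetic avoids float rounding.
    """
    if total_nodes < min_cluster_size:
        return False
    return coordinators * 100 > max_percent * total_nodes


print(too_many_coordinators(20, 150))
print(too_many_coordinators(15, 150))
```

With 150 nodes, 20 coordinators are flagged while exactly 15 (10%) are not.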

Impala Case Study
https://www.dezyre.com/hadoop-tutorial/impala-case-study-flight-data-analysis

Competitor Landscape
Presto -
https://www.tutorialspoint.com/apache_presto/apache_presto_overview.htm
https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)
Google Spanner - https://en.wikipedia.org/wiki/Spanner_(database)
Google Dremel - https://en.wikipedia.org/wiki/Dremel_(software)
Apache Drill - https://en.wikipedia.org/wiki/Apache_Drill
Apache Hive - https://en.wikipedia.org/wiki/Apache_Hive


Impala Assessment Metrics

https://www.marketscreener.com/news/Cloudera-Assessment-of-Apache-Impala-Performance-using-Cloudera-Manager-Metrics-ndash-Part-1-of--27746017/

Impala FAQ
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_faq.html

Impala Scalability
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_scalability.html

Impala Performance Benchmarking
https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_benchmarking.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_perf_testing.html

Impala Presentations
https://cwiki.apache.org/confluence/display/IMPALA/Impala+Presentations%2C+Papers+and+Blog+Posts

Impala Resource Management
https://cwiki.apache.org/confluence/display/IMPALA/Resource+Management+Best+Practices+in+Impala

Impala Best Practices
https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html
http://dwgeek.com/cloudera-impala-performance-tuning-best-practices.html/
http://hadooptutorial.info/impala-best-practices/