## Orc_basics
ORC File Basics:
================
- Columnar format: lets a reader fetch and decompress only the bytes (columns) it needs
- Fast
- Indexed - can jump into the middle of a file
- Self-describing - includes all information about types and encodings
- Rich type system - supports a wide range of types, including timestamp and the complex types struct, map, list and union
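
To make the columnar point concrete, here is a minimal Python sketch. It is a toy model, not the real ORC encoding: the sample rows, the JSON serialization and the `read_column` helper are all assumptions made for illustration. Each column is compressed on its own, so reading one column never decompresses the others.

```python
import json
import zlib

# Hypothetical sample data; real ORC files use typed, run-length-encoded streams.
rows = [
    {"name": "alice", "address": "1 Main St", "age": 31},
    {"name": "bob", "address": "2 Oak Ave", "age": 45},
    {"name": "carol", "address": "3 Pine Rd", "age": 27},
]

# Columnar layout: gather each column's values and compress them independently.
columns = {name: [row[name] for row in rows] for name in rows[0]}
compressed = {
    name: zlib.compress(json.dumps(values).encode("utf-8"))
    for name, values in columns.items()
}


def read_column(name):
    """Decompress and decode a single column, leaving the others untouched."""
    return json.loads(zlib.decompress(compressed[name]).decode("utf-8"))


print(read_column("age"))  # [31, 45, 27]
```

A query that only needs `age` pays the decompression cost for that column alone, which is the saving the first bullet describes.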

File compatibility:
==================
- Backward compatibility
  Readers automatically detect the version of a file and read it.
- Forward compatibility
  Most format changes are made so that old readers can still read new files.
  Writers keep the ability to produce old-format files via orc.write.format

File Structure:
==============
- Files contain a list of stripes, each of which is a set of rows
- Default stripe size is 64 MB
- Large stripes enable efficient reads
- Footer
  contains list of stripe locations
  Type description
  File and stripe statistics
- Postscript
  compression params and file format version
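
The layout above means a reader starts at the end of the file and works backwards. The sketch below is a simplified model of that idea, assuming JSON in place of the Protocol Buffers encoding that real ORC files use; `write_toy_file`, `read_stripe` and the field names are hypothetical. The one detail taken from the ORC format itself is that the final byte holds the postscript length.

```python
import io
import json


def write_toy_file(stripes):
    """Write stripes, then a footer, a postscript, and a 1-byte postscript length."""
    buf = io.BytesIO()
    stripe_info = []
    for rows in stripes:
        data = json.dumps(rows).encode("utf-8")
        # The footer records where each stripe lives so readers can seek to it.
        stripe_info.append({"offset": buf.tell(), "length": len(data), "rows": len(rows)})
        buf.write(data)
    footer = json.dumps({"stripes": stripe_info, "schema": "struct<name:string>"}).encode("utf-8")
    buf.write(footer)
    postscript = json.dumps(
        {"footer_length": len(footer), "compression": "NONE", "version": "toy-1"}
    ).encode("utf-8")
    assert len(postscript) < 256  # the length must fit in the final byte
    buf.write(postscript)
    buf.write(bytes([len(postscript)]))
    return buf.getvalue()


def read_stripe(blob, index):
    """Locate one stripe by reading the postscript, then the footer, then the stripe."""
    ps_len = blob[-1]
    postscript = json.loads(blob[-1 - ps_len:-1])
    footer_end = len(blob) - 1 - ps_len
    footer = json.loads(blob[footer_end - postscript["footer_length"]:footer_end])
    info = footer["stripes"][index]
    return json.loads(blob[info["offset"]:info["offset"] + info["length"]])


blob = write_toy_file([[{"name": "a"}], [{"name": "b"}, {"name": "c"}]])
print(read_stripe(blob, 1))  # [{'name': 'b'}, {'name': 'c'}]
```

Because the stripe locations sit in the footer, the reader can jump straight to the second stripe without scanning the first.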

From Hive or Presto:
====================
1. Create table
  create table my_table (
      name string,
      address string
  ) stored as orc;
2. Import data
  insert overwrite table my_table select * from my_staging;
3. The default compression is ZLIB; to override it, use either of the following
 - tblproperties("orc.compress"="NONE") on the create table statement
 - set hive.exec.orc.default.compress=NONE

Using commandline:
==================
1. Use hive --orcfiledump to print the contents of a file
 - -j -p - pretty-prints metadata as JSON
 - -d - prints data as JSON
2. Use java -jar orc-tools-1.4.0-uber.jar from the ORC project
 - meta - prints metadata (add -j for JSON)
 - data - prints data as JSON
 - convert - converts JSON to ORC
 - json-schema - scans a set of JSON documents to find a matching schema

Stripe size:
============
Stripe size makes a huge difference in performance
- Set via orc.stripe.size or hive.exec.orc.default.stripe.size
- Controls the amount of data buffered in the writer. Default is 64 MB
- Trade-off
  - Large stripes = larger, more efficient reads
  - Small stripes = less memory and more granular processing splits
- Stripes do not align exactly with HDFS blocks; set hive.exec.orc.default.block.padding=true to pad stripes to block boundaries.
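
The trade-off can be put into rough numbers. The following is a back-of-the-envelope sketch, not how the ORC writer actually decides: `stripe_count` and `stripes_per_block` are hypothetical helpers, and the padding figure assumes stripes never straddle a block, which is a simplification.

```python
import math

MB = 1024 * 1024


def stripe_count(data_bytes, stripe_bytes=64 * MB):
    """Approximate number of stripes needed for data_bytes of encoded data."""
    return math.ceil(data_bytes / stripe_bytes)


def stripes_per_block(block_bytes, stripe_bytes=64 * MB):
    """Whole stripes that fit in one HDFS block, and the bytes left over for padding."""
    whole = block_bytes // stripe_bytes
    padding = block_bytes - whole * stripe_bytes
    return whole, padding


print(stripe_count(1024 * MB))            # 16 stripes at the 64 MB default
print(stripe_count(1024 * MB, 256 * MB))  # 4 stripes at 256 MB
print(stripes_per_block(128 * MB, 48 * MB)[0])  # 2 whole stripes per 128 MB block
```

Fewer, larger stripes mean fewer seeks per file but fewer independent splits; a stripe size that does not divide the block size leaves a remainder that padding would fill.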

Predicate push down
======================
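  - The reader is given a search argument (the filter predicate)
  - ORC indexes operate at 3 levels
    - File
    - Stripe
    - Row group
  - Difference between ORC and Parquet: Parquet originally kept no index below its own row group, which is the counterpart of an ORC stripe, so predicate push down could only skip data at that level. Newer Parquet releases add page-level column indexes.
  - Parquet originally did not support Bloom filters; newer releases do.
    Bloom filter: a probabilistic bitmap of hash codes that reports a value as either definitely absent or possibly present
  - Bloom filters let the reader skip row groups that cannot contain the value, so fewer rows are read and a record is found faster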
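
The row-group level of predicate push down can be sketched in a few lines of Python. This is an illustrative model only: the `BloomFilter` class, `build_index`, `find_equal`, the SHA-256 based hashing and the 4-row groups are assumptions chosen for readability (ORC's default row group is 10,000 rows and its hash functions differ).

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bitmap

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(values, rows_per_group=4):
    """Split values into row groups and record min, max and a Bloom filter for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        bloom = BloomFilter()
        for value in chunk:
            bloom.add(value)
        groups.append({"start": start, "min": min(chunk), "max": max(chunk),
                       "bloom": bloom, "rows": chunk})
    return groups


def find_equal(groups, target):
    """Return (matching row numbers, number of row groups actually read)."""
    hits, groups_read = [], 0
    for group in groups:
        if not (group["min"] <= target <= group["max"]):
            continue  # min/max statistics rule this group out
        if not group["bloom"].might_contain(target):
            continue  # the Bloom filter says the value is definitely absent
        groups_read += 1
        hits.extend(group["start"] + i
                    for i, value in enumerate(group["rows"]) if value == target)
    return hits, groups_read


groups = build_index([1, 3, 5, 7, 10, 40, 70, 100, 200, 201, 202, 203])
print(find_equal(groups, 40))   # ([5], 1): only the middle group is read
print(find_equal(groups, 500))  # ([], 0): min/max alone skips every group
```

A lookup for a value such as 50 falls inside the middle group's min/max range, so only the Bloom filter can rule that group out; this is the case where Bloom filters save reads that statistics alone cannot.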