@RajaShyam
Created May 27, 2018 08:35
Basics on ORC file format
ORC File Basics:
================
- Columnar format: lets the reader read and decompress only the bytes it needs
- Fast
- Indexed - can jump into the middle of a file
- Self-describing - includes all information about types and encodings
- Rich type system - supports a wide range of types, including timestamp and complex types such as struct, map, list and union
File compatibility:
==================
- Backward compatibility
  Readers automatically detect the version of a file and read it.
- Forward compatibility
  Most changes are made so that old readers can still read new files.
  Writers keep the ability to produce old-format files via orc.write.format.
File Structure:
==============
- A file contains a list of stripes; each stripe is a set of rows
  - Default stripe size is 64 MB
  - A large stripe size enables efficient reads
- Footer
  - List of stripe locations
  - Type description
  - File and stripe statistics
- Postscript
  - Compression parameters and file format version
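
The layout above is read from the end of the file backwards: the last byte of an ORC file holds the length of the postscript, and the file starts with the magic bytes "ORC". The following is a minimal sketch of that first step only. The function name split_orc_tail is hypothetical, and decoding the postscript itself (a Protocol Buffers message) is not shown because it needs ORC's own schema definitions.

```python
ORC_MAGIC = b"ORC"  # magic bytes at the start of every ORC file


def split_orc_tail(data: bytes):
    """Return (postscript_bytes, postscript_length) for an ORC file image.

    Hypothetical helper for illustration: it only locates the postscript,
    it does not decode it.
    """
    if len(data) < len(ORC_MAGIC) + 1 or not data.startswith(ORC_MAGIC):
        raise ValueError("not an ORC file: missing 'ORC' header")

    # The final byte of the file is the postscript length (so it is < 256).
    ps_len = data[-1]
    if ps_len > len(data) - len(ORC_MAGIC) - 1:
        raise ValueError("postscript length is larger than the file body")

    # The postscript sits immediately before that final length byte.
    postscript = data[len(data) - 1 - ps_len:len(data) - 1]
    return postscript, ps_len


# Synthetic byte string standing in for a real file (not valid ORC content).
fake_file = ORC_MAGIC + b"<stripes>" + b"<footer>" + b"PSDATA" + bytes([6])
postscript, ps_len = split_orc_tail(fake_file)
```

A real reader would then parse the postscript to learn the compression settings and the footer length, and use the footer to find each stripe.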
From Hive or Presto:
====================
1. Create table
create table my_table (
  name string,
  address string
) stored as orc;
2. Import data
insert overwrite table my_table select * from my_staging;
3. The default compression is ZLIB; to override it, use either of the following
- tblproperties ("orc.compress"="NONE")
- set hive.exec.orc.default.compress=NONE;
Using commandline:
==================
1. Use hive --orcfiledump to print the contents of a file
   -j -p - pretty-prints metadata as JSON
   -d - prints data as JSON
2. Use java -jar orc-tools-1.4.0-uber.jar from the ORC project
   - meta - prints metadata (add -j for JSON)
   - data - prints data as JSON
   - convert - converts JSON to ORC
   - json-schema - scans a set of JSON documents to find a matching schema
Stripe size:
============
Stripe size makes a large difference in performance.
- Set via orc.stripe.size or hive.exec.orc.default.stripe.size
- Controls the amount of data buffered in the writer. Default is 64 MB
- Trade-off
  - Large stripes = larger, more efficient reads
  - Small stripes = less memory and more granular processing splits
- Stripes do not naturally align with HDFS block boundaries; setting hive.exec.orc.default.block.padding=true
  tells the writer to pad stripes out to block boundaries.
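
To make the block-padding point concrete, here is a simplified model of how stripes land on HDFS blocks with and without padding. It is a sketch under stated assumptions, not the actual ORC writer logic: the function layout_stripes and its parameters are hypothetical, every stripe has a known fixed size, and the real writer's padding tolerance and stripe-shrinking behaviour are ignored.

```python
MB = 1024 * 1024


def layout_stripes(stripe_sizes, block_size, pad_to_block):
    """Place stripes one after another and report how they meet block boundaries.

    Returns (start_offsets, straddling_stripes, padding_bytes).
    Simplified, hypothetical model for illustration only.
    """
    offset = 0
    starts = []
    straddling = 0
    padding = 0
    for size in stripe_sizes:
        remaining = block_size - (offset % block_size)
        # With padding on, skip to the next block if the stripe will not fit
        # in what is left of the current one (and can fit in a block at all).
        if pad_to_block and remaining < size <= block_size:
            padding += remaining
            offset += remaining
        starts.append(offset)
        first_block = offset // block_size
        last_block = (offset + size - 1) // block_size
        if last_block != first_block:
            straddling += 1  # reading this stripe touches two blocks
        offset += size
    return starts, straddling, padding


# Four 60 MB stripes written into 128 MB blocks.
stripes = [60 * MB] * 4
unpadded = layout_stripes(stripes, 128 * MB, pad_to_block=False)
padded = layout_stripes(stripes, 128 * MB, pad_to_block=True)
```

In this model the unpadded layout leaves the third stripe spanning two blocks, while the padded layout spends 8 MB of padding so that every stripe sits inside a single block.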
Predicate push down
======================
- Reader is given a search argument
- ORC indexes operate at 3 levels
  - File
  - Stripe
  - Row group
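- Difference between ORC and Parquet: at the time of these notes (May 2018), Parquet keeps no index at the level of ORC's row groups; the only level
  at which it can do predicate pushdown is its own row group, which is the equivalent of an ORC stripe
- At the time of these notes, Parquet does not support bloom filters (later Parquet releases added them)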
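
As an illustration of how a search argument prunes work, the sketch below keeps min/max statistics per row group and selects only the groups whose range can overlap the predicate. The RowGroupStats class and row_groups_to_read function are hypothetical names, and the statistics are made-up values; the same test applies equally at file and stripe level.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max of one column within one row group (illustrative values only)."""
    minimum: int
    maximum: int


def row_groups_to_read(stats, lo, hi):
    """Return indexes of row groups that may hold rows with lo <= value <= hi.

    A group is skipped only when its [minimum, maximum] range cannot
    overlap the requested range, so no matching row is ever missed.
    """
    return [
        i for i, s in enumerate(stats)
        if s.maximum >= lo and s.minimum <= hi
    ]


stats = [
    RowGroupStats(1, 100),
    RowGroupStats(101, 200),
    RowGroupStats(201, 300),
]
# Predicate: value BETWEEN 150 AND 220 -> only the last two groups qualify.
selected = row_groups_to_read(stats, 150, 220)
```

Min/max pruning works best when the data is sorted or clustered on the filtered column; on randomly ordered data most ranges overlap and few groups are skipped.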
Bloom filters: a probabilistic bitmap of hash codes
- Bloom filters let the reader skip row groups that cannot contain the value being searched for, so fewer rows are read and a record is found faster
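
The idea can be sketched in a few lines of Python. This is a toy illustration, not ORC's implementation: the BloomFilter class, its default sizes, and the use of SHA-256 as the hash function are all assumptions chosen to keep the example self-contained.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a bitmap plus several hash positions per value."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bitmap

    def _positions(self, value):
        # Derive num_hashes bit positions by hashing the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One filter per row group: a reader can skip the group when the answer is False.
row_group_filter = BloomFilter()
for name in ["alice", "bob"]:
    row_group_filter.add(name)
```

A False from might_contain is always correct, so skipping on it is safe. A True can be a false positive, in which case the reader still has to scan the row group to check.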