Basics on ORC file format
ORC File Basics:
================
- Columnar format: lets the reader read and decompress only the bytes (columns) it needs
- Fast
- Indexed: a reader can jump into the middle of a file
- Self-describing: includes all information about types and encoding
- Rich type system: supports a wide range of types, including timestamp and the complex types struct, map, list and union
File compatibility:
===================
- Backward compatibility
    Readers automatically detect the version of the file and read it.
- Forward compatibility
    Most changes are made so that old readers can still read new files.
    The ability to write old-format files is kept via orc.write.format
File Structure:
===============
- A file contains a list of stripes, each of which is a set of rows
    - Default stripe size is 64 MB
    - A large stripe size enables efficient reads
- Footer
    - List of stripe locations
    - Type description
    - File and stripe statistics
- Postscript
    - Compression parameters and file format version
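The layout above means a reader starts from the end of the file. As a minimal sketch, assuming the layout described in the ORC specification (the file begins with the 3-byte magic "ORC" and the very last byte holds the postscript length), the tail can be located like this. The function name inspect_orc_tail and the synthetic bytes are illustrative only; decoding the postscript itself (a protobuf message) is left out.

```python
ORC_MAGIC = b"ORC"  # assumption from the ORC spec: files start with these 3 bytes


def inspect_orc_tail(data: bytes) -> dict:
    """Locate the postscript in the raw bytes of an ORC file (hypothetical helper)."""
    if len(data) < len(ORC_MAGIC) + 1:
        raise ValueError("too short to be an ORC file")
    postscript_length = data[-1]  # the final byte stores the postscript length
    postscript_offset = len(data) - 1 - postscript_length
    if postscript_offset < 0:
        raise ValueError("postscript length exceeds file size")
    return {
        "has_header_magic": data[:3] == ORC_MAGIC,
        "postscript_length": postscript_length,
        "postscript_offset": postscript_offset,
    }


# Synthetic stand-in for a file: the placeholder bytes are NOT real ORC content.
fake_postscript = b"\x08\x01ORC"
fake_file = (
    ORC_MAGIC
    + b"<stripes><footer>"
    + fake_postscript
    + bytes([len(fake_postscript)])
)
info = inspect_orc_tail(fake_file)
print(info)
```

From the postscript a real reader learns the compression parameters and the footer length, and from the footer it learns where each stripe is.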
From Hive or Presto:
====================
1. Create table
    create table my_table (
      name string,
      address string
    ) stored as orc;
2. Import data
    insert overwrite table my_table select * from my_staging;
3. By default the compression codec is ZLIB (the same deflate algorithm gzip uses). To override it, use either of the following:
    - tblproperties("orc.compress"="NONE")
    - set hive.exec.orc.default.compress=NONE;
Using the command line:
=======================
1. Use hive --orcfiledump <path> to print information about a file
    -j -p - Pretty-prints metadata as JSON
    -d    - Prints data as JSON
2. Use java -jar orc-tools-1.4.0-uber.jar from the ORC project
    - meta - prints metadata (as JSON with -j)
    - data - prints data as JSON
    - convert - converts JSON to ORC
    - json-schema - scans a set of JSON documents to find a matching schema
Stripe size:
============
Stripe size makes a huge difference in performance
- Set with orc.stripe.size or hive.exec.orc.default.stripe.size
- Controls the amount of data buffered in the writer. Default is 64 MB
- Trade-off
    - Large stripes = larger, more efficient reads
    - Small stripes = less memory and more granular processing splits
- Stripes do not naturally align exactly with HDFS blocks. To align them, set hive.exec.orc.default.block.padding=true
  so that the writer pads stripes to block boundaries.
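The effect of block padding can be illustrated with a small model. This is a simplified sketch, not the ORC writer's actual algorithm: the function place_stripes and the 60 MB / 128 MB sizes are made up for illustration, and the rule is simply "if a stripe would cross a block boundary, pad up to the next block first".

```python
MB = 1024 * 1024


def place_stripes(stripe_sizes, block_size, pad=True):
    """Return a list of (start_offset, padding_before) for each stripe.

    Simplified model: with pad=True, a stripe that would straddle a block
    boundary is pushed to the start of the next block.
    """
    offset = 0
    layout = []
    for size in stripe_sizes:
        remaining = block_size - (offset % block_size)  # bytes left in this block
        padding = 0
        if pad and remaining < size <= block_size:
            padding = remaining  # skip the tail of the current block
        offset += padding
        layout.append((offset, padding))
        offset += size
    return layout


def crosses_boundary(start, size, block_size):
    """True if the stripe [start, start + size) spans two blocks."""
    return start // block_size != (start + size - 1) // block_size


stripes = [60 * MB] * 4   # hypothetical stripe sizes
block = 128 * MB          # hypothetical HDFS block size

padded = place_stripes(stripes, block, pad=True)
unpadded = place_stripes(stripes, block, pad=False)
for (start, padding) in padded:
    print(start // MB, "MB start,", padding // MB, "MB padding")
```

In this model the third stripe would start at 120 MB and spill into the next block; with padding it starts at 128 MB instead, at the cost of 8 MB of wasted space.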
Predicate pushdown:
======================
- The reader is given a search argument
- ORC indexes operate at 3 levels
    - File
    - Stripe
    - Row group
- Difference between ORC and Parquet: at the time of these notes (2018), Parquet has no index below its own row group, which is its equivalent of an ORC stripe,
  so the finest level at which it can do predicate pushdown corresponds to the ORC stripe level
- At the time of these notes (2018), Parquet does not support bloom filters; later versions of the Parquet format added them
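How a search argument prunes data can be sketched with min/max statistics. This is an illustrative model only: RowGroupStats and select_row_groups are hypothetical names, the statistics are invented, and the 10,000-row stride is assumed from ORC's default row index stride rather than taken from these notes. The same check applies at file and stripe level with their respective statistics.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    first_row: int   # row number where the row group starts
    min_value: int   # smallest column value in the row group
    max_value: int   # largest column value in the row group


def might_contain(stats, lo, hi):
    """True if the range [lo, hi] overlaps the row group's [min, max] range."""
    return stats.max_value >= lo and stats.min_value <= hi


def select_row_groups(index, lo, hi):
    """Return the starting rows of the row groups that must actually be read."""
    return [s.first_row for s in index if might_contain(s, lo, hi)]


# Invented index for one column, one entry per 10,000-row group (assumed stride).
index = [
    RowGroupStats(first_row=0, min_value=1, max_value=90),
    RowGroupStats(first_row=10000, min_value=95, max_value=210),
    RowGroupStats(first_row=20000, min_value=220, max_value=400),
]

# Search argument: value BETWEEN 100 AND 200
selected = select_row_groups(index, 100, 200)
print(selected)
```

Only the row group whose min/max range overlaps the predicate is read; the other two are skipped without being decompressed.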
Bloom filters: probabilistic bitmap of hash codes
- Bloom filters let the reader skip row groups that cannot contain the searched value, so fewer rows are read and a record is found in less time
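The idea of a "probabilistic bitmap of hash codes" can be sketched in a few lines. This is a toy illustration, not ORC's implementation: the BloomFilter class, the bit count, the number of hashes and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bitmap

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # If any bit is unset, the value was definitely never added.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One filter per row group: build it while writing, consult it while reading.
row_group_filter = BloomFilter()
for name in ["alice", "bob", "carol"]:
    row_group_filter.add(name)

empty_filter = BloomFilter()
print(row_group_filter.might_contain("alice"))
print(empty_filter.might_contain("alice"))
```

When might_contain returns False the reader can skip the row group entirely; when it returns True the row group still has to be read, because the answer may be a false positive.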