
str_dataset = LOAD '/user/root/projects/structuredFlightDataset/part-m-00000' USING PigStorage(',') AS (uid:int,FL_DATE:chararray,OP_UNIQUE_CARRIER:chararray,ORIGIN_AIRPORT_ID:int,ORIGIN_AIRPORT_SEQ_ID:int,ORIGIN_CITY_MARKET_ID:int,ORIGIN_CITY_NAME:chararray,DEST_AIRPORT_ID:int,DEST_AIRPORT_SEQ_ID:int,DEST_CITY_MARKET_ID:int,DEST_CITY_NAME:chararray,DEP_TIME:chararray,ARR_TIME:chararray);

str_dataset_filtered = FILTER str_dataset BY ORIGIN_CITY_NAME IN ('Atlanta','Nashville','Baltimore','Dallas','Houston');

  • Load & filter the data as before

strfil_dataset_fewcols = FOREACH str_dataset_filtered GENERATE uid, OP_UNIQUE_CARRIER, ORIGIN_CITY_NAME;

  • "FOREACH" & "GENERATE" used together to select only specific columns of interest

strfilfc_dataset_grouped = GROUP strfil_dataset_fewcols BY OP_UNIQUE_CARRIER;

  • "GROUP BY" a specific value

STORE strfilfc_dataset_grouped INTO '/user/root/projects/structuredFlightDataset/output';

  • "STORE" to push the data back into HDFS
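For readers without a Pig cluster, the filter → project → group pipeline above can be sketched in plain Python (this is an illustrative analogy, not Pig internals; the sample rows are hypothetical):

```python
from collections import defaultdict

# Cities used in the FILTER ... IN (...) clause above
CITIES = {"Atlanta", "Nashville", "Baltimore", "Dallas", "Houston"}

def pipeline(rows):
    """rows: iterable of dicts keyed by the Pig schema's field names."""
    grouped = defaultdict(list)
    for row in rows:
        if row["ORIGIN_CITY_NAME"] not in CITIES:        # FILTER ... BY ... IN (...)
            continue
        record = (row["uid"], row["OP_UNIQUE_CARRIER"],  # FOREACH ... GENERATE
                  row["ORIGIN_CITY_NAME"])
        grouped[row["OP_UNIQUE_CARRIER"]].append(record) # GROUP ... BY
    return grouped

rows = [
    {"uid": 1, "OP_UNIQUE_CARRIER": "EV", "ORIGIN_CITY_NAME": "Atlanta"},
    {"uid": 2, "OP_UNIQUE_CARRIER": "EV", "ORIGIN_CITY_NAME": "Tulsa"},
    {"uid": 3, "OP_UNIQUE_CARRIER": "WN", "ORIGIN_CITY_NAME": "Dallas"},
]
result = pipeline(rows)
```

The final STORE step would then serialize each group back to HDFS, which has no direct analogue here.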
| uid | FL_DATE | OP_UNIQUE_CARRIER | ORIGIN_AIRPORT_ID | ORIGIN_AIRPORT_SEQ_ID | ORIGIN_CITY_MARKET_ID | ORIGIN_CITY_NAME | DEST_AIRPORT_ID | DEST_AIRPORT_SEQ_ID | DEST_CITY_MARKET_ID | DEST_CITY_NAME | DEP_TIME | ARR_TIME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2020-01-01 | EV | 13930 | 1393007 | 30977 | Chicago, IL | 11977 | 1197705 | 31977 | Green Bay, WI | 1003 | 1117 |
| 2 | 2020-01-01 | EV | 15370 | 1537002 | 34653 | Tulsa, OK | 13930 | 1393007 | 30977 | Chicago, IL | 1027 | 1216 |

str_dataset = LOAD '/user/root/projects/structuredFlightDataset/part-m-00000' USING PigStorage(',') AS (uid:int,FL_DATE:chararray,OP_UNIQUE_CARRIER:chararray,ORIGIN_AIRPORT_ID:int,ORIGIN_AIRPORT_SEQ_ID:int,ORIGIN_CITY_MARKET_ID:int,ORIGIN_CITY_NAME:chararray,DEST_AIRPORT_ID:int,DEST_AIRPORT_SEQ_ID:int,DEST_CITY_MARKET_ID:int,DEST_CITY_NAME:chararray,DEP_TIME:chararray,ARR_TIME:chararray);

  • Load the data as before

str_dataset_filtered = FILTER str_dataset BY ORIGIN_CITY_NAME IN ('Atlanta','Nashville','Baltimore','Dallas','Houston');

  • Filter the data using the "FILTER" clause
  • "BY" column name
  • "IN" matches against multiple values; use "==" instead if you are filtering on a single value
  • A single backslash can be used to escape a literal single quote in the data

DUMP str_dataset_filtered;

  • Display the data on the screen
PrathameshNimkar / apache_pig_demo1.md
Last active June 16, 2020 05:28
apache_pig_demo1

str_dataset = LOAD '/user/root/projects/structuredFlightDataset/part-m-00000' USING PigStorage(',') AS (uid:int,FL_DATE:chararray,OP_UNIQUE_CARRIER:chararray,ORIGIN_AIRPORT_ID:int,ORIGIN_AIRPORT_SEQ_ID:int,ORIGIN_CITY_MARKET_ID:int,ORIGIN_CITY_NAME:chararray,DEST_AIRPORT_ID:int,DEST_AIRPORT_SEQ_ID:int,DEST_CITY_MARKET_ID:int,DEST_CITY_NAME:chararray,DEP_TIME:chararray,ARR_TIME:chararray);

  • Loading the dataset using the built-in Pig function "LOAD"
  • "PigStorage" (case-sensitive) is the default load function; the entire "USING PigStorage(...)" clause can be omitted when the data is tab-delimited, since both the function and its tab delimiter are the defaults
  • AS helps to add the schema directly

DESCRIBE str_dataset;
ILLUSTRATE str_dataset;

  • Use the DESCRIBE operator to view the schema, and ILLUSTRATE to see how sample rows pass through each statement

DUMP str_dataset;

  • DUMP keyword is used to display the output on the screen
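For intuition, LOAD ... USING PigStorage(',') AS (schema) behaves roughly like reading a headerless delimited file and zipping each line against the declared field names and types. A minimal Python sketch (the schema is abbreviated to three fields here):

```python
import csv
import io

# Abbreviated (name, converter) pairs mirroring the Pig AS (...) clause
SCHEMA = [("uid", int), ("FL_DATE", str), ("OP_UNIQUE_CARRIER", str)]

def load_pigstorage(text, schema=SCHEMA, delim=","):
    """Parse delimited lines into typed dicts, like LOAD ... USING PigStorage."""
    for fields in csv.reader(io.StringIO(text), delimiter=delim):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = list(load_pigstorage("1,2020-01-01,EV\n2,2020-01-01,WN\n"))
```

DUMP then corresponds to simply printing `rows`, while DESCRIBE corresponds to printing the schema.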
PrathameshNimkar / apache_pig_optimizers.md
Last active June 7, 2020 05:09
Apache Pig Optimizers
| Optimizer | Description |
| --- | --- |
| PartitionFilterOptimizer | Pushes the filter into the loader, so data is filtered while being loaded |
| PredicatePushdownOptimizer | Same goal as the previous optimizer, but doesn't always work as expected |
| ConstantCalculator | Constant expressions are evaluated up front |
| PushUpFilter | Applies the filter immediately after the data is loaded; the change is visible in the DAG |
| MergeFilter | Merges consecutive FILTER statements into a single filter |
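The first two optimizers share one idea: apply predicates as early as possible so less data flows through the rest of the pipeline. A toy Python illustration of that idea (not Pig internals):

```python
def load_then_filter(lines, predicate):
    # Unoptimized: materialize every row, then discard most of them
    loaded = [line.split(",") for line in lines]
    return [row for row in loaded if predicate(row)]

def load_with_pushdown(lines, predicate):
    # Optimized: the predicate runs while loading itself, so downstream
    # operators never see rows that would be discarded (cf. PartitionFilterOptimizer)
    return [row for line in lines if predicate(row := line.split(","))]

lines = ["EV,Atlanta", "WN,Tulsa", "EV,Dallas"]
pred = lambda row: row[0] == "EV"
```

Both functions return the same rows; the pushdown version just avoids keeping the filtered-out ones in memory.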
PrathameshNimkar / RecordWriter_OutputFormat.md
Created May 29, 2020 16:45
RecordWriter OutputFormat
| Output Format | Description |
| --- | --- |
| Text | Writes each (k,v) pair on its own line of a text file (most commonly used) |
| SequenceFile | Writes sequence files to output; also used for intermediate Mapper output to HDFS |
| SequenceFileAsBinary | Similar to SequenceFile, but in binary format |
| Multiple | Writes to files whose names are derived from the output (k,v) pair |
| DB | Writes to SQL/NoSQL databases |

| Input Format | Description |
| --- | --- |
| KeyValueText | One (k,v) pair per line |
| Text | Key = byte offset of the line, value = the line contents (most commonly used) |
| NLine | Each input split contains a fixed number N of lines |
| MultiFile | Multiple files in one split |
| SequenceFile | Input is a Hadoop sequence file containing serialized (k,v) pairs |
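As a rough analogy, the Text output format and the KeyValueText input format are inverses: one writes a (k,v) pair per line, the other parses it back. A sketch using Hadoop's default tab separator:

```python
def write_kv_lines(pairs):
    # Like TextOutputFormat: one tab-separated "key\tvalue" pair per line
    return "".join(f"{k}\t{v}\n" for k, v in pairs)

def read_kv_lines(text):
    # Like KeyValueTextInputFormat: the first tab splits key from value
    return [tuple(line.split("\t", 1)) for line in text.splitlines()]

pairs = [("USA", "1"), ("UK", "2")]
text = write_kv_lines(pairs)
```

Round-tripping through the two functions returns the original pairs, which is exactly why these two formats are commonly chained between jobs.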
PrathameshNimkar / Data_in_blocks.md
Last active June 6, 2020 16:19
Data in blocks
| Block1 | Sales (M) | Block3 | Sales (M) |
| --- | --- | --- | --- |
| USA | 1 | UK | 1 |
| Russia | 1 | USA | 1 |
| UK | 1 | China | 1 |
| France | 1 | UK | 1 |
| China | 1 | USA | 1 |
| Russia | 1 | China | 1 |
| UK | 1 | UK | 1 |
| France | 1 | USA | 1 |
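In MapReduce terms, each block's rows can be summed locally (a combiner) before the partial counts are merged into global totals (the reduce). A minimal Python sketch over the two blocks above:

```python
from collections import Counter

block1 = ["USA", "Russia", "UK", "France", "China", "Russia", "UK", "France"]
block3 = ["UK", "USA", "China", "UK", "USA", "China", "UK", "USA"]

def combine(block):
    # Map-side partial sums: one Counter per block (the combiner step)
    return Counter(block)

def reduce_counts(partials):
    # Reduce step: merge per-block partial sums into global totals
    total = Counter()
    for partial in partials:
        total += partial
    return total

totals = reduce_counts([combine(block1), combine(block3)])
```

The combiner keeps shuffle traffic small: each block ships at most one (country, count) pair per distinct country instead of one record per sale.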
PrathameshNimkar / Tesla_Sales_Data.md
Last active June 6, 2020 16:19
Tesla Sales Data
| Country | Sales (M) |
| --- | --- |
| USA | 1 |
| Russia | 1 |
| UK | 1 |
| France | 1 |
| China | 1 |
| Russia | 1 |
| UK | 1 |
| France | 1 |