
str_dataset = LOAD '/user/root/projects/structuredFlightDataset/part-m-00000' USING PigStorage(',') AS (uid:int,FL_DATE:chararray,OP_UNIQUE_CARRIER:chararray,ORIGIN_AIRPORT_ID:int,ORIGIN_AIRPORT_SEQ_ID:int,ORIGIN_CITY_MARKET_ID:int,ORIGIN_CITY_NAME:chararray,DEST_AIRPORT_ID:int,DEST_AIRPORT_SEQ_ID:int,DEST_CITY_MARKET_ID:int,DEST_CITY_NAME:chararray,DEP_TIME:chararray,ARR_TIME:chararray);

str_dataset_filtered = FILTER str_dataset BY ORIGIN_CITY_NAME IN ('Atlanta','Nashville','Baltimore','Dallas','Houston');

  • Load & filter the data as before

strfil_dataset_fewcols = FOREACH str_dataset_filtered GENERATE uid, OP_UNIQUE_CARRIER, ORIGIN_CITY_NAME;

  • "FOREACH" & "GENERATE" used together to select only specific columns of interest

strfilfc_dataset_grouped = GROUP strfil_dataset_fewcols BY OP_UNIQUE_CARRIER;

  • "GROUP BY" a specific value

STORE strfilfc_dataset_grouped INTO '/user/root/projects/structuredFlightDataset/output';

  • "STORE" to push the data back into HDFS
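For readers without a Pig cluster, the filter → project → group pipeline above can be sketched in plain Python (this is an illustrative analogy, not Pig internals; the sample rows are hypothetical):

```python
from collections import defaultdict

# Cities used in the FILTER ... IN (...) clause above
CITIES = {"Atlanta", "Nashville", "Baltimore", "Dallas", "Houston"}

def pipeline(rows):
    """rows: iterable of dicts keyed by the Pig schema's field names."""
    grouped = defaultdict(list)
    for row in rows:
        if row["ORIGIN_CITY_NAME"] not in CITIES:        # FILTER ... BY ... IN (...)
            continue
        record = (row["uid"], row["OP_UNIQUE_CARRIER"],  # FOREACH ... GENERATE
                  row["ORIGIN_CITY_NAME"])
        grouped[row["OP_UNIQUE_CARRIER"]].append(record) # GROUP ... BY
    return grouped

rows = [
    {"uid": 1, "OP_UNIQUE_CARRIER": "EV", "ORIGIN_CITY_NAME": "Atlanta"},
    {"uid": 2, "OP_UNIQUE_CARRIER": "EV", "ORIGIN_CITY_NAME": "Tulsa"},
    {"uid": 3, "OP_UNIQUE_CARRIER": "WN", "ORIGIN_CITY_NAME": "Dallas"},
]
result = pipeline(rows)
```

The final STORE step would then serialize each group back to HDFS, which has no direct analogue here.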
| uid | FL_DATE | OP_UNIQUE_CARRIER | ORIGIN_AIRPORT_ID | ORIGIN_AIRPORT_SEQ_ID | ORIGIN_CITY_MARKET_ID | ORIGIN_CITY_NAME | DEST_AIRPORT_ID | DEST_AIRPORT_SEQ_ID | DEST_CITY_MARKET_ID | DEST_CITY_NAME | DEP_TIME | ARR_TIME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2020-01-01 | EV | 13930 | 1393007 | 30977 | Chicago, IL | 11977 | 1197705 | 31977 | Green Bay, WI | 1003 | 1117 |
| 2 | 2020-01-01 | EV | 15370 | 1537002 | 34653 | Tulsa, OK | 13930 | 1393007 | 30977 | Chicago, IL | 1027 | 1216 |

str_dataset = LOAD '/user/root/projects/structuredFlightDataset/part-m-00000' USING PigStorage(',') AS (uid:int,FL_DATE:chararray,OP_UNIQUE_CARRIER:chararray,ORIGIN_AIRPORT_ID:int,ORIGIN_AIRPORT_SEQ_ID:int,ORIGIN_CITY_MARKET_ID:int,ORIGIN_CITY_NAME:chararray,DEST_AIRPORT_ID:int,DEST_AIRPORT_SEQ_ID:int,DEST_CITY_MARKET_ID:int,DEST_CITY_NAME:chararray,DEP_TIME:chararray,ARR_TIME:chararray);

  • Load the data as before

str_dataset_filtered = FILTER str_dataset BY ORIGIN_CITY_NAME IN ('Atlanta','Nashville','Baltimore','Dallas','Houston');

  • Filter the data using the "FILTER" clause
  • "BY" column name
  • "IN" matches against multiple values; use "==" instead if you are filtering on a single value
  • A single backslash can be used to escape a literal single quote in the data

DUMP str_dataset_filtered;

  • Display the data on the screen
PrathameshNimkar / apache_pig_demo1.md
Last active June 16, 2020 05:28
apache_pig_demo1

str_dataset = LOAD '/user/root/projects/structuredFlightDataset/part-m-00000' USING PigStorage(',') AS (uid:int,FL_DATE:chararray,OP_UNIQUE_CARRIER:chararray,ORIGIN_AIRPORT_ID:int,ORIGIN_AIRPORT_SEQ_ID:int,ORIGIN_CITY_MARKET_ID:int,ORIGIN_CITY_NAME:chararray,DEST_AIRPORT_ID:int,DEST_AIRPORT_SEQ_ID:int,DEST_CITY_MARKET_ID:int,DEST_CITY_NAME:chararray,DEP_TIME:chararray,ARR_TIME:chararray);

  • Loading the dataset using the built-in Pig function "LOAD"
  • "PigStorage" (case-sensitive) is the default load function; the entire "USING PigStorage(...)" clause can be omitted when the data is tab-delimited, since both the function and its tab delimiter are the defaults
  • AS helps to add the schema directly

DESCRIBE str_dataset;
ILLUSTRATE str_dataset;

  • Use the DESCRIBE operator to view the schema, and ILLUSTRATE to see how sample rows pass through each statement

DUMP str_dataset;

  • DUMP keyword is used to display the output on the screen
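For intuition, LOAD ... USING PigStorage(',') AS (schema) behaves roughly like reading a headerless delimited file and zipping each line against the declared field names and types. A minimal Python sketch (the schema is abbreviated to three fields here):

```python
import csv
import io

# Abbreviated (name, converter) pairs mirroring the Pig AS (...) clause
SCHEMA = [("uid", int), ("FL_DATE", str), ("OP_UNIQUE_CARRIER", str)]

def load_pigstorage(text, schema=SCHEMA, delim=","):
    """Parse delimited lines into typed dicts, like LOAD ... USING PigStorage."""
    for fields in csv.reader(io.StringIO(text), delimiter=delim):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = list(load_pigstorage("1,2020-01-01,EV\n2,2020-01-01,WN\n"))
```

DUMP then corresponds to simply printing `rows`, while DESCRIBE corresponds to printing the schema.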
PrathameshNimkar / apache_pig_optimizers.md
Last active June 7, 2020 05:09
Apache Pig Optimizers
| Optimizer | Description |
| --- | --- |
| PartitionFilterOptimizer | Pushes the filter into the loader, so data is filtered while being loaded |
| PredicatePushdownOptimizer | Same goal as the previous optimizer, but doesn't always work as expected |
| ConstantCalculator | Constant expressions are evaluated up front |
| PushUpFilter | Applies the filter immediately after the data is loaded; the change is visible in the DAG |
| MergeFilter | Merges consecutive FILTER statements into a single filter |
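The first two optimizers share one idea: apply predicates as early as possible so less data flows through the rest of the pipeline. A toy Python illustration of that idea (not Pig internals):

```python
def load_then_filter(lines, predicate):
    # Unoptimized: materialize every row, then discard most of them
    loaded = [line.split(",") for line in lines]
    return [row for row in loaded if predicate(row)]

def load_with_pushdown(lines, predicate):
    # Optimized: the predicate runs while loading itself, so downstream
    # operators never see rows that would be discarded (cf. PartitionFilterOptimizer)
    return [row for line in lines if predicate(row := line.split(","))]

lines = ["EV,Atlanta", "WN,Tulsa", "EV,Dallas"]
pred = lambda row: row[0] == "EV"
```

Both functions return the same rows; the pushdown version just avoids keeping the filtered-out ones in memory.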
PrathameshNimkar / RecordWriter_OutputFormat.md
Created May 29, 2020 16:45
RecordWriter OutputFormat
| Output Format | Description |
| --- | --- |
| Text | Writes each (k,v) pair on its own line of a text file (most commonly used) |
| SequenceFile | Writes sequence files to output; also used for intermediate Mapper output to HDFS |
| SequenceFileAsBinary | Similar to SequenceFile, but in binary format |
| Multiple | Writes to files whose names are derived from the output (k,v) pair |
| DB | Writes to SQL/NoSQL databases |

| Input Format | Description |
| --- | --- |
| KeyValueText | One (k,v) pair per line |
| Text | Key = byte offset of the line, value = the line contents (most commonly used) |
| NLine | Each input split contains a fixed number N of lines |
| MultiFile | Multiple files in one split |
| SequenceFile | Input is a Hadoop sequence file containing serialized (k,v) pairs |
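As a rough analogy, the Text output format and the KeyValueText input format are inverses: one writes a (k,v) pair per line, the other parses it back. A sketch using Hadoop's default tab separator:

```python
def write_kv_lines(pairs):
    # Like TextOutputFormat: one tab-separated "key\tvalue" pair per line
    return "".join(f"{k}\t{v}\n" for k, v in pairs)

def read_kv_lines(text):
    # Like KeyValueTextInputFormat: the first tab splits key from value
    return [tuple(line.split("\t", 1)) for line in text.splitlines()]

pairs = [("USA", "1"), ("UK", "2")]
text = write_kv_lines(pairs)
```

Round-tripping through the two functions returns the original pairs, which is exactly why these two formats are commonly chained between jobs.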
PrathameshNimkar / Data_in_blocks.md
Last active June 6, 2020 16:19
Data in blocks
| Block1 | Sales (M) | Block3 | Sales (M) |
| --- | --- | --- | --- |
| USA | 1 | UK | 1 |
| Russia | 1 | USA | 1 |
| UK | 1 | China | 1 |
| France | 1 | UK | 1 |
| China | 1 | USA | 1 |
| Russia | 1 | China | 1 |
| UK | 1 | UK | 1 |
| France | 1 | USA | 1 |
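In MapReduce terms, each block's rows can be summed locally (a combiner) before the partial counts are merged into global totals (the reduce). A minimal Python sketch over the two blocks above:

```python
from collections import Counter

block1 = ["USA", "Russia", "UK", "France", "China", "Russia", "UK", "France"]
block3 = ["UK", "USA", "China", "UK", "USA", "China", "UK", "USA"]

def combine(block):
    # Map-side partial sums: one Counter per block (the combiner step)
    return Counter(block)

def reduce_counts(partials):
    # Reduce step: merge per-block partial sums into global totals
    total = Counter()
    for partial in partials:
        total += partial
    return total

totals = reduce_counts([combine(block1), combine(block3)])
```

The combiner keeps shuffle traffic small: each block ships at most one (country, count) pair per distinct country instead of one record per sale.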
PrathameshNimkar / Tesla_Sales_Data.md
Last active June 6, 2020 16:19
Tesla Sales Data
| Country | Sales (M) |
| --- | --- |
| USA | 1 |
| Russia | 1 |
| UK | 1 |
| France | 1 |
| China | 1 |
| Russia | 1 |
| UK | 1 |
| France | 1 |