@btbytes
Created May 2, 2016 02:35
Apache Pig Perf comparison for FILTER
REGISTER /home/pradeep/jars/piggybank-0.12.0.jar;
%DECLARE infile '/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_A.csv';
%DECLARE outfile '/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv';
RMF $outfile;
data = LOAD '$infile'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
AS(exchange:chararray,stock_symbol:chararray, date:chararray, stock_price_open:float,stock_price_high:float,stock_price_low:float, stock_price_close:float, stock_volume:float,stock_price_adj_close:float);
-- data = FILTER data by stock_symbol IS NOT NULL ;
-- data = FILTER data by stock_symbol == 'AEA';
-- data = FILTER data by stock_price_open > 10.00;
data = FILTER data by stock_symbol IS NOT NULL AND stock_symbol == 'AEA' AND stock_price_open > 10.00;
STORE data INTO '$outfile' USING PigStorage(',');

Apache Pig FILTER performance difference

With only loading

data = LOAD '/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_A.csv' USING PigStorage('\n') AS (line:chararray);

Time:

2016-05-01 21:57:23,872 [main] INFO  org.apache.pig.Main - Pig script completed in 2 seconds and 311 milliseconds (2311 ms)

With loading, a single IS NOT NULL filter, and saving

REGISTER /home/pradeep/jars/piggybank-0.12.0.jar;

%DECLARE infile '/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_A.csv';
%DECLARE outfile '/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv';

RMF $outfile;

data = LOAD '$infile'
	USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
	AS(exchange:chararray,stock_symbol:chararray, date:chararray, stock_price_open:float,stock_price_high:float,stock_price_low:float, stock_price_close:float, stock_volume:float,stock_price_adj_close:float);
data =  FILTER data by stock_symbol IS NOT NULL;
STORE data INTO  '$outfile' USING PigStorage(',');

Results

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
2.6.0   0.15.0  pradeep 2016-05-01 22:20:47     2016-05-01 22:20:56     FILTER

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTime      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
job_local1391761639_0001        2       0       n/a     n/a     n/a     n/a     0       0       0       0       data    MAP_ONLY        /home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv,

Input(s):
Successfully read 735026 records from: "/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_A.csv"

Output(s):
Successfully stored 735026 records in: "/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv"

Counters:
Total records written : 735026
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local1391761639_0001

2016-05-01 22:20:56,471 [main] INFO  org.apache.pig.Main - Pig script completed in 11 seconds and 578 milliseconds (11578 ms)

With filters applied one after the other

Code

data = FILTER data by stock_symbol IS NOT NULL;
data = FILTER data by stock_symbol == 'AEA';
data = FILTER data by stock_price_open > 10.00;

Output

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
2.6.0   0.15.0  pradeep 2016-05-01 22:28:49     2016-05-01 22:28:56     FILTER

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTime      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
job_local932867549_0001 2       0       n/a     n/a     n/a     n/a     0       0       0       0       data    MAP_ONLY        /home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv,

Input(s):
Successfully read 735026 records from: "/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_A.csv"

Output(s):
Successfully stored 718 records in: "/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv"

Counters:
Total records written : 718
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local932867549_0001


2016-05-01 22:28:56,363 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 636 milliseconds (9636 ms)
2016-05-01 22:29:46,324 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 484 milliseconds (9484 ms)
2016-05-01 22:30:18,272 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 303 milliseconds (9303 ms)
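
The three chained FILTER statements keep a row only if it passes every predicate, which is the same condition as one FILTER joining the predicates with AND. The sketch below illustrates that equivalence in Python rather than Pig. The sample rows and helper names are hypothetical and are not taken from the NYSE file; only the three predicates mirror the script.

```python
# Hypothetical sample rows: (stock_symbol, stock_price_open).
# These values are made up for illustration only.
rows = [
    ("AEA", 12.5),
    ("AEA", 9.75),
    ("AB", 40.0),
    (None, 15.0),
    ("AEA", 10.0),
    ("AEA", 10.01),
]


def symbol_not_null(row):
    return row[0] is not None


def symbol_is_aea(row):
    return row[0] == "AEA"


def opens_above_ten(row):
    # A Pig FILTER drops rows whose condition evaluates to null,
    # so a missing price is treated as "does not pass" here.
    return row[1] is not None and row[1] > 10.00


# Filters applied one after the other, as in the chained script.
chained = [r for r in rows if symbol_not_null(r)]
chained = [r for r in chained if symbol_is_aea(r)]
chained = [r for r in chained if opens_above_ten(r)]

# One filter with the predicates joined by AND.
combined = [
    r for r in rows
    if symbol_not_null(r) and symbol_is_aea(r) and opens_above_ten(r)
]

print(chained == combined)
print(combined)
```

Both forms select the same rows, which matches the 718 stored records reported by both Pig runs in this gist.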

With AND filter

Code

data = FILTER data by stock_symbol IS NOT NULL AND stock_symbol == 'AEA' AND stock_price_open > 10.00;

Output

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
2.6.0   0.15.0  pradeep 2016-05-01 22:31:29     2016-05-01 22:31:35     FILTER

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTime      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
job_local1828249770_0001        2       0       n/a     n/a     n/a     n/a     0       0       0       0       data    MAP_ONLY        /home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv,

Input(s):
Successfully read 735026 records from: "/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_A.csv"

Output(s):
Successfully stored 718 records in: "/home/pradeep/data/infochimps_dataset_4778_download_16677/NYSE/tmp.csv"

Counters:
Total records written : 718
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local1828249770_0001

2016-05-01 22:31:35,558 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 381 milliseconds (9381 ms)
2016-05-01 22:33:03,054 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 491 milliseconds (9491 ms)
2016-05-01 22:33:25,357 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 437 milliseconds (9437 ms)
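
To compare the two variants, the completion times can be read out of the log lines above and averaged. This is a minimal sketch: the log text is copied from this gist, while the helper name elapsed_ms and the choice of a plain mean over three runs are assumptions made for illustration.

```python
import re
from statistics import mean

# Completion lines copied from the runs above.
chained_log = """\
2016-05-01 22:28:56,363 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 636 milliseconds (9636 ms)
2016-05-01 22:29:46,324 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 484 milliseconds (9484 ms)
2016-05-01 22:30:18,272 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 303 milliseconds (9303 ms)
"""

and_log = """\
2016-05-01 22:31:35,558 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 381 milliseconds (9381 ms)
2016-05-01 22:33:03,054 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 491 milliseconds (9491 ms)
2016-05-01 22:33:25,357 [main] INFO  org.apache.pig.Main - Pig script completed in 9 seconds and 437 milliseconds (9437 ms)
"""


def elapsed_ms(log_text):
    """Return the '(NNNN ms)' totals found in Pig completion lines."""
    return [int(ms) for ms in re.findall(r"\((\d+) ms\)", log_text)]


chained = elapsed_ms(chained_log)
combined = elapsed_ms(and_log)

chained_mean = mean(chained)
combined_mean = mean(combined)
difference = chained_mean - combined_mean

print(f"chained filters: mean {chained_mean:.1f} ms, spread {max(chained) - min(chained)} ms")
print(f"single AND:      mean {combined_mean:.1f} ms, spread {max(combined) - min(combined)} ms")
print(f"difference:      {difference:.1f} ms ({100 * difference / chained_mean:.2f}%)")
```

The means differ by 38 ms, which is smaller than the 333 ms spread among the chained runs alone, so three runs per variant on a local job do not show a meaningful difference between the two forms.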