Dan Young danoyoung

<snip>
....
....
{
:name=>'step3',
:script_bootstrap_action => {:path=>'s3n://elasticmapreduce/bootstrap-actions/run-if',
:args=>['instance.isMaster=false','s3n://my_coolio_bucket/bootstrap-actions/copy_to_slave_nodes.sh']}
},
....
....
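The `run-if` bootstrap action above runs the given script only when a condition on the node's metadata holds; here `instance.isMaster=false` restricts the copy script to non-master nodes. The following is a minimal Python sketch of that kind of check, not the actual `run-if` implementation. The function name `condition_matches` and the in-memory `metadata` dictionary are assumptions made for illustration.

```python
import json


def condition_matches(condition: str, metadata: dict) -> bool:
    """Evaluate a 'source.key=value' condition such as 'instance.isMaster=false'.

    `metadata` is a hypothetical stand-in for the node's metadata files,
    keyed by source name (for example "instance").
    """
    lhs, expected = condition.split("=", 1)
    source, key = lhs.split(".", 1)
    actual = metadata[source].get(key)
    # Compare against the JSON spelling so the boolean False matches "false",
    # and fall back to the plain string form for string values.
    return json.dumps(actual) == expected or str(actual) == expected


# Hypothetical metadata for a non-master node.
slave_metadata = {"instance": {"isMaster": False}}
if condition_matches("instance.isMaster=false", slave_metadata):
    print("would run copy_to_slave_nodes.sh on this node")
```

On a master node the same condition evaluates to false, so the script is skipped there.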
grunt> set io.sort.mb 150;
grunt> /*
grunt> set mapred.reduce.tasks 1;
grunt> gets all the people for a franchise.
grunt> rm avro/franchise_people;
grunt> */
grunt> franchise_people = LOAD 'hdfs://127.0.0.1:9000/user/hadoop/indexer/avro/franchise_people' using org.apache.pig.piggybank.storage.avro.AvroStorage();
grunt>
grunt> a = FILTER franchise_people BY (role_type == 'cast') OR (role_type == 'crew');
grunt> b = GROUP a BY (franchise_id);
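The `FILTER` and `GROUP` statements above keep only cast and crew rows and then collect them into one bag per `franchise_id`. A rough Python sketch of the same semantics follows. The sample rows, the franchise ids, and the `extra` role are made up for illustration; only the field names come from the Pig statements.

```python
from collections import defaultdict

# Hypothetical rows; the values are illustrative, not taken from the real data set.
rows = [
    {"franchise_id": 1, "role_type": "cast", "full_name": "Cliff Nazarro"},
    {"franchise_id": 1, "role_type": "crew", "full_name": "Russell Hayden"},
    {"franchise_id": 2, "role_type": "cast", "full_name": "Inez Cooper"},
    {"franchise_id": 2, "role_type": "extra", "full_name": "Someone Else"},
]


def filter_and_group(records):
    """Mimic `FILTER ... BY role_type` followed by `GROUP ... BY franchise_id`."""
    # FILTER: keep rows whose role_type is 'cast' or 'crew'.
    kept = [r for r in records if r["role_type"] in ("cast", "crew")]
    # GROUP: one entry per franchise_id, holding the full matching rows,
    # much as Pig pairs each group key with a bag of whole tuples.
    groups = defaultdict(list)
    for r in kept:
        groups[r["franchise_id"]].append(r)
    return dict(groups)


grouped = filter_and_group(rows)
for franchise_id, members in grouped.items():
    print(franchise_id, [m["full_name"] for m in members])
```

As in Pig, each group holds the complete rows, so a later projection step is still needed to reduce each bag to names only.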
1. The raw input looks like this:
4302653 df0cfc4f187e6f6258fbe732ed2cbcf5 42199 152 44390 cast Actor 3 Cliff Nazarro 2010-04-28 03:51:25 2010-04-28 03:51:25
4302654 df0cfc4f187e6f6258fbe732ed2cbcf5 42199 153 541 cast Actor 1 Russell Hayden 2010-04-28 03:51:25 2010-04-28 03:51:25
4302655 df0cfc4f187e6f6258fbe732ed2cbcf5 42199 154 46074 cast Actor 2 Inez Cooper 2010-04-28 03:51:25 2010-04-28 03:51:25
2. The raw data is then converted and stored in an Avro file with the following Pig script:
set io.sort.mb 150;
set mapred.reduce.tasks 0;
danoyoung / gist:2191363
Created March 25, 2012 04:29
Pig-AvroStorage
Apache Pig version 0.11.0-SNAPSHOT (r1304979) compiled Mar 24 2012, 21:48:44
Run my Pig script to get my bag of tuples.....
....
....
....
grunt> describe c;
c: {franchise_id: int,cast_and_crew: {(full_name: chararray)}}
grunt> illustrate c;
SELECT
COUNT(*) AS click_count,
SUM(c.total_cost_to_advertiser) AS total_cost_to_advertiser,
SUM(c.optimizer_bid_price) AS optimizer_bid_price,
SUM(c.optimizer_pending_earnings) AS optimizer_pending_earnings,
SUM(c.optimizer_paid_amount) AS optimizer_paid_amount,
SUM(c.market_rake_amount) AS market_rake_amount,
SUM(c.advertiser_refund) AS advertiser_refund,
SUM(c.ad_network_cost) AS ad_network_cost,
SUM(c.ad_network_refund) AS ad_network_refund,
mysql> explain SELECT COUNT(*) AS click_count, SUM(c.total_cost_to_advertiser) AS total_cost_to_advertiser, SUM(c.optimizer_bid_price) AS optimizer_bid_price, SUM(c.optimizer_pending_earnings) AS optimizer_pending_earnings, SUM(c.optimizer_paid_amount) AS optimizer_paid_amount, SUM(c.market_rake_amount) AS market_rake_amount, SUM(c.advertiser_refund) AS advertiser_refund, SUM(c.ad_network_cost) AS ad_network_cost, SUM(c.ad_network_refund) AS ad_network_refund, c.campaign_group_id,c.optimizer_id,c.ad_network_id FROM click_registers c INNER JOIN mirror_daily_ad_network_optimizer_campaign_groups ON ( c.campaign_group_id
curl -XPOST 'http://localhost:9200/sizonet/_search?pretty=true' -d '
{
"query" : {
"has_child" : {
"type" : "ice",
"query" : {
"term" : {
"ice.shorefast.observation" : "thickening"
}
}