libitte/apach_pig_basic_memo.md

## apach_pig_basic_memo.md

      
#Apache Pig basic memo
##LOAD
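Loads data from a file system such as HDFS.
Given a file like the one below (fields are tab-separated, which is the default delimiter of PigStorage),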
$ cat myfile.txt
1 2 3
4 2 1
8 3 4

LOAD works as follows. The two LOAD statements are equivalent, because PigStorage splits on tabs by default.
A = LOAD 'myfile.txt';

A = LOAD 'myfile.txt' USING PigStorage('\t');

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
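A schema can also be attached at load time with an AS clause, so that later statements can use field names instead of positions. The sketch below assumes the field names f1, f2, f3 (chosen to match the examples later in this memo) and the same tab-separated myfile.txt.

```pig
-- Assumed field names f1, f2, f3; input is the tab-separated myfile.txt above.
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

-- Print the schema of A to confirm the declared names and types.
DESCRIBE A;

-- Without a schema, fields are still reachable by position: $0, $1, $2.
B = LOAD 'myfile.txt';
C = FOREACH B GENERATE $0, $2;
DUMP C;
-- (1,3)
-- (4,1)
-- (8,4)
```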

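##FOREACH ... GENERATE
Transforms the data, for example by projecting columns.
Suppose relation A has the fields f1, f2, f3 and contains:
A =
<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>

Then the statement
X = FOREACH A GENERATE f1, f2;

produces:
X =
<1, 2>
<4, 2>
<8, 3>
<4, 3>
<7, 2>
<8, 4>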
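FOREACH ... GENERATE can also compute new fields from existing ones. A minimal sketch, assuming A is loaded with the schema (f1:int, f2:int, f3:int); the path 'data' and the alias `total` are illustrative names only.

```pig
-- 'data' is a placeholder path holding the six tuples of A shown above.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- Keep f1 and add a computed column; 'total' is an illustrative alias.
Y = FOREACH A GENERATE f1, f2 + f3 AS total;
DUMP Y;
-- Expected tuples (output order may vary):
-- (1,5)
-- (4,3)
-- (8,7)
-- (4,6)
-- (7,7)
-- (8,7)
```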
Two more sample relations, B (flat) and C (with nested bags):
B =
<2, 4>
<8, 9>
<1, 3>
<2, 7>
<2, 9>
<4, 6>
<4, 9>

C =
<1, {<1, 2, 3>}, {<1, 3>}>
<4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>, <4, 9>}>
<8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>

##GROUP
Groups tuples that share the same key.
X = GROUP A BY f1;
-- group by a composite key
X = GROUP A BY (f1, f2);
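To make the shape of a GROUP result concrete, here is a sketch using the six sample tuples of A. The path 'data' and the schema names are assumptions made for this example.

```pig
-- 'data' is a placeholder path; A holds the six sample tuples used in this memo.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- Each output tuple has two fields: 'group' (the key) and a bag
-- named after the input relation (A) holding the matching tuples.
X = GROUP A BY f1;
DUMP X;
-- Expected groups (tuple order inside a bag is not guaranteed):
-- (1,{(1,2,3)})
-- (4,{(4,2,1),(4,3,3)})
-- (7,{(7,2,5)})
-- (8,{(8,3,4),(8,4,3)})
```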
##FLATTEN
An operator that changes the structure of tuples and bags in a way that a UDF cannot.
Consider a relation that has a tuple of the form (a, (b, c)). The expression
GENERATE $0, FLATTEN($1)

will cause that tuple to become
(a, b, c)
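FLATTEN also applies to bags, where it emits one output tuple per tuple in the bag. The sketch below un-nests a GROUP result; the path 'data' and the field names are assumptions for illustration.

```pig
-- 'data' is a placeholder path; A holds the six sample tuples used in this memo.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
G = GROUP A BY f1;

-- Flattening the bag A produces one row per grouped tuple,
-- prefixed with the group key.
X = FOREACH G GENERATE group, FLATTEN(A);
DUMP X;
-- Expected tuples (output order may vary):
-- (1,1,2,3)
-- (4,4,2,1)
-- (4,4,3,3)
-- (7,7,2,5)
-- (8,8,3,4)
-- (8,8,4,3)
```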
##SUM
Computes the sum of the numeric values in a single-column bag.
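SUM is typically applied to a bag produced by GROUP. A minimal sketch, assuming the six sample tuples of A; the path 'data', the field names, and the alias `total` are illustrative.

```pig
-- 'data' is a placeholder path; A holds the six sample tuples used in this memo.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
G = GROUP A BY f1;

-- A.f3 is the bag of f3 values inside each group.
S = FOREACH G GENERATE group, SUM(A.f3) AS total;
DUMP S;
-- Expected tuples (output order may vary):
-- (1,3)
-- (4,4)
-- (7,5)
-- (8,7)
```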

##JOIN(inner)

A = LOAD 'data1' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'data2' AS (b1:int,b2:int);

DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

Join on the first field, keeping only the keys that exist in both A and B.
X = JOIN A BY a1, B BY b1;

DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
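After a join, fields can be referenced with the relation name as a prefix (`A::a1`), which matters when both inputs share a field name. A sketch continuing the example above, with the alias Y introduced only for illustration:

```pig
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
B = LOAD 'data2' AS (b1:int,b2:int);
X = JOIN A BY a1, B BY b1;

-- Project a subset of the joined fields, using relation-qualified names.
Y = FOREACH X GENERATE A::a1, A::a2, B::b2;
DUMP Y;
-- Expected tuples (output order may vary):
-- (1,2,3)
-- (4,2,6)
-- (4,3,6)
-- (4,2,9)
-- (4,3,9)
-- (8,3,9)
-- (8,4,9)
```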

##JOIN(outer)

The Pig Latin syntax closely adheres to the SQL standard.

Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas.
Outer joins will only work for two-way joins; to perform a multi-way outer join, you will need to perform multiple two-way outer join statements.
For example, a left outer join looks like this.

A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 LEFT OUTER, B BY $0;
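To show what the left outer join returns, here is a sketch with small hypothetical input files. The file contents are invented for this example and are not from the original memo.

```pig
-- Hypothetical tab-separated inputs:
--   a.txt: alice 1      b.txt: alice x
--          bob   2
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);

-- Every tuple of A is kept; fields from B are null when no key matches.
C = JOIN A BY $0 LEFT OUTER, B BY $0;
DUMP C;
-- (alice,1,alice,x)
-- (bob,2,,)

-- RIGHT OUTER and FULL OUTER use the same form.
D = JOIN A BY $0 RIGHT OUTER, B BY $0;
E = JOIN A BY $0 FULL OUTER, B BY $0;
```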

##FILTER
Filters tuples by a condition.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

Extract only the tuples whose third field is 3.
X = FILTER A BY a3 == 3;

DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
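Conditions can be combined with AND, OR, and NOT. A sketch on the same relation A; the alias Y is illustrative.

```pig
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

-- Keep tuples where a1 is 8, or where a2 + a3 is not greater than a1.
Y = FILTER A BY (a1 == 8) OR (NOT (a2 + a3 > a1));
DUMP Y;
-- Expected tuples (output order may vary):
-- (4,2,1)
-- (8,3,4)
-- (7,2,5)
-- (8,4,3)
```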

##STORE
Saves results to the file system.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

STORE A INTO 'myoutput' USING PigStorage ('*');

CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
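To read the stored result back, the same delimiter has to be passed to PigStorage on LOAD. A minimal sketch, assuming the 'myoutput' location written above; the alias B is illustrative.

```pig
-- Load the stored output, splitting fields on the '*' delimiter used by STORE.
B = LOAD 'myoutput' USING PigStorage('*') AS (a1:int, a2:int, a3:int);
DUMP B;
-- Expected tuples (output order may vary):
-- (1,2,3)
-- (4,2,1)
-- (8,3,4)
-- (4,3,3)
-- (7,2,5)
-- (8,4,3)
```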

##UNION
Merges the contents of two or more relations.
###Schema behavior
If the relations have different numbers of fields, the result has a null schema.
A: (a1:long, a2:long) 
B: (b1:long, b2:long, b3:long) 
A union B: null 

If the column types are incompatible, the resulting column becomes bytearray, as a1 does below.
A: (a1:long, a2:long) 
B: (b1:(b11:long, b12:long), b2:long) 
A union B: (a1:bytearray, a2:long) 

Union columns of compatible types produce an "escalated" type. The priority is:

double > float > long > int > bytearray
tuple|bag|map|chararray > bytearray

A: (a1:int, a2:bytearray, a3:int) 
B: (b1:float, b2:chararray, b3:bytearray) 
A union B: (a1:float, a2:chararray, a3:int) 
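One way to avoid relying on these rules is to cast the columns explicitly before the union, so that both inputs already share a schema. A sketch under assumptions: the paths 'data_a' and 'data_b' and the alias A2 are hypothetical.

```pig
-- 'data_a' and 'data_b' are placeholder paths.
A = LOAD 'data_a' AS (a1:int, a2:int);
B = LOAD 'data_b' AS (b1:float, b2:int);

-- Cast a1 to float so the first column has the same type in both inputs.
A2 = FOREACH A GENERATE (float)a1 AS a1, a2;

X = UNION A2, B;
-- Inspect the resulting schema.
DESCRIBE X;
```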

###Example1
In this example the union of relation A and B is computed.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)

B = LOAD 'data' AS (b1:int,b2:int);

DUMP B;
(2,4)
(8,9)
(1,3)

X = UNION A, B;

DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)

###Example2
This example shows the use of ONSCHEMA.
L1 = LOAD 'f1' AS (a : int, b : float);
DUMP L1;
(11,12.0)
(21,22.0)

L2 = LOAD 'f1' AS (a : long, c : chararray);
DUMP L2;
(11,a)
(12,b)
(13,c)

U = UNION ONSCHEMA L1, L2;
DESCRIBE U ;
U : {a : long, b : float, c : chararray}

DUMP U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)

refs.
http://pig.apache.org/docs/r0.10.0/basic.html