@libitte
Last active December 16, 2015 12:29

# Apache Pig basic memo

## LOAD

Loads data from a file system such as HDFS. Suppose the input file (tab-separated) looks like this:

$ cat myfile.txt
1 2 3
4 2 1
8 3 4

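Loading it looks like the following. PigStorage is the default load function and splits fields on tabs by default, so the USING clause in the second form is optional: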

A = LOAD 'myfile.txt';

A = LOAD 'myfile.txt' USING PigStorage('\t');

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)

## FOREACH ... GENERATE

Transforms the data. Suppose A (with fields named f1, f2, f3) =

<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>

Then running

X = FOREACH A GENERATE f1, f2;

gives X =

<1, 2>
<4, 2>
<8, 3>
<4, 3>
<7, 2>
<8, 4>
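As a rough mental model (plain Python, not Pig), the projection above keeps the first two fields of every tuple. The list `A` mirrors the sample relation, and the helper name `project` is made up for this sketch.

```python
# Plain-Python model of: X = FOREACH A GENERATE f1, f2;
# `project` is a hypothetical helper, not part of Pig.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]


def project(relation, positions):
    """Return a new relation keeping only the fields at the given positions."""
    return [tuple(row[i] for i in positions) for row in relation]


X = project(A, [0, 1])  # f1 and f2 are the first two fields
for row in X:
    print(row)
```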

B =

<2, 4>
<8, 9>
<1, 3>
<2, 7>
<2, 9>
<4, 6>
<4, 9>

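C (each tuple holds a key, the bag of A tuples whose first field is that key, and the bag of B tuples whose first field is that key) =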

<1, {<1, 2, 3>}, {<1, 3>}>
<4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>, <4, 9>}>
<8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>

## GROUP

X = GROUP A BY f1;
X = GROUP A BY (f1, f2 ..);
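To see what shape GROUP produces, here is a small plain-Python model, assuming the sample relation A from the FOREACH section. Each output tuple is a key plus a bag (modeled as a list) of all input tuples with that key. The helper `group_by` is hypothetical, and the result is sorted only to make it readable; Pig itself does not promise an output order.

```python
# Plain-Python model of: X = GROUP A BY f1;
# `group_by` is a hypothetical helper, not part of Pig.
from collections import defaultdict

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]


def group_by(relation, key_positions):
    """Collect the tuples of `relation` into one bag per distinct key."""
    bags = defaultdict(list)
    for row in relation:
        key = tuple(row[i] for i in key_positions)
        # A single-field key is a plain value; a multi-field key is a tuple.
        bags[key[0] if len(key) == 1 else key].append(row)
    return sorted(bags.items())


X = group_by(A, [0])       # GROUP A BY f1
Y = group_by(A, [0, 1])    # GROUP A BY (f1, f2)
for key, bag in X:
    print(key, bag)
```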

## FLATTEN

An operator that changes the structure of tuples and bags in a way that a UDF cannot.

Consider a relation that has a tuple of the form `(a, (b, c))`. The expression `GENERATE $0, flatten($1)` will cause that tuple to become `(a, b, c)`.
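A minimal plain-Python sketch of the two flattening cases. The tuple case is the one described above; the bag case goes slightly beyond this memo and reflects Pig's documented behavior of emitting one output tuple per element of the bag. Both helper names are made up for illustration, and bags are modeled as lists.

```python
# Plain-Python model of FLATTEN; helper names are hypothetical.

def flatten_tuple(row, pos):
    """Splice the nested tuple at `pos` into the enclosing tuple."""
    return row[:pos] + tuple(row[pos]) + row[pos + 1:]


def flatten_bag(row, pos):
    """Emit one output tuple per tuple contained in the bag at `pos`."""
    return [row[:pos] + tuple(inner) + row[pos + 1:] for inner in row[pos]]


print(flatten_tuple(("a", ("b", "c")), 1))
print(flatten_bag(("a", [("b", "c"), ("d", "e")]), 1))
```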

## SUM

## JOIN (inner)

A = LOAD 'data1' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'data2' AS (b1:int,b2:int);

DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

Join on the first field, keeping only the keys that exist in both A and B.

X = JOIN A BY a1, B BY b1;

DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
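The same inner join can be modeled in plain Python as a hash join: index B by its first field, then pair every A tuple with the B tuples sharing its key. This is only a sketch of the semantics, not how Pig executes the join, and `inner_join` is a hypothetical helper. The row order may differ from the DUMP output above.

```python
# Plain-Python model of: X = JOIN A BY a1, B BY b1;
# `inner_join` is a hypothetical helper, not part of Pig.
from collections import defaultdict

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def inner_join(left, right, left_pos, right_pos):
    """Concatenate every pair of tuples whose join fields are equal."""
    index = defaultdict(list)
    for r in right:
        index[r[right_pos]].append(r)
    # Keys missing from `index` contribute no output rows.
    return [l + r for l in left for r in index.get(l[left_pos], [])]


X = inner_join(A, B, 0, 0)
for row in sorted(X):
    print(row)
```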

## JOIN (outer)

The Pig Latin syntax closely adheres to the SQL standard.

  • Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas.
  • Outer joins will only work for two-way joins; to perform a multi-way outer join, you will need to perform multiple two-way outer join statements.

For example, a left outer join looks like this:

A = LOAD 'a.txt' AS (n:chararray, a:int); 
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A by $0 LEFT OUTER, B BY $0;
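A plain-Python sketch of the left outer join semantics, using made-up sample rows (the contents of `a.txt` and `b.txt` are not shown in this memo). The `right_width` parameter hints at why the schema requirement above exists: to pad an unmatched row, the number of null fields on the right side has to be known. `left_outer_join` is a hypothetical helper.

```python
# Plain-Python model of: C = JOIN A BY $0 LEFT OUTER, B BY $0;
# Sample rows and the helper name are assumptions for illustration.
A = [("x", 1), ("y", 2), ("z", 3)]
B = [("x", "foo"), ("x", "bar"), ("z", "baz")]


def left_outer_join(left, right, left_pos, right_pos, right_width):
    """Keep every left tuple; pad with None when no right tuple matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[right_pos] == l[left_pos]]
        if not matches:
            # No match: the right-hand fields become nulls.
            matches = [(None,) * right_width]
        out.extend(l + r for r in matches)
    return out


C = left_outer_join(A, B, 0, 0, right_width=2)
for row in C:
    print(row)
```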

## FILTER

Filters tuples by a condition.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

Extract only the tuples whose third field is 3.

X = FILTER A BY a3 == 3;

DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
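In plain Python terms, this filter is just a predicate applied to each tuple. The sketch below reuses the sample relation and tests the third field (index 2).

```python
# Plain-Python model of filtering on the third field being equal to 3.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Keep only the tuples whose third field equals 3.
X = [row for row in A if row[2] == 3]
for row in X:
    print(row)
```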

## STORE

Saves a relation to the file system.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

STORE A INTO 'myoutput' USING PigStorage ('*');

CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3

## UNION

Merges the contents of two or more relations.

### Schema behavior

If the relations have different numbers of columns, the result has a null schema.

A: (a1:long, a2:long) 
B: (b1:long, b2:long, b3:long) 
A union B: null 

If the column types are incompatible, the resulting column becomes bytearray, as for a1 in the example below.

A: (a1:long, a2:long) 
B: (b1:(b11:long, b12:long), b2:long) 
A union B: (a1:bytearray, a2:long) 

Union columns of compatible type will produce an "escalate" type. The priority is:

  • double > float > long > int > bytearray
  • tuple|bag|map|chararray > bytearray

A: (a1:int, a2:bytearray, a3:int) 
B: (b1:float, b2:chararray, b3:bytearray) 
A union B: (a1:float, a2:chararray, a3:int) 
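The three schema rules above can be condensed into a small plain-Python model. This is a simplified sketch of the rules as stated in this memo, not Pig's actual implementation: types are plain strings, complex types such as `tuple` are treated as opaque labels, and the function names are hypothetical.

```python
# Simplified model of UNION schema merging; names are hypothetical.
NUMERIC_RANK = {"int": 1, "long": 2, "float": 3, "double": 4}


def merge_type(t1, t2):
    """Merge two column types following the priority rules listed above."""
    if t1 == t2:
        return t1
    if t1 == "bytearray":       # bytearray yields to any other type
        return t2
    if t2 == "bytearray":
        return t1
    if t1 in NUMERIC_RANK and t2 in NUMERIC_RANK:
        return t1 if NUMERIC_RANK[t1] > NUMERIC_RANK[t2] else t2
    return "bytearray"          # incompatible types


def union_schema(a, b):
    """Return the merged schema, or None when the column counts differ."""
    if len(a) != len(b):
        return None
    # Column names are taken from the first relation.
    return [(name, merge_type(t1, t2)) for (name, t1), (_, t2) in zip(a, b)]


A = [("a1", "int"), ("a2", "bytearray"), ("a3", "int")]
B = [("b1", "float"), ("b2", "chararray"), ("b3", "bytearray")]
print(union_schema(A, B))
```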

### Example 1

In this example the union of relation A and B is computed.

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)

B = LOAD 'data' AS (b1:int,b2:int);

DUMP B;
(2,4)
(8,9)
(1,3)

X = UNION A, B;

DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)

### Example 2

This example shows the use of ONSCHEMA.

L1 = LOAD 'f1' AS (a : int, b : float);
DUMP L1;
(11,12.0)
(21,22.0)

L2 = LOAD 'f1' AS (a : long, c : chararray);
DUMP L2;
(11,a)
(12,b)
(13,c)

U = UNION ONSCHEMA L1, L2;
DESCRIBE U ;
U : {a : long, b : float, c : chararray}

DUMP U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)
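A plain-Python sketch of what ONSCHEMA does with the rows: columns are matched by name rather than by position, and fields missing from a relation are filled with nulls (None here). Type escalation (such as `a` becoming long) is not modeled, relations are represented as a pair of column names and rows, and `union_onschema` is a hypothetical helper.

```python
# Plain-Python model of: U = UNION ONSCHEMA L1, L2;
# The (names, rows) representation and helper name are assumptions.
L1 = (["a", "b"], [(11, 12.0), (21, 22.0)])
L2 = (["a", "c"], [(11, "a"), (12, "b"), (13, "c")])


def union_onschema(*relations):
    """Union relations by column name, padding missing fields with None."""
    columns = []
    for names, _ in relations:
        for name in names:
            if name not in columns:
                columns.append(name)
    rows = []
    for names, data in relations:
        for row in data:
            by_name = dict(zip(names, row))
            rows.append(tuple(by_name.get(c) for c in columns))
    return columns, rows


columns, U = union_onschema(L1, L2)
print(columns)
for row in U:
    print(row)
```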

refs. http://pig.apache.org/docs/r0.10.0/basic.html
