Skip to content

Instantly share code, notes, and snippets.

@goungoun
Created April 10, 2018 04:31
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save goungoun/946caba852187a4f4b9ac05f3f877817 to your computer and use it in GitHub Desktop.
Save goungoun/946caba852187a4f4b9ac05f3f877817 to your computer and use it in GitHub Desktop.
Embulk-GCP Test
아래 자료는 황장군님의 강의자료를 GCP에서 테스트한 결과입니다.
# ETL
- fluntd (streaming)
- embulk (batch) http://www.embulk.org/docs/
embulk를 리눅스에 설치해보자. jar를 copy 하면 됨
~~~bash
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
~~~
~~~
gounna@etl3:~$
gounna@etl3:~$
gounna@etl3:~$ curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 43.4M 100 43.4M 0 0 14.2M 0 0:00:03 0:00:03 --:--:-- 22.9M
gounna@etl3:~$ chmod +x ~/.embulk/bin/embulk
gounna@etl3:~$ echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
gounna@etl3:~$ source ~/.bashrc
gounna@etl3:~$ ll
total 32
drwxr-xr-x 5 gounna gounna 4096 Apr 10 02:34 ./
drwxr-xr-x 4 root root 4096 Apr 10 02:30 ../
-rw-r--r-- 1 gounna gounna 220 Aug 31 2015 .bash_logout
-rw-r--r-- 1 gounna gounna 3809 Apr 10 02:34 .bashrc
drwx------ 2 gounna gounna 4096 Apr 10 02:30 .cache/
drwxr-x--- 3 gounna gounna 4096 Apr 10 02:34 .embulk/
-rw-r--r-- 1 gounna gounna 655 May 16 2017 .profile
drwx------ 2 gounna gounna 4096 Apr 10 02:34 .ssh/
gounna@etl3:~$
gounna@etl3:~$
gounna@etl3:~$ embulk
/home/gounna/.embulk/bin/embulk: line 171: exec: java: not found
gounna@etl3:~$ java
The program 'java' can be found in the following packages:
* default-jre
* gcj-5-jre-headless
* openjdk-8-jre-headless
* gcj-4.8-jre-headless
* gcj-4.9-jre-headless
* openjdk-9-jre-headless
Ask your administrator to install one of them
gounna@etl3:~$
~~~
설치
~~~
sudo apt-get upgrade
sudo apt-get install default-jre
sudo apt-get install default-jdk
embulk gem install embulk-output-postgresql
~~~
설치 후 확인
~~~
gounna@etl3:~$ embulk
Embulk v0.9.5
Usage: embulk [-vm-options] <command> [--options]
Commands:
mkbundle <directory> # create a new plugin bundle environment.
bundle [directory] # update a plugin bundle environment.
run <config.yml> # run a bulk load transaction.
cleanup <config.yml> # cleanup resume state.
preview <config.yml> # dry-run the bulk load without output and show preview.
guess <partial-config.yml> -o <output.yml> # guess missing parameters to create a complete configuration file.
gem <install | list | help> # install a plugin or show installed plugins.
new <category> <name> # generates new plugin template
migrate <path> # modify plugin code to use the latest Embulk plugin API
example [path] # creates an example config file and csv file to try embulk.
selfupdate [version] # upgrades embulk to the latest released version or to the specified version.
VM options:
-E... Run an external script to configure environment variables in JVM
(Operations not just setting envs are not recommended nor guaranteed.
Expect side effects by running your external script at your own risk.)
-J-O Disable JVM optimizations to speed up startup time (enabled by default if command is 'run')
-J+O Enable JVM optimizations to speed up throughput
-J... Set JVM options (use -J-help to see available options)
-R... Set JRuby options (use -R--help to see available options)
Use `<command> --help` to see description of the commands.
~~~
~~~bash
gounna@etl3:~$ embulk example etl3
2018-04-10 02:44:22.887 +0000: Embulk v0.9.5
Creating etl3 directory...
Creating etl3/
Creating etl3/csv/
Creating etl3/csv/sample_01.csv.gz
Creating etl3/seed.yml
Run following subcommands to try embulk:
1. embulk guess etl3/seed.yml -o config.yml
2. embulk preview config.yml
3. embulk run config.yml
gounna@etl3:~$ ls
etl3
gounna@etl3:~$ cd etl3
gounna@etl3:~/etl3$ ls -al
total 16
drwxrwxr-x 3 gounna gounna 4096 Apr 10 02:44 .
drwxr-xr-x 6 gounna gounna 4096 Apr 10 02:44 ..
drwxrwxr-x 2 gounna gounna 4096 Apr 10 02:44 csv
-rw-rw-r-- 1 gounna gounna 84 Apr 10 02:44 seed.yml
~~~
작업지시를 yml에 작성하게 된다.
데이터 소스의 접속정보와 무슨 컬럼을 가져와서 타겟쪽에 어떻게 적재할 것인지
embulk가 해주는 일은 csv를 위부터 몇 줄을 읽어서 스스로 이런 형태라고 작업 지시를 반 정도 채워준다.
embulk 주요 명령어
~~~ bash
embulk guess seed.yml // CSV 데이터 형태 추정
embulk guess seed.yml -o etl3.yml // 추정한 것을 작업지시서로 생성
embulk preview etl3.yml // 실행 전에 미리 몇 줄만 보기
embulk run etl3.yml // 진짜 실행
~~~
~~~bash
gounna@etl3:~/etl3$ cat seed.yml
in:
type: file
path_prefix: '/home/gounna/etl3/csv/sample_'
out:
type: stdout
~~~
~~~bash
gounna@etl3:~/etl3$ embulk guess seed.yml
2018-04-10 02:47:43.740 +0000: Embulk v0.9.5
2018-04-10 02:47:44.420 +0000 [INFO] (main): Started Embulk v0.9.5
2018-04-10 02:47:44.474 +0000 [INFO] (0001:guess): Listing local files at directory '/home/gounna/etl3/csv' filtering filename by prefix 'sample_'
2018-04-10 02:47:44.475 +0000 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-04-10 02:47:44.479 +0000 [INFO] (0001:guess): Loading files [/home/gounna/etl3/csv/sample_01.csv.gz]
2018-04-10 02:47:44.491 +0000 [INFO] (0001:guess): Try to read 32,768 bytes from input source
2018-04-10 02:47:46.915 +0000 [INFO] (0001:guess): Gem's home and path are set by default: "/home/gounna/.embulk/lib/gems"
2018-04-10 02:47:47.833 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
2018-04-10 02:47:47.853 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
2018-04-10 02:47:47.872 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
2018-04-10 02:47:47.888 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
in:
type: file
path_prefix: /home/gounna/etl3/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
null_string: 'NULL'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
out: {type: stdout}
Use -o PATH option to write the guessed config file to a file.
~~~
~~~bash
gounna@etl3:~/etl3$ embulk preview etl3.yml
2018-04-10 02:56:46.703 +0000: Embulk v0.9.5
2018-04-10 02:56:47.378 +0000 [INFO] (main): Started Embulk v0.9.5
2018-04-10 02:56:47.436 +0000 [INFO] (0001:preview): Listing local files at directory '/home/gounna/etl3/csv' filtering filename by prefix 'sample_'
2018-04-10 02:56:47.437 +0000 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-04-10 02:56:47.440 +0000 [INFO] (0001:preview): Loading files [/home/gounna/etl3/csv/sample_01.csv.gz]
2018-04-10 02:56:47.451 +0000 [INFO] (0001:preview): Try to read 32,768 bytes from input source
+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
| 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | |
+---------+--------------+-------------------------+-------------------------+----------------------------+
gounna@etl3:~/etl3$
~~~
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment