Skip to content

Instantly share code, notes, and snippets.

@goungoun
Created April 10, 2018 04:32
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save goungoun/d8e451e6177041b47bb05df29334dc3c to your computer and use it in GitHub Desktop.
Save goungoun/d8e451e6177041b47bb05df29334dc3c to your computer and use it in GitHub Desktop.
GCP with Embulk

아래 자료는 황장군님의 강의자료를 GCP에서 테스트한 결과입니다.

ETL

embulk를 리눅스에 설치해보자. jar를 copy 하면 됨

curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
gounna@etl3:~$
gounna@etl3:~$
gounna@etl3:~$ curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 43.4M  100 43.4M    0     0  14.2M      0  0:00:03  0:00:03 --:--:-- 22.9M
gounna@etl3:~$ chmod +x ~/.embulk/bin/embulk
gounna@etl3:~$ echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
gounna@etl3:~$ source ~/.bashrc
gounna@etl3:~$ ll
total 32
drwxr-xr-x 5 gounna gounna 4096 Apr 10 02:34 ./
drwxr-xr-x 4 root   root   4096 Apr 10 02:30 ../
-rw-r--r-- 1 gounna gounna  220 Aug 31  2015 .bash_logout
-rw-r--r-- 1 gounna gounna 3809 Apr 10 02:34 .bashrc
drwx------ 2 gounna gounna 4096 Apr 10 02:30 .cache/
drwxr-x--- 3 gounna gounna 4096 Apr 10 02:34 .embulk/
-rw-r--r-- 1 gounna gounna  655 May 16  2017 .profile
drwx------ 2 gounna gounna 4096 Apr 10 02:34 .ssh/
gounna@etl3:~$
gounna@etl3:~$
gounna@etl3:~$ embulk
/home/gounna/.embulk/bin/embulk: line 171: exec: java: not found
gounna@etl3:~$ java
The program 'java' can be found in the following packages:
 * default-jre
 * gcj-5-jre-headless
 * openjdk-8-jre-headless
 * gcj-4.8-jre-headless
 * gcj-4.9-jre-headless
 * openjdk-9-jre-headless
Ask your administrator to install one of them
gounna@etl3:~$

설치

sudo apt-get upgrade
sudo apt-get install default-jre
sudo apt-get install default-jdk
embulk gem install embulk-output-postgresql

설치 후 확인

gounna@etl3:~$ embulk
Embulk v0.9.5
Usage: embulk [-vm-options] <command> [--options]
Commands:
   mkbundle   <directory>                             # create a new plugin bundle environment.
   bundle     [directory]                             # update a plugin bundle environment.
   run        <config.yml>                            # run a bulk load transaction.
   cleanup    <config.yml>                            # cleanup resume state.
   preview    <config.yml>                            # dry-run the bulk load without output and show preview.
   guess      <partial-config.yml> -o <output.yml>    # guess missing parameters to create a complete configuration file.
   gem        <install | list | help>                 # install a plugin or show installed plugins.
   new        <category> <name>                       # generates new plugin template
   migrate    <path>                                  # modify plugin code to use the latest Embulk plugin API
   example    [path]                                  # creates an example config file and csv file to try embulk.
   selfupdate [version]                               # upgrades embulk to the latest released version or to the specified version.

VM options:
   -E...                            Run an external script to configure environment variables in JVM
                                    (Operations not just setting envs are not recommended nor guaranteed.
                                     Expect side effects by running your external script at your own risk.)
   -J-O                             Disable JVM optimizations to speed up startup time (enabled by default if command is 'run')
   -J+O                             Enable JVM optimizations to speed up throughput
   -J...                            Set JVM options (use -J-help to see available options)
   -R...                            Set JRuby options (use -R--help to see available options)

Use `<command> --help` to see description of the commands.
gounna@etl3:~$ embulk example etl3
2018-04-10 02:44:22.887 +0000: Embulk v0.9.5
Creating etl3 directory...
  Creating etl3/
  Creating etl3/csv/
  Creating etl3/csv/sample_01.csv.gz
  Creating etl3/seed.yml

Run following subcommands to try embulk:

   1. embulk guess etl3/seed.yml -o config.yml
   2. embulk preview config.yml
   3. embulk run config.yml

gounna@etl3:~$ ls
etl3
gounna@etl3:~$ cd etl3
gounna@etl3:~/etl3$ ls -al
total 16
drwxrwxr-x 3 gounna gounna 4096 Apr 10 02:44 .
drwxr-xr-x 6 gounna gounna 4096 Apr 10 02:44 ..
drwxrwxr-x 2 gounna gounna 4096 Apr 10 02:44 csv
-rw-rw-r-- 1 gounna gounna   84 Apr 10 02:44 seed.yml

작업지시를 yml에 작성하게 된다. 데이터 소스의 접속정보와 무슨 컬럼을 가져와서 타겟쪽에 어떻게 적재할 것인지

embulk가 해주는 일은 csv를 위부터 몇 줄을 읽어서 스스로 이런 형태라고 작업 지시를 반 정도 채워준다.

embulk 주요 명령어

embulk guess seed.yml // CSV 데이터 형태 추정
embulk guess seed.yml -o etl3.yml // 추정한 것을 작업지시서로 생성
embulk preview etl3.yml // 실행 전에 미리 몇 줄만 보기
embulk run etl3.yml // 진짜 실행
gounna@etl3:~/etl3$ cat seed.yml
in:
  type: file
  path_prefix: '/home/gounna/etl3/csv/sample_'
out:
  type: stdout
gounna@etl3:~/etl3$ embulk guess seed.yml
2018-04-10 02:47:43.740 +0000: Embulk v0.9.5
2018-04-10 02:47:44.420 +0000 [INFO] (main): Started Embulk v0.9.5
2018-04-10 02:47:44.474 +0000 [INFO] (0001:guess): Listing local files at directory '/home/gounna/etl3/csv' filtering filename by prefix 'sample_'
2018-04-10 02:47:44.475 +0000 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-04-10 02:47:44.479 +0000 [INFO] (0001:guess): Loading files [/home/gounna/etl3/csv/sample_01.csv.gz]
2018-04-10 02:47:44.491 +0000 [INFO] (0001:guess): Try to read 32,768 bytes from input source
2018-04-10 02:47:46.915 +0000 [INFO] (0001:guess): Gem's home and path are set by default: "/home/gounna/.embulk/lib/gems"
2018-04-10 02:47:47.833 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
2018-04-10 02:47:47.853 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
2018-04-10 02:47:47.872 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
2018-04-10 02:47:47.888 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.5)
in:
  type: file
  path_prefix: /home/gounna/etl3/csv/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}

Use -o PATH option to write the guessed config file to a file.
gounna@etl3:~/etl3$ embulk preview etl3.yml
2018-04-10 02:56:46.703 +0000: Embulk v0.9.5
2018-04-10 02:56:47.378 +0000 [INFO] (main): Started Embulk v0.9.5
2018-04-10 02:56:47.436 +0000 [INFO] (0001:preview): Listing local files at directory '/home/gounna/etl3/csv' filtering filename by prefix 'sample_'
2018-04-10 02:56:47.437 +0000 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-04-10 02:56:47.440 +0000 [INFO] (0001:preview): Loading files [/home/gounna/etl3/csv/sample_01.csv.gz]
2018-04-10 02:56:47.451 +0000 [INFO] (0001:preview): Try to read 32,768 bytes from input source
+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long |          time:timestamp |      purchase:timestamp |             comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
|       1 |       32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC |                     embulk |
|       2 |       14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC |               embulk jruby |
|       3 |       27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
|       4 |       11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC |                            |
+---------+--------------+-------------------------+-------------------------+----------------------------+
gounna@etl3:~/etl3$
@goungoun
Copy link
Author

embulk gem install embulk-input-postgresql
embulk preview customers.yml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment