@hiroyuki-sato
Last active May 22, 2017 07:39
embulk-filter-timestamp_format: behavior of the Java and JRuby parsers

Comparison of the Java parser (joda-time) and the JRuby parser (Date._strptime) in embulk-filter-timestamp_format

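  • Investigated what happens when a timestamp value carries trailing garbage, such as the "aaa" in "2015-01-27 19:23:49 aaa".
  • Java parser (joda-time): parsing fails. A "failed to parse string" warning is logged and the value is left empty.
  • JRuby parser (Date._strptime): parses everything up to just before "aaa" and produces a timestamp.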

Sample data

id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49 aaa,20150127,embulk
2,14824,2015-01-27 19:01:23 bbb,20150127,embulk jruby
3,27559,2015-01-28 02:20:02 ccc,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36 ddd,20150129,NULL

With joda-time

filters:
  - type: timestamp_format
    default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss"]
    columns:
      - {name: time, type: timestamp }

Preview output

2017-05-22 16:04:20.309 +0900: Embulk v0.8.22
2017-05-22 16:04:21.311 +0900 [INFO] (0001:preview): Loaded plugin embulk-filter-timestamp_format (0.2.4)
2017-05-22 16:04:21.331 +0900 [INFO] (0001:preview): Listing local files at directory '/private/tmp/sample/csv' filtering filename by prefix 'sample_'
2017-05-22 16:04:21.332 +0900 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-05-22 16:04:21.334 +0900 [INFO] (0001:preview): Loading files [/private/tmp/sample/csv/sample_01.csv]
2017-05-22 16:04:21.343 +0900 [INFO] (0001:preview): Try to read 32,768 bytes from input source
2017-05-22 16:04:21.438 +0900 [WARN] (0001:preview): failed to parse string: "2015-01-27 19:23:49 aaa"
2017-05-22 16:04:21.439 +0900 [WARN] (0001:preview): failed to parse string: "2015-01-27 19:01:23 bbb"
2017-05-22 16:04:21.439 +0900 [WARN] (0001:preview): failed to parse string: "2015-01-28 02:20:02 ccc"
2017-05-22 16:04:21.439 +0900 [WARN] (0001:preview): failed to parse string: "2015-01-29 11:54:36 ddd"
+---------+--------------+----------------+-------------------------+----------------------------+
| id:long | account:long | time:timestamp |      purchase:timestamp |             comment:string |
+---------+--------------+----------------+-------------------------+----------------------------+
|       1 |       32,864 |                | 2015-01-27 00:00:00 UTC |                     embulk |
|       2 |       14,824 |                | 2015-01-27 00:00:00 UTC |               embulk jruby |
|       3 |       27,559 |                | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
|       4 |       11,270 |                | 2015-01-29 00:00:00 UTC |                            |
+---------+--------------+----------------+-------------------------+----------------------------+

With JRuby

filters:
  - type: timestamp_format
    default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S"]
    columns:
      - {name: time, type: timestamp }

Preview output

2017-05-22 16:28:12.262 +0900: Embulk v0.8.22
2017-05-22 16:28:13.293 +0900 [INFO] (0001:preview): Loaded plugin embulk-filter-timestamp_format (0.2.4)
2017-05-22 16:28:13.312 +0900 [INFO] (0001:preview): Listing local files at directory '/private/tmp/sample/csv' filtering filename by prefix 'sample_'
2017-05-22 16:28:13.312 +0900 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-05-22 16:28:13.315 +0900 [INFO] (0001:preview): Loading files [/private/tmp/sample/csv/sample_01.csv]
2017-05-22 16:28:13.324 +0900 [INFO] (0001:preview): Try to read 32,768 bytes from input source
+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long |          time:timestamp |      purchase:timestamp |             comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
|       1 |       32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC |                     embulk |
|       2 |       14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC |               embulk jruby |
|       3 |       27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
|       4 |       11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC |                            |
+---------+--------------+-------------------------+-------------------------+----------------------------+
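
The time column above can be reproduced in plain Ruby. The following is a minimal sketch, not the plugin's actual code: lenient_utc is a hypothetical helper, and it assumes the format supplies all six date and time fields and that values are interpreted as UTC.

```ruby
require 'date'

FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical helper (not part of Embulk): build a UTC Time from whatever
# Date._strptime matched, ignoring any text left after the match.
def lenient_utc(text, format = FORMAT)
  parts = Date._strptime(text, format)
  return nil if parts.nil? # the format did not match at all
  Time.utc(parts[:year], parts[:mon], parts[:mday],
           parts[:hour], parts[:min], parts[:sec])
end

samples = [
  "2015-01-27 19:23:49 aaa",
  "2015-01-27 19:01:23 bbb",
  "2015-01-28 02:20:02 ccc",
  "2015-01-29 11:54:36 ddd",
]

samples.each { |s| puts "#{s.inspect} -> #{lenient_utc(s)}" }
# "2015-01-27 19:23:49 aaa" -> 2015-01-27 19:23:49 UTC
# "2015-01-27 19:01:23 bbb" -> 2015-01-27 19:01:23 UTC
# "2015-01-28 02:20:02 ccc" -> 2015-01-28 02:20:02 UTC
# "2015-01-29 11:54:36 ddd" -> 2015-01-29 11:54:36 UTC
```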

JRuby behavior

Characters that could not be parsed are stored under the :leftover key.

Date._strptime("2017/05/22 23:12:34 hogehogehoge","%Y/%m/%d %H:%M:%S")
=> {:year=>2017, :mon=>5, :mday=>22, :hour=>23, :min=>12, :sec=>34, :leftover=>" hogehogehoge"}

https://github.com/embulk/embulk/blob/master/lib/embulk/java/time_helper.rb#L27
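
If trailing text should be rejected, as it is with the joda-time parser, the :leftover key can be checked explicitly. A minimal sketch under that assumption: strict_strptime is a hypothetical helper for illustration and is not part of Embulk or the plugin.

```ruby
require 'date'

# Hypothetical helper: like Date._strptime, but raises when the format
# does not match or when unparsed text remains after the match.
def strict_strptime(text, format)
  parts = Date._strptime(text, format)
  if parts.nil?
    raise ArgumentError, "format #{format.inspect} does not match #{text.inspect}"
  end
  if parts.key?(:leftover)
    raise ArgumentError, "unparsed trailing text #{parts[:leftover].inspect} in #{text.inspect}"
  end
  parts
end

format = "%Y/%m/%d %H:%M:%S"

# Fully consumed input: the result has no :leftover key.
clean = strict_strptime("2017/05/22 23:12:34", format)
puts clean[:year], clean[:sec]

# Trailing garbage is now an error instead of being silently dropped.
begin
  strict_strptime("2017/05/22 23:12:34 hogehogehoge", format)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Date._strptime sets :leftover only when some input remains unconsumed, so checking for the key is enough to tell a complete match from a partial one.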

Note that garbage at the beginning of the value causes an error.

Time.strptime("aaa 2017/05/22 23:12:34 hogehogehoge","%Y/%m/%d %H:%M:%S")

ArgumentError: invalid strptime format - `%Y/%m/%d %H:%M:%S'
from /path/to/.rbenv/versions/2.3.4/lib/ruby/2.3.0/time.rb:429:in `strptime'
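
The same input can be checked with Date._strptime directly. The sketch below shows that it returns nil when the format cannot match from the first character, which is the case Time.strptime reports as ArgumentError. The exact error message may differ between Ruby versions.

```ruby
require 'date'
require 'time'

format = "%Y/%m/%d %H:%M:%S"
text   = "aaa 2017/05/22 23:12:34 hogehogehoge"

# Leading garbage: %Y cannot match "aaa", so there is no partial result.
parts = Date._strptime(text, format)
puts parts.nil? # true

# Time.strptime raises for the same input.
begin
  Time.strptime(text, format)
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```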

YAML file

in:
  type: file
  path_prefix: /private/tmp/sample/csv/sample_
#  decoders:
#  - {type: gzip}
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
#    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: time, type: string }
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}

filters:
  - type: timestamp_format
#    default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss"]
    default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S"]
    columns:
      - {name: time, type: timestamp }