@berdario berdario/jis_enc.rb
Last active Aug 29, 2015
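#!/usr/bin/env ruby
# NOTE: the expression that built the candidate strings is missing from this copy.
# ASSUMPTION (not in the original): try every two-byte sequence, tagged as Windows-31J.
ENCODING = 'Windows-31J' # hypothetical choice; swap in the encoding under test
candidates = (0x0000..0xFFFF).map { |i| [i].pack('n').force_encoding(ENCODING) }
jis = candidates.select(&:valid_encoding?)
jis_only = jis.select { |s| s.encode('UTF-8', undef: :replace) == "\uFFFD" }
# this will obtain only the valid characters that we can't convert to unicode
puts jis_only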
# Printing this is of little use on a normal machine: the locale uses a UTF-8 encoding for the terminal, so we can't actually look at these characters.
# It's a mess.
# Ruby has these encoding aliases for Japanese:
# "EUC-JISX0213"=>"EUC-JP-2004", "SJIS"=>"Windows-31J", "CP932"=>"Windows-31J"
# python has
# cp932 aliased to 932, ms932, mskanji, ms-kanji
# euc_jp aliased to eucjp, ujis, u-jis
# euc_jis_2004 aliased to jisx0213, eucjis2004
# euc_jisx0213 aliased to eucjisx0213
# shift_jis aliased to csshiftjis, shiftjis, sjis, s_jis
# shift_jis_2004 aliased to shiftjis2004, sjis_2004, sjis2004
# shift_jisx0213 aliased to shiftjisx0213, sjisx0213, s_jisx0213
# I'm ignoring iso2022 encodings, and only looking at *jis*, *0213, *932
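# The Ruby alias list above can be regenerated on whatever Ruby is installed with Encoding.aliases.
# A minimal sketch: JAPANESE_PATTERN and japanese_aliases are names made up here, and the
# pattern just mirrors the *jis*, *0213, *932 filter; the output may differ between Ruby versions.

```ruby
# Hypothetical helper names; only Encoding.aliases is a real Ruby API here.
JAPANESE_PATTERN = /jis|0213|932/i

# Returns a Hash of alias => canonical name, restricted to Japanese-looking names.
def japanese_aliases
  Encoding.aliases.select do |alias_name, target|
    alias_name.match?(JAPANESE_PATTERN) || target.match?(JAPANESE_PATTERN)
  end
end

japanese_aliases.sort.each { |alias_name, target| puts "#{alias_name} => #{target}" }
```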
# With cp932, Python is able to decode b"\x87\x54" to 'Ⅰ',
# but Ruby only displays the hex code, and Ruby's CP932 encoding seems to be the same as its Shift-JIS (both names are aliases of Windows-31J in the list above).
# With euc-jisx0213, b"\x8e\xe0" is apparently valid in Ruby but illegal in Python. Judging from a conversion table, it seems to be a reversed value.
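# The two byte sequences above can be checked on the Ruby side with a small helper.
# This is only a sketch: describe_bytes is a made-up name, and what it prints depends on
# the Ruby version and the transcoders it ships, so no expected output is given.

```ruby
# Hypothetical helper: reports whether `bytes` is valid in `encoding_name`
# and, if so, what it becomes when transcoded to UTF-8.
def describe_bytes(bytes, encoding_name)
  s = bytes.b.force_encoding(encoding_name) # work on a binary copy of the input
  return 'invalid byte sequence' unless s.valid_encoding?
  s.encode('UTF-8')
rescue Encoding::UndefinedConversionError, Encoding::ConverterNotFoundError => e
  "valid, but not convertible to UTF-8 (#{e.class})"
rescue ArgumentError
  'unknown encoding name'
end

puts describe_bytes("\x87\x54", 'CP932')
puts describe_bytes("\x8e\xe0", 'EUC-JISX0213')
```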
# this thread has interesting information on the issue