Last active Aug 29, 2015
#!/usr/bin/env ruby
.select &:valid_encoding?
jis_only ={ |s| s.encode('UTF-8', undef: :replace) == "\uFFFD" }
# this keeps only the valid characters that cannot be converted to Unicode
puts jis_only
# printing this is of little use on a typical machine: the terminal's locale uses UTF-8, so these characters cannot be displayed
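# A minimal sketch of the same idea, printing hex bytes instead of the raw characters.
# Assumptions (not from the script above): the encoding probed is Windows-31J, the
# candidates are all two-byte sequences, and unconvertible? / hex_bytes are hypothetical helpers.

```ruby
# Assumption: Windows-31J (Ruby's name for CP932) is the encoding being probed.
SOURCE_ENCODING = Encoding.find('Windows-31J')

# Hypothetical helper: true when the whole string maps to a single U+FFFD,
# i.e. it is one character with no Unicode equivalent.
def unconvertible?(str)
  str.encode('UTF-8', undef: :replace) == "\uFFFD"
rescue EncodingError
  false
end

# Hypothetical helper: raw bytes as hex, safe to print on a UTF-8 terminal.
def hex_bytes(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

# Every possible two-byte sequence, tagged with the source encoding.
candidates = (0x0000..0xFFFF).map do |code|
  [code].pack('n').force_encoding(SOURCE_ENCODING)
end

valid = candidates.select(&:valid_encoding?)
unconvertible = valid.select { |s| unconvertible?(s) }

puts "valid: #{valid.size}, unconvertible: #{unconvertible.size}"
unconvertible.first(10).each { |s| puts hex_bytes(s) }
```

# The counts depend on the Ruby version and on which encoding is probed, so none are quoted here.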
# it's a mess:
# Ruby has these encoding aliases for Japanese
# "EUC-JISX0213"=>"EUC-JP-2004", "SJIS"=>"Windows-31J", "CP932"=>"Windows-31J"
# Python has
# cp932 aliased to 932, ms932, mskanji, ms-kanji
# euc_jp aliased to eucjp, ujis, u-jis
# euc_jis_2004 aliased to jisx0213, eucjis2004
# euc_jisx0213 aliased to eucjisx0213
# shift_jis aliased to csshiftjis, shiftjis, sjis, s_jis
# shift_jis_2004 aliased to shiftjis2004, sjis_2004, sjis2004
# shift_jisx0213 aliased to shiftjisx0213, sjisx0213, s_jisx0213
# I'm ignoring iso2022 encodings, and only looking at *jis*, *0213, *932
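# Ruby's side of the list above can be regenerated rather than copied by hand.
# A small sketch, assuming the same filter as the note (*jis*, *0213, *932); the exact
# aliases printed depend on the Ruby version.

```ruby
# Encoding.aliases returns a Hash of alias name => canonical encoding name.
pattern = /jis|0213|932/i

japanese_aliases = Encoding.aliases.select do |alias_name, canonical|
  alias_name.match?(pattern) || canonical.match?(pattern)
end

japanese_aliases.sort.each do |alias_name, canonical|
  puts "#{alias_name} => #{canonical}"
end
```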
# with cp932, Python is able to decode \x87\x54 to 'Ⅰ'
# but Ruby only displays the hex code, and Ruby's CP932 encoding seems to be the same as Shift-JIS
# with euc-jisx0213, b"\x8e\xe0" is apparently valid in Ruby but illegal in Python; judging from a conversion table, it seems to be a reserved value
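# The two byte sequences above can be checked from Ruby's side with a small probe.
# A sketch only: probe is a hypothetical helper, and the encoding names passed in are
# the aliases listed earlier. No results are quoted, since they vary by Ruby version.

```ruby
# Hypothetical helper: reports how Ruby treats a byte sequence in a given encoding.
def probe(bytes, encoding_name)
  str = bytes.b.force_encoding(encoding_name)
  converted = begin
    str.encode('UTF-8')
  rescue EncodingError => e
    e.class.name # e.g. an undefined-conversion or invalid-byte-sequence error
  end
  { encoding: str.encoding.name, valid: str.valid_encoding?, utf8: converted }
end

[
  ["\x87\x54", 'CP932'],
  ["\x87\x54", 'Shift_JIS'],
  ["\x8e\xe0", 'EUC-JISX0213']
].each do |bytes, name|
  p probe(bytes, name)
end
```

# Comparing the valid flag with the conversion result separates "well-formed but unmapped"
# from "malformed", which is the distinction the notes above are circling around.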
# this thread has interesting information on the issue