@berdario berdario/jis_enc.rb
Last active Aug 29, 2015

#!/usr/bin/env ruby
bytes = (0..255).to_a
valid = bytes.product(bytes)
  .map { |a, b| (a.chr + b.chr).force_encoding('EUC-JISX0213') }
  .select(&:valid_encoding?)
jis_only = valid.select{ |s| s.encode('UTF-8', undef: :replace) == "\uFFFD" }
# this keeps only the valid characters that can't be converted to Unicode
puts jis_only
# printing this is useless on a normal machine: the locale uses UTF-8 for the terminal, so these characters can't actually be displayed
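# One possible workaround (a sketch, not part of the original script): print the raw
# bytes of each string as hex, which any terminal can show. `hex_dump` is a
# hypothetical helper name introduced here for illustration.
def hex_dump(str)
  # String#bytes ignores the encoding tag, so it also works for unconvertible characters
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end
# e.g. jis_only.each { |s| puts hex_dump(s) }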
# It's a mess:
# Ruby has these encoding aliases for Japanese
# "EUC-JISX0213"=>"EUC-JIS-2004", "SJIS"=>"Windows-31J", "CP932"=>"Windows-31J"
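# A list like the one above can probably be regenerated from Encoding.aliases.
# This is only a sketch: `japanese_encoding_aliases` is a hypothetical helper, and
# the regexp is an assumed filter (names containing jis, 0213, 932 or 31J).
def japanese_encoding_aliases
  # Encoding.aliases maps alias name => canonical encoding name
  Encoding.aliases.select { |name, target| "#{name} #{target}" =~ /jis|0213|932|31j/i }
end
# e.g. japanese_encoding_aliases.each { |name, target| puts "#{name} => #{target}" }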
#
# Python has
# cp932 aliased to 932, ms932, mskanji, ms-kanji
# euc_jp aliased to eucjp, ujis, u-jis
# euc_jis_2004 aliased to jisx0213, eucjis2004
# euc_jisx0213 aliased to eucjisx0213
# shift_jis aliased to csshiftjis, shiftjis, sjis, s_jis
# shift_jis_2004 aliased to shiftjis2004, sjis_2004, sjis2004
# shift_jisx0213 aliased to shiftjisx0213, sjisx0213, s_jisx0213
# I'm ignoring the ISO-2022 encodings, and only looking at *jis*, *0213, *932
# with cp932, Python is able to decode \x87\x54 to 'Ⅰ'
# but Ruby only displays the hex code... and Ruby's CP932 encoding seems to be the same as Shift-JIS
# with euc-jisx0213, b"\x8e\xe0" is instead apparently valid in Ruby but illegal in Python... judging from a conversion table, it seems to be a reserved value
# this thread has interesting information on the issue: http://news.ycombinator.com/item?id=1162399
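# To re-check the two byte sequences mentioned above from Ruby, something like this
# sketch might help. `try_convert` is a hypothetical helper, not from the original
# script; it reports whether Ruby considers the bytes valid and whether they convert.
def try_convert(raw, encoding)
  s = raw.dup.force_encoding(encoding)
  return :invalid unless s.valid_encoding?
  s.encode('UTF-8')
rescue Encoding::UndefinedConversionError, Encoding::ConverterNotFoundError
  # valid in the source encoding, but Ruby has no Unicode mapping for it
  :unconvertible
end
# e.g. p try_convert("\x87\x54", 'Windows-31J')
#      p try_convert("\x8e\xe0", 'EUC-JISX0213')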