Last active
August 29, 2015 14:12
#!/usr/bin/env ruby
bytes = (0..255).to_a
valid = bytes.product(bytes)
             .map { |a, b| (a.chr + b.chr).force_encoding('EUC-JISX0213') }
             .select(&:valid_encoding?)
jis_only = valid.select { |s| s.encode('UTF-8', undef: :replace) == "\uFFFD" }
# This keeps only the valid characters that we can't convert to Unicode.
puts jis_only
# Printing this is useless on a normal machine: the locale uses a UTF-8 encoding for the terminal, so we can't actually look at these characters.
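# Since the terminal can't render these characters, one workaround is to print
# their raw bytes as hex instead. This is only a sketch: `hex_bytes` is a
# hypothetical helper name, not part of the original script.

```ruby
# Hypothetical helper: render a string's raw bytes as space-separated hex,
# regardless of the encoding the string is tagged with.
def hex_bytes(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

# String#b returns a binary (ASCII-8BIT) copy of the literal.
puts hex_bytes("\x8E\xE0".b)
```

# With the script above, `jis_only.each { |s| puts hex_bytes(s) }` would list
# the byte pairs themselves rather than unprintable characters.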
# It's a mess:
# Ruby has these encoding aliases for Japanese:
#   "EUC-JISX0213"=>"EUC-JP-2004", "SJIS"=>"Windows-31J", "CP932"=>"Windows-31J"
#
# Python has:
#   cp932 aliased to 932, ms932, mskanji, ms-kanji
#   euc_jp aliased to eucjp, ujis, u-jis
#   euc_jis_2004 aliased to jisx0213, eucjis2004
#   euc_jisx0213 aliased to eucjisx0213
#   shift_jis aliased to csshiftjis, shiftjis, sjis, s_jis
#   shift_jis_2004 aliased to shiftjis2004, sjis_2004, sjis2004
#   shift_jisx0213 aliased to shiftjisx0213, sjisx0213, s_jisx0213
# I'm ignoring the iso2022 encodings and only looking at *jis*, *0213, *932.
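# The Ruby side of the alias list above can be regenerated on whatever Ruby is
# installed. A minimal sketch, assuming the same *jis*, *0213, *932 filter;
# `japanese_encoding_aliases` is a hypothetical name, and the exact entries
# depend on the Ruby version.

```ruby
# Hypothetical helper: pick the entries of Encoding.aliases (alias => name)
# where either side matches the *jis*, *0213 or *932 patterns.
def japanese_encoding_aliases
  pattern = /jis|0213|932/i
  Encoding.aliases.select do |alias_name, target|
    alias_name =~ pattern || target =~ pattern
  end
end

japanese_encoding_aliases.sort.each do |alias_name, target|
  puts "#{alias_name} => #{target}"
end
```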
# With cp932, Python is able to decode \x87\x54 to 'Ⅰ',
# but Ruby only displays the hex code, and Ruby's cp932 encoding seems to be the same as Shift-JIS.
# With euc-jisx0213, b"\x8e\xe0" is instead apparently valid in Ruby but illegal in Python; judging by a conversion table, it seems to be a reversed value.
# This thread has interesting information on the issue: http://news.ycombinator.com/item?id=1162399
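# To re-check observations like the two above on a given Ruby, the byte pairs
# can be probed one encoding at a time. This is a sketch only: `probe` and its
# symbol return values are hypothetical, and what it reports for the pairs
# discussed here depends on the Ruby version, so no results are claimed.

```ruby
# Hypothetical helper: tag raw bytes with an encoding and try to reach UTF-8.
# Returns the UTF-8 string on success, or a symbol naming what went wrong.
def probe(bytes, encoding_name)
  str = bytes.dup.force_encoding(encoding_name)
  return :invalid_byte_sequence unless str.valid_encoding?
  str.encode('UTF-8')
rescue Encoding::UndefinedConversionError
  :no_unicode_mapping   # valid in the source encoding, but no Unicode target
rescue Encoding::ConverterNotFoundError
  :no_converter         # this Ruby ships no transcoder for the encoding
end

p probe("\x87\x54".b, 'Windows-31J')
p probe("\x87\x54".b, 'Shift_JIS')
p probe("\x8E\xE0".b, 'EUC-JISX0213')
```

# Running the same byte pairs through Python's bytes.decode gives the other
# half of the comparison.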