@dtinth
Created June 7, 2017 17:55

When p7zip extracts a .zip file whose file names use a Japanese encoding (CP932) while running in a UTF-8 locale, it creates a double-encoded file name:

\u0082±\u0082ñ\u0082É\u0082¿\u0082Í\u0081I.txt

Upon closer inspection, some code points are greater than 127, and all of them are less than 256, so each one fits in a single byte.

> filename.codepoints
=> [130, 177, 130, 241, 130, 201, 130, 191, 130, 205, 129, 73, 46, 116, 120, 116]
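A minimal sketch of how such a name could arise, assuming the extractor turns each CP932 byte into the code point of the same value and then writes the result as UTF-8. Whether p7zip does exactly this internally is an assumption; the variable names are illustrative only.

```ruby
# The intended file name, as a normal UTF-8 string.
original = "こんにちは！.txt"

# The raw bytes as they would be stored in a CP932-encoded .zip entry.
cp932_bytes = original.encode('CP932').bytes

# Treat every byte as a Unicode code point of the same value and
# build a UTF-8 string from those code points ("double encoding").
mangled = cp932_bytes.pack('U*')

# Every code point stays below 256 because each one started life as a byte.
puts mangled.codepoints.all? { |c| c < 256 }

# The result matches the mangled name seen on disk.
puts mangled == "\u0082\u00B1\u0082\u00F1\u0082\u00C9\u0082\u00BF\u0082\u00CD\u0081I.txt"
```

Both checks should print `true` if the assumption about the byte-to-code-point mapping holds.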

They can be…

> filename.codepoints
    .pack('C*')                     # ...interpreted as (unsigned) bytes
    .force_encoding('CP932')        # ...of CP932 encoding
    .encode('UTF-8')                # ...and transcoded to UTF-8
=> "こんにちは！.txt"
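The conversion above can be wrapped in a small helper, sketched here under a few assumptions: the name `fix_double_encoded` is hypothetical, CP932 is assumed as the default legacy encoding, and names that do not look double-encoded are returned unchanged.

```ruby
# Hypothetical helper: undo the double encoding of a single file name.
def fix_double_encoded(name, legacy_encoding = 'CP932')
  codepoints = name.codepoints

  # Pure ASCII needs no fixing, and a code point above 255 cannot be
  # a raw byte in disguise, so leave such names alone.
  return name if codepoints.all? { |c| c < 128 } || codepoints.any? { |c| c > 255 }

  codepoints
    .pack('C*')                        # code points back to raw bytes
    .force_encoding(legacy_encoding)   # label the bytes with the legacy encoding
    .encode('UTF-8')                   # transcode to UTF-8
rescue EncodingError
  # The bytes were not valid in the legacy encoding: keep the original name.
  name
end

mangled = "\u0082\u00B1\u0082\u00F1\u0082\u00C9\u0082\u00BF\u0082\u00CD\u0081I.txt"
puts fix_double_encoded(mangled)
```

To repair files on disk, the result could be passed to `File.rename(old_name, fix_double_encoded(old_name))`. Names that happen to be legitimate Latin-1 text (for example "café.txt") also consist of code points below 256, so it is worth reviewing the proposed names before renaming anything.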