
@Francesco149
Last active July 12, 2018 18:19

Detecting and dealing with weird file name encodings in zip files on Linux

A friend of mine recently asked me for help with a specific zip file that was supposed to contain a bunch of files with Japanese names.

No matter which tool we used to extract it or what the system locale was set to, the file names would end up garbled. We tried unzip and 7z to no avail.

After some googling, I found guides explaining that you have to stop 7z from automatically converting the names to UTF-8 by extracting with LC_ALL=C, and then use convmv to convert the file names to UTF-8.

The problem with this approach is that you need to know the original encoding of the file names. I couldn't find any way to detect the file name encoding of a zip file, and running Python's encoding guesser on the file names didn't give any correct guesses either.

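We tried the usual Japanese encodings such as euc-jp and cp932 (Microsoft's Shift_JIS variant, which is not the same thing as euc-jp) since the files were supposed to be Japanese, but that didn't seem to be it.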

Tired of wasting so much time on a single zip file, I decided to write a simple bash function that iterates over every encoding supported by convmv and displays what the files would be renamed to.

I added the function to my .bashrc:

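bruteforce-charset() {
    # Dry-run convmv with every encoding it knows and print the proposed renames.
    local enc
    for enc in $(convmv --list); do
        printf "%72s\n" | tr " " -
        echo "# Testing $enc"
        echo ""
        # Without --notest, convmv only reports what it would rename.
        # "$@" passes every file given to the function, not just the first.
        yes | convmv -f "$enc" -t utf8 -r "$@"
        printf "%72s\n" | tr " " -
        echo ""
    done
}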

Then, I extracted the zip file while preserving the file name encoding and ran the bruteforce function on the extracted files:

mkdir test
cd test
LC_ALL=C 7z x /path/to/file.zip
bruteforce-charset *

The output looks something like this:

(snip)

----------------------------------------------------------------------
# Testing euc-cn

Starting a dry run without changes...
mv "�?/09 �����׫뫺.mp3" "�?/09 岽掖�戢毛撰氆�.mp3"
mv "�?/04 ���.mp3"      "�?/04 攸绡伍.mp3"
mv "�?/03 �ݫë׫߫�?���ë��.mp3"     "�?/03 �莴毛撰攉�?�斧毛�皱.mp3"
mv "./�?"       "./脲?"
No changes to your files done. Use --notest to finally rename the file
----------------------------------------------------------------------

----------------------------------------------------------------------
# Testing euc-jp

Starting a dry run without changes...
mv "�?/09 �����׫뫺.mp3" "�?/09 甦甸̜ɚ̟̆ɒ.mp3"
mv "�?/04 ���.mp3"      "�?/04 戌膀礼.mp3"
mv "�?/03 �ݫë׫߫�?���ë��.mp3"     "�?/03 ̏ɚ̆̂��?ɔɚɵ帙.mp3"
mv "./�?"       "./諷?"
No changes to your files done. Use --notest to finally rename the file
----------------------------------------------------------------------

----------------------------------------------------------------------
# Testing euc-kr

Starting a dry run without changes...
mv "�?/09 �����׫뫺.mp3" "�?/09 少年リップルズ.mp3"
mv "�?/04 ���.mp3"      "�?/04 面影橋.mp3"
mv "�?/03 �ݫë׫߫�?���ë��.mp3"     "�?/03 ポップミュ?ジック論.mp3"
mv "./�?"       "./音?"
No changes to your files done. Use --notest to finally rename the file
----------------------------------------------------------------------

(snip)

All I did was scroll through the output until I saw meaningful file names. In this case, the encoding was euc-kr, which suggests that whoever uploaded this zipped the files on a Korean copy of Windows.

We can now fix the file names by running convmv -f euc-kr -t utf8 -r --notest *
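
For a quicker first pass than scrolling through the whole dry-run output, a rough sketch like the one below can rule out encodings in which the raw bytes of one sample name are not even valid. The function name try_encodings and the idea of passing the candidate encodings as arguments are my own assumptions, not part of convmv, and it relies on iconv being installed (encoding names accepted by iconv differ between systems). An encoding that decodes without error is not necessarily the right one (euc-cn and euc-jp both "worked" above), so you still have to read the results.

```shell
# Hypothetical helper: print how one raw file name decodes under each
# candidate encoding, skipping encodings in which the bytes are invalid.
# Usage: try_encodings RAW_NAME ENCODING...
try_encodings() {
    local name="$1" enc decoded
    shift
    for enc in "$@"; do
        # iconv exits non-zero when the input is not valid in the source encoding.
        if decoded=$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%-12s %s\n' "$enc" "$decoded"
        fi
    done
}
```

Something like try_encodings "$(ls | head -n 1)" EUC-JP EUC-KR EUC-CN, run inside the directory extracted with LC_ALL=C, would then print one line per encoding that can decode the name at all.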

Japanese file names in a Korean encoding. What a nightmare.

If the zip format really does not store any information about the character encoding, I'm definitely going to strongly advise against using it, or at least recommend an archiver that converts the file names to a fixed encoding.
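
As far as I can tell, the zip specification does define one hint: bit 11 of the "general purpose bit flag" in each entry header marks that entry's file name as UTF-8. When the bit is not set (as is typical for archives like this one), nothing records which legacy code page was used. The sketch below only inspects the first local file header and assumes the archive starts with it (no self-extractor stub in front). The function name zip_utf8_flag is made up for illustration.

```shell
# Hypothetical helper: succeed if the first entry of a zip archive has the
# UTF-8 file name flag (general purpose bit 11) set, fail otherwise.
zip_utf8_flag() {
    local sig hi
    # A local file header starts with the bytes "PK\003\004" (80 75 3 4).
    sig=$(od -An -tu1 -N4 "$1" | tr -s ' \n' ' ')
    if [ "$sig" != " 80 75 3 4 " ]; then
        echo "not a zip local file header" >&2
        return 2
    fi
    # Bytes 6-7 hold the little-endian flag word; bit 11 is bit 3 (value 8)
    # of the high byte, which is the second of the two bytes read here.
    set -- $(od -An -tu1 -j6 -N2 "$1")
    hi=${2:-0}
    if [ $(( hi & 8 )) -ne 0 ]; then
        echo "UTF-8 flag set"
        return 0
    fi
    echo "UTF-8 flag not set (file name encoding unspecified)"
    return 1
}
```

Running zip_utf8_flag /path/to/file.zip before extracting would tell you whether the brute-force approach is likely to be needed at all.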
