Francesco149/linux_weird_zip_encodings.md

## linux_weird_zip_encodings.md

      
Detecting and dealing with weird file name encodings in zip files on Linux

A friend of mine recently asked me for help with a zip file that was
supposed to contain a bunch of files with Japanese names.
No matter which tool we used to extract it or what the system locale was set
to, the file names ended up garbled. We tried `unzip` and `7z` to no avail.

After some googling I found guides explaining that you have to prevent `7z`
from automatically converting the names to UTF-8 by extracting with
`LC_ALL=C`, and then use `convmv` to convert the file names to UTF-8.

The problem with this approach is that you need to know the original encoding
of the file names. I couldn't find any way to detect the file name encoding of
a zip file, and running Python's encoding guesser on the file names didn't give
any correct guesses either.

We tried euc-jp, since the files were supposed to be Japanese, but that didn't
seem to be it. (cp932, the other common candidate, is Microsoft's Shift JIS
variant and not the same thing as euc-jp.)

Tired of all the time wasted on a single zip file, I decided to write a
simple bash function that iterates over every encoding supported by
`convmv` and displays what the files would be renamed to.
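I wrote the function in my `.bashrc`:

    # Dry-run convmv with every encoding it supports and print the results.
    bruteforce-charset() {
        local enc
        for enc in $(convmv --list); do
            printf "%72s\n" | tr " " -
            echo "# Testing $enc"
            echo ""
            # Without --notest, convmv only prints what it would rename.
            yes | convmv -f "$enc" -t utf8 -r "$@"
            printf "%72s\n" | tr " " -
            echo ""
        done
    }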
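The same brute-force idea can also be tried on a single name without involving `convmv`. The following is only a sketch, assuming `iconv` is installed and that the raw name bytes are passed as the first argument. `try_encodings` is a made-up helper name, and the encoding names are `iconv` spellings, which may not match what `convmv --list` prints. It prints how the bytes read under each candidate and silently skips candidates under which they are not valid. After extracting with `LC_ALL=C`, something like `for f in *; do try_encodings "$f" EUC-JP CP932 EUC-KR; done` would show every name under those three encodings.

```bash
# Hypothetical helper, not part of convmv or iconv.
# $1 is the raw (undecoded) file name, the remaining arguments are
# candidate encodings in iconv's spelling.
try_encodings() {
    local raw enc decoded
    raw="$1"
    shift
    for enc in "$@"; do
        # iconv exits non-zero when the bytes are invalid in $enc,
        # so such candidates are dropped without printing anything.
        if decoded=$(printf '%s' "$raw" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%-10s %s\n' "$enc" "$decoded"
        fi
    done
}
```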
Then I extracted the zip file while preserving the raw file name bytes and ran
the brute-force function on the extracted files:

    mkdir test
    cd test
    LC_ALL=C 7z x /path/to/file.zip
    bruteforce-charset *
The output looks something like this:

    (snip)

    ----------------------------------------------------------------------
    # Testing euc-cn

    Starting a dry run without changes...
    mv "�?/09 �����׫뫺.mp3" "�?/09 岽掖�戢毛撰氆�.mp3"
    mv "�?/04 ���.mp3"      "�?/04 攸绡伍.mp3"
    mv "�?/03 �ݫë׫߫�?���ë��.mp3"     "�?/03 �莴毛撰攉�?�斧毛�皱.mp3"
    mv "./�?"       "./脲?"
    No changes to your files done. Use --notest to finally rename the file
    ----------------------------------------------------------------------

    ----------------------------------------------------------------------
    # Testing euc-jp

    Starting a dry run without changes...
    mv "�?/09 �����׫뫺.mp3" "�?/09 甦甸̜ɚ̟̆ɒ.mp3"
    mv "�?/04 ���.mp3"      "�?/04 戌膀礼.mp3"
    mv "�?/03 �ݫë׫߫�?���ë��.mp3"     "�?/03 ̏ɚ̆̂��?ɔɚɵ帙.mp3"
    mv "./�?"       "./諷?"
    No changes to your files done. Use --notest to finally rename the file
    ----------------------------------------------------------------------

    ----------------------------------------------------------------------
    # Testing euc-kr

    Starting a dry run without changes...
    mv "�?/09 �����׫뫺.mp3" "�?/09 少年リップルズ.mp3"
    mv "�?/04 ���.mp3"      "�?/04 面影橋.mp3"
    mv "�?/03 �ݫë׫߫�?���ë��.mp3"     "�?/03 ポップミュ?ジック論.mp3"
    mv "./�?"       "./音?"
    No changes to your files done. Use --notest to finally rename the file
    ----------------------------------------------------------------------

    (snip)
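All I did was scroll through the output until I saw meaningful file names. In
this case the encoding was euc-kr, which suggests that the person who uploaded
this zipped the files on a Korean copy of Windows. The `?` characters that
remain are already present in the raw names, so whatever they replaced
(probably characters the Korean code page cannot represent) was lost when the
archive was created.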
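Scrolling works, but since the names were expected to be Japanese, the scan could be narrowed mechanically. As a rough sketch (the helper names `contains_kana` and `bruteforce_kana` are made up, and the check is only a heuristic): keep a candidate encoding only if its dry-run output contains hiragana or katakana, which in UTF-8 are exactly the byte sequences E3 81 80 through E3 83 BF (U+3040 to U+30FF). This would miss names written purely in kanji or Latin letters, and it assumes a `grep` that matches raw bytes under `LC_ALL=C`, as GNU grep does.

```bash
# Hypothetical helper: succeeds if stdin contains at least one hiragana
# or katakana character encoded as UTF-8 (bytes E3 81 80 .. E3 83 BF).
contains_kana() {
    LC_ALL=C grep -q "$(printf '\343[\201-\203][\200-\277]')"
}

# Possible use: only report encodings whose convmv dry run produces kana.
bruteforce_kana() {
    local enc out
    for enc in $(convmv --list); do
        out=$(yes | convmv -f "$enc" -t utf8 -r "$@" 2>&1)
        if printf '%s\n' "$out" | contains_kana; then
            echo "# Candidate: $enc"
            printf '%s\n' "$out"
        fi
    done
}
```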
We can now fix the file names by running

    convmv -f euc-kr -t utf8 -r --notest *

Japanese file names in a Korean encoding. What a nightmare.
If the zip format really does not store any information about the character
encoding, I'm definitely going to advise against using it, or at least
recommend an archiver that converts the file names to a fixed encoding.
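For what it's worth, the zip format does carry one relevant bit: general purpose flag bit 11 of each entry marks its name as UTF-8. When that bit is clear, the specification nominally means code page 437, but in practice many tools just write the local code page, and nothing in the entry records which one that was. A minimal sketch for peeking at that bit, assuming the archive starts with a local file header and only looking at the first entry (the helper name is made up, and the central directory is the authoritative place for these flags):

```bash
# Hypothetical helper: succeeds if the first entry of zip file $1 has
# general purpose flag bit 11 (UTF-8 file name) set.
# Returns 2 if the file does not start with a local file header.
zip_first_name_is_utf8() {
    local sig high
    # Local file headers start with the bytes 50 4b 03 04 ("PK\3\4").
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$sig" = "504b0304" ] || return 2
    # The 16-bit flag field sits at offset 6 in little-endian order, so
    # bit 11 (0x0800) is bit 3 (value 8) of the byte at offset 7.
    high=$(od -An -tu1 -j7 -N1 "$1" | tr -d ' \n')
    [ $(( (high / 8) % 2 )) -eq 1 ]
}
```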