The Incredible Disaster of Platform Specific Implementations of Zip

Are you arguing that the zip implementation in Python should adhere to the zip(1) behavior instead of the zip specification?

An advantage of the Python implementation is that zip archives are portable across systems. This is not the case with the Linux implementation.

For instance, extracting the example t.zip (created on a Linux system) on my Mac fails with this error:

$ unzip ../t.zip
Archive:  ../t.zip
error:  cannot create test.txt
        Illegal byte sequence

Here is a blog post that goes into detail about the issues with platform specific zip implementations:

https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/

Your post might be better titled: "The Incredible Disaster of Platform Specific Implementations of Zip" :-)

From the post:

The Zip spec says that the only supported encodings are CP437 and UTF-8, but everyone has ignored that. Implementers just encode file names however they want (usually byte for byte as they are in the OS… see table below).

With regard to your comment:

Furthermore, this doesn’t explain the corruption that extractall() causes.

Let's take a look starting with a zip file similar to the one you create in the bug report:

$ echo hi > "$(printf 'test\xf7.txt')"
$ zip t.zip *.txt
  adding: test.txt (stored 0%)

Now let's look at the flag for that file in Python:

$ python3
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from zipfile import ZipFile
>>> zf = ZipFile("t.zip")
>>> print("UTF" if zf.infolist()[0].flag_bits & 0x800 else "Not UTF")
Not UTF
>>>

Looking at how Python's zipfile module decodes filenames:

if flags & 0x800:
  # UTF-8 file names extension
  filename = filename.decode('utf-8')
else:
  # Historical ZIP filename encoding
  filename = filename.decode('cp437')

Because the UTF-8 flag (0x800) is not set for this entry, the filename bytes are decoded as cp437, so the raw byte 0xF7 turns into a different character. That's where the corruption comes from.
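To make that concrete, here is a minimal sketch of what the decode does to the name. It assumes zip(1) stored the filename bytes exactly as they were on disk (b"test\xf7.txt"); the variable names are just for illustration.

```python
# Assumption: these are the filename bytes as stored in t.zip by zip(1).
raw = b"test\xf7.txt"

# With the UTF-8 flag unset, the name is decoded as cp437.
decoded = raw.decode("cp437")
print(ascii(decoded))  # byte 0xF7 becomes U+2248 (ALMOST EQUAL TO)

# The same bytes are not valid UTF-8, so they could not have been
# decoded that way even if the flag had been set.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)

# cp437 maps every byte value, so the original bytes can be recovered
# from the decoded name if you need them.
recovered = decoded.encode("cp437")
print(recovered == raw)
```

So the name is remapped rather than lost: re-encoding the decoded name as cp437 gives back the original bytes.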

Given the above information, would you agree that the issues you are having are due to your expectation of the behavior of Zip files and not really issues with Python?

The Python implementation is in perfect compliance with the Zip specification and results in archives that are portable across systems.
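For comparison, here is a small sketch of what Python does on the writing side. The helper name utf8_flag_for is made up for this example; it writes a single entry to an in-memory archive and reads the flag back.

```python
import io
from zipfile import ZipFile


def utf8_flag_for(name):
    """Write one entry named `name`, then report (flag set?, name read back).

    Hypothetical helper for illustration only.
    """
    buf = io.BytesIO()
    with ZipFile(buf, "w") as zf:
        zf.writestr(name, "hi\n")
    buf.seek(0)
    with ZipFile(buf) as zf:
        info = zf.infolist()[0]
        return bool(info.flag_bits & 0x800), info.filename


# A non-ASCII name is stored as UTF-8 with the 0x800 flag set.
print(utf8_flag_for("test\u00f7.txt"))

# A pure ASCII name needs no flag; it reads the same either way.
print(utf8_flag_for("test.txt"))
```

The non-ASCII name comes back unchanged because the flag tells the reader to decode it as UTF-8.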
