Are you arguing that the zip implementation in Python should adhere to the zip(1) behavior instead of the zip specification?
An advantage of the Python implementation is that zip archives are portable across systems. This is not the case with the Linux implementation.
For example, when I try to extract the example t.zip (created on a Linux system) on my Mac, I get this error:
$ unzip ../t.zip
Archive:  ../t.zip
error:  cannot create test.txt
        Illegal byte sequence
Here is a blog post that goes into detail about the issues with platform specific zip implementations:
Your post might be better titled: "The Incredible Disaster of Platform Specific Implementations of Zip" :-)
From the post:
The Zip spec says that the only supported encodings are CP437 and UTF-8, but everyone has ignored that. Implementers just encode file names however they want (usually byte for byte as they are in the OS… see table below).
With regard to your comment:
Furthermore, this doesn’t explain the corruption that extractall() causes.
Let's take a look starting with a zip file similar to the one you create in the bug report:
$ echo hi > "$(printf 'test\xf7.txt')"
$ zip t.zip *.txt
  adding: test.txt (stored 0%)
Now let's look at the flag for that file in Python:
$ python3
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from zipfile import ZipFile
>>> zf = ZipFile("t.zip")
>>> print("UTF" if zf.infolist()[0].flag_bits & 0x800 else "Not UTF")
Not UTF
>>>
Looking at the Python implementation:
    if flags & 0x800:
        # UTF-8 file names extension
        filename = filename.decode('utf-8')
    else:
        # Historical ZIP filename encoding
        filename = filename.decode('cp437')
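We can see that, because the UTF-8 flag is not set, the filename is decoded as cp437. That is where the corruption comes from: the raw byte 0xF7 in your file name is read as the cp437 character "≈" (U+2248) rather than whatever character your locale intended.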
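To make the mismatch concrete, here is a minimal sketch of what that decode does to the name from the example. It assumes the archive stores the name byte for byte as the printf command wrote it; the Latin-1 line is only a guess at what the creating system might have meant, not something recorded in the archive.

```python
# Raw filename bytes, assumed to be stored unchanged by zip(1).
raw = b"test\xf7.txt"

# What zipfile does when flag bit 0x800 is not set.
as_cp437 = raw.decode("cp437")

# What a Latin-1 locale would have meant by the same byte (illustrative assumption).
as_latin1 = raw.decode("latin-1")

# The bytes are not valid UTF-8, so a reader cannot fall back to that either.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(ascii(as_cp437))   # 'test\u2248.txt'  (U+2248 is the "almost equal to" sign)
print(ascii(as_latin1))  # 'test\xf7.txt'    (U+00F7 is the division sign)
print(valid_utf8)        # False
```

The same byte maps to two different characters depending on which encoding the reader assumes, and nothing in the archive says which one the writer used.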
Given the above information, would you agree that the issues you are having are due to your expectation of the behavior of Zip files and not really issues with Python?
The Python implementation is in perfect compliance with the Zip specification and results in archives that are portable across systems.
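As a rough illustration of the portability point, the sketch below writes a non-ASCII name with zipfile and reads it back. The file name is a made-up example, and the archive is kept in memory only to keep the snippet self-contained.

```python
import io
from zipfile import ZipFile

# Hypothetical non-ASCII file name ("test" + division sign + ".txt").
name = "test\u00f7.txt"

# Write the archive with Python's zipfile.
buf = io.BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr(name, "hi\n")

# Reopen it from the raw bytes, as a reader on another system would.
with ZipFile(io.BytesIO(buf.getvalue())) as zf:
    info = zf.infolist()[0]
    utf8_flag_set = bool(info.flag_bits & 0x800)
    round_tripped = info.filename

print(utf8_flag_set)          # True
print(round_tripped == name)  # True
```

Because the UTF-8 flag is set when the name is not plain ASCII, a reader that follows the specification does not have to guess the encoding.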