Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gzip.compress(..., mtime=0) in cpython 3.11+ unexpectedly sets OS byte in gzip header #112346

Open
dennisvang opened this issue Nov 23, 2023 · 14 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dennisvang
Copy link

dennisvang commented Nov 23, 2023

Bug report

description

Using gzip.compress() with mtime=0 in 3.8<=cpython<=3.10, the OS byte, i.e. the 10th byte in the GZIP header, is set to 255 "unknown" (also see e.g. #83302):

return struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, int(mtime), xfl, 255)

However, in cpython 3.11 and 3.12, the OS byte is suddenly set to a "known" value, e.g. 3 ("Unix") on Ubuntu.

This is not mentioned in the changelog for Python 3.11.

This may lead to problems in the context of reproducible builds. In our case, hash checking fails after decompressing and re-compressing a gzipped archive.

how to reproduce

Here's an example, where byte 10 is \xff in python 3.10 and \x03 in python 3.11:

~ $ python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

~ $ pyenv shell 3.11
~ $ python
Python 3.11.6 (main, Nov 23 2023, 17:30:16) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

cause

I guess this is caused by python 3.11 delegating the gzip.compress() call to zlib if mtime=0, as mentioned in the docs:

Changed in version 3.11: Speed is improved by compressing all data at once instead of in a streamed fashion. Calls with mtime set to 0 are delegated to zlib.compress() for better speed.

and source:

cpython/Lib/gzip.py

Lines 609 to 612 in 89ddea4

if mtime == 0:
# Use zlib as it creates the header with 0 mtime by default.
# This is faster and with less overhead.
return zlib.compress(data, level=compresslevel, wbits=31)

Apparently zlib does set the OS byte.

CPython versions tested on:

3.8, 3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux, macOS, Windows

Linked PRs

@dennisvang dennisvang added the type-bug An unexpected behavior, bug, or error label Nov 23, 2023
@Eclips4 Eclips4 added the stdlib Python modules in the Lib dir label Nov 23, 2023
@dennisvang
Copy link
Author

dennisvang commented Nov 23, 2023

In itself I think it may be a good thing that the OS byte is properly set.

The problem is just that the change is not explicitly documented, as far as I know.

dennisvang added a commit to dennisvang/tufup that referenced this issue Nov 23, 2023
@rhpvorderman
Copy link
Contributor

Hi @dennisvang . This is my fault, as I delegated gzip.compress(mtime=0) to zlib.compress, incorrectly assuming this was the same. The reason is that zlib.compress is faster. But if it leads to behavioral changes, that is not acceptable.

I believe this can easily be remedied by removing the codepath.

@rhpvorderman
Copy link
Contributor

I have made a PR. Just now and put Bugfix in the name. Now I hope it will get attention.

@dennisvang
Copy link
Author

@rhpvorderman Thanks for picking this up.

I wonder, if this is the only side-effect, and if the performance gain from using zlib.compress is worth it, perhaps you could just keep delegating to zlib and change byte 10 back to \xff afterwards?

@rhpvorderman
Copy link
Contributor

Well, as mentioned in the PR, keeping two separate code paths caused issues before. It is best to keep one codepath. There is a mention in the documentation about zlib.compress so users who need the performance can use it themselves.

@dennisvang
Copy link
Author

@rhpvorderman You're right, that makes sense.

@rhpvorderman
Copy link
Contributor

ping
This bug and fix have been lingering for a while.

@serhiy-storchaka
Copy link
Member

For reference, this feature was added in bpo-43613 (gh-87779). It included more optimizations, the only issue with delegating the whole compression to zlib, when mtime is 0.

The fix looks correct and it still preserves some speed up. An alternate solution could be to call zlib.compress() (even if mtime is not 0) and then patch the result for mtime and the OS byte, but I do not know how reliable is it and whether method is faster.

rhpvorderman added a commit to rhpvorderman/cpython that referenced this issue Jun 12, 2024
@gpshead
Copy link
Member

gpshead commented Jun 12, 2024

Is this really a big deal? We won't be able to backport this to 3.11 as that is in security fix only mode.

In our case, hash checking fails after decompressing and re-compressing a gzipped archive.

zlib cannot be presumed to produce canonical output. There are many different zlib implementations.

Decompressing gzipped data that you did not produce and recompressing it without using identical software is already not guaranteed to produce the same compressed output.

From a reproducible build perspective I suggest always patching irrelevant fields such as gzip header mtime and OS fields to constant values as part of the build.

@dennisvang
Copy link
Author

Is this really a big deal? ...

@gpshead Probably not for most people.

As commented above, I was only hoping for a small note to be added to the changelog (or documentation).

Just in case someone does rely on the OS byte in the gzip header, in whatever context.

zlib cannot be presumed to produce canonical output. There are many different zlib implementations.
...
From a reproducible build perspective I suggest always patching irrelevant fields ...

Very true

I just mentioned the reproducible build example to provide some context, although it was an edge case involving files created on the same machine, with exact same zlib implementation, but a different python version.

@serhiy-storchaka
Copy link
Member

This may be not such big deal, but it is still a problem. mtime=0 is used to produce more reproducible output, but currently it is less reproducible than for non-zero mtime. We can discuss whether it should be backported to 3.11 or even to 3.12, but this is a bug that should be fixed in new releases.

Since mtime=0 is used to produce more reproducible output, we have the following options:

  • Always set the OS field to "unknown". This is the behavior before 3.11.
  • Set the OS field to "unknown" if mtime is zero, and to the OS specific value otherwise.
  • Always set the OS field to the OS specific value.

The current behavior is not included in reasonable options.

There are also two implementation options:

  • Use zlib.compress() only for the raw data and generate the header and the trailer in Python.
  • Use zlib.compress() to produce the full data and patch it (the mtime and the OS fields afterward).

This should be decided based on relative timing of these two methods.

For now, #114116 looks like a simple and safe option, but you are welcome to bikeshedding.

@rhpvorderman
Copy link
Contributor

This should be decided based on relative timing of these two methods.

The later is definitely quicker as the crc calculation also happens in one go. Using zlib.compress with wbits 31 and then always patching the header for more consistent results and should be a faster default path.

@rhpvorderman
Copy link
Contributor

I did some benchmarking and special-casing mtime=0 does not provide much benefit:

./python -m timeit -s 'import gzip; import zlib; data=b"Some arbitrary small data of reasonable size to be in-memory compressed. JSON API responses, tweets, stuff like that. Gewoon wat uit de nek kletsen tot er een redelijke hoeveelheid data is. Dat gaat makkelijker in mijn moedertaal uiteraard."' 'for i in range(100): gzip.compress(data, compresslevel=1, mtime=0)'
50 loops, best of 5: 5.45 msec per loop

/python -m timeit -s 'import gzip; import zlib; data=b"Some arbitrary small data of reasonable size to be in-memory compressed. JSON API responses, tweets, stuff like that. Gewoon wat uit de nek kletsen tot er een redelijke hoeveelheid data is. Dat gaat makkelijker in mijn moedertaal uiteraard."' 'for i in range(100): zlib.compress(data, level=1)'
50 loops, best of 5: 5.41 msec per loop

The problem is that I did not separately benchmark this codepath at the time, as it seemed to me that doing everything in C is obviously faster than using struct.pack in combination with building new bytes objects in memory. However, DEFLATE compression is apparently so expensive, even on level 1 that this does not matter.

I also made a new PR. That makes zlib always write the trailer and the header is simply replacing parts of the zlib header. The speed is the same, but it simplifies the code a lot, and always guarantees the OS byte being set to 255. There is no separate codepath for mtime=0. The xfl byte is now set by zlib, as it ideally should be. The resulting code should be easier to maintain going forward.

@gpshead
Copy link
Member

gpshead commented Jun 14, 2024

Always set the OS field to "unknown". This is the behavior before 3.11.

This is what I'd call our ideal behavior.

Where we can't do that, documenting that the field may be set to different values in different situations is worthwhile which my draft docs PR does.

I like the look of your #120486 change. We can back-port that to 3.13 as it is fine to make such a change during the beta period.

gpshead pushed a commit that referenced this issue Jun 15, 2024
…GH-120486)

This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0.  this keeps things consistent.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 15, 2024
…ction. (pythonGH-120486)

This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0.  this keeps things consistent.
(cherry picked from commit 08d09cf)

Co-authored-by: Ruben Vorderman <[email protected]>
gpshead pushed a commit that referenced this issue Jun 15, 2024
…nction. (GH-120486) (#120563)

gh-112346: Always set OS byte to 255, simpler gzip.compress function. (GH-120486)

This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0.  this keeps things consistent.
(cherry picked from commit 08d09cf)

Co-authored-by: Ruben Vorderman <[email protected]>
gpshead added a commit that referenced this issue Jun 17, 2024
….11 (#120480)

gh-112346: Describe the "os" byte in gzip output change.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 17, 2024
…e in 3.11 (pythonGH-120480)

(cherry picked from commit bac4eda)

Co-authored-by: Gregory P. Smith <[email protected]>
pythongh-112346: Describe the "os" byte in gzip output change.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 17, 2024
…e in 3.11 (pythonGH-120480)

(cherry picked from commit bac4eda)

Co-authored-by: Gregory P. Smith <[email protected]>
pythongh-112346: Describe the "os" byte in gzip output change.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 17, 2024
…e in 3.11 (pythonGH-120480)

(cherry picked from commit bac4eda)

Co-authored-by: Gregory P. Smith <[email protected]>
pythongh-112346: Describe the "os" byte in gzip output change.
gpshead added a commit that referenced this issue Jun 17, 2024
…ge in 3.11 (GH-120480) (#120613)

gh-112346: Document the OS byte in `gzip.compress` output change in 3.11 (GH-120480)
(cherry picked from commit bac4eda)


gh-112346: Describe the "os" byte in gzip output change.

Co-authored-by: Gregory P. Smith <[email protected]>
gpshead added a commit that referenced this issue Jun 17, 2024
…ge in 3.11 (GH-120480) (#120612)

gh-112346: Document the OS byte in `gzip.compress` output change in 3.11 (GH-120480)
(cherry picked from commit bac4eda)


gh-112346: Describe the "os" byte in gzip output change.

Co-authored-by: Gregory P. Smith <[email protected]>
mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024
…ction. (pythonGH-120486)

This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0.  this keeps things consistent.
mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024
…e in 3.11 (python#120480)

pythongh-112346: Describe the "os" byte in gzip output change.
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
…ction. (pythonGH-120486)

This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0.  this keeps things consistent.
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
…e in 3.11 (python#120480)

pythongh-112346: Describe the "os" byte in gzip output change.
estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024
…ction. (pythonGH-120486)

This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0.  this keeps things consistent.
estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024
…e in 3.11 (python#120480)

pythongh-112346: Describe the "os" byte in gzip output change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
Status: No status
Development

No branches or pull requests

5 participants