
Faster decompression of gzip files #95534

Closed
rhpvorderman opened this issue Aug 1, 2022 · 5 comments
Labels
3.12 (bugs and security fixes), performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@rhpvorderman
Contributor

rhpvorderman commented Aug 1, 2022

Pitch

Decompressing gzip streams is an extremely common operation. Most web browsers support gzip decompression, so most (virtually all) servers return gzip-compressed data when gzip support is advertised via the request headers. Tar.gz files are an extremely common way to archive files. Zip files use the same deflate compression internally.

Speeding this up by a non-trivial amount is therefore very advantageous.

Feature or enhancement

The current gzip reading pipeline can be improved quite a lot. This is the current way of doing things (a minimal sketch follows the list):

  • Read io.DEFAULT_BUFFER_SIZE bytes of data from a _PaddedFile object.
  • Feed it to a zlib.decompressobj() via its decompress(raw_data, size) method.
  • Internally, decompress always starts with a 16 KB buffer, regardless of the requested size. When the output is 64 KB, it needs to be resized twice.
  • The decompressed data is returned; anything not returned is saved in an unconsumed_tail object.
  • The unconsumed_tail is used to rewind the _PaddedFile object to the correct position.
  • The length and crc32 of the decompressed data are taken and used to update the GzipReader state.
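A minimal sketch of this loop, assuming `fileobj` stands in for the _PaddedFile wrapper and `decompressor` for a `zlib.decompressobj(wbits=-zlib.MAX_WBITS)` instance (simplified, not the exact CPython code):

```python
import io
import zlib


def read_current(fileobj, decompressor, size):
    """Simplified sketch of the existing _GzipReader.read() hot path."""
    while True:
        # Only a small chunk of compressed input is read at a time (8 KB).
        buf = fileobj.read(io.DEFAULT_BUFFER_SIZE)
        # decompress() starts from a 16 KB output buffer internally and keeps
        # doubling it, no matter how large `size` is.
        data = decompressor.decompress(buf, size)
        if decompressor.unconsumed_tail:
            # Compressed bytes that were not consumed become a new bytes object
            # and are pushed back into the file, only to be read (and copied)
            # again on the next call.
            fileobj.prepend(decompressor.unconsumed_tail)
        if data:
            return data  # the caller updates the length and crc32 afterwards
        if not buf:
            return b""
```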

This has some severe disadvantages when reading large blocks:

  • Gzip compresses between 50 and 90% in most cases. This means the decompress function is going to return anywhere between 16 and 80 KB at most for an 8 KB input read. When 128 KB is requested from the read function, the 128 KB is not going to be filled, leading to unnecessary read requests.
  • In the above case, say the typical return size is 37 KB. Because zlib's default output buffer is 16 KB, there will be two calls to resize the memory of the return object (16->32->64). EDIT: actually it is three, since the end product is also resized to fit the contents (16->32->64->37).

This also has some severe disadvantages when reading small blocks:

  • When reading individual lines from a file (quite common), this actually queries an io.BufferedReader instance, which reads from the _GzipReader in io.DEFAULT_BUFFER_SIZE chunks (a small illustration follows this list).
  • This means only 8 KB is requested, but still 8 KB of compressed data is read from the _PaddedFile object. With typical compression rates, producing those 8 KB leaves anywhere between 4 KB and 7 KB of unconsumed tail, which is returned as a new object, meaning memory is allocated again. The same data is reread in the next iteration, so with a 70% compression rate the same data is memcpy'd around 2 to 3 times.
  • Continuous rewinding of the _PaddedFile object.
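To illustrate the small-read case: iterating over the lines of a gzip file goes through exactly that 8 KB chunked path (the file name below is just a placeholder):

```python
import gzip

# Line iteration is served by an io.BufferedReader inside GzipFile, which
# refills itself with io.DEFAULT_BUFFER_SIZE (8 KB) reads from _GzipReader.
with gzip.open("example.txt.gz", "rt") as handle:  # placeholder file name
    for line in handle:
        pass  # each 8 KB refill currently re-reads part of the previous chunk
```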

How to improve this:

  • Use the structure of the _bz2.BZ2Decompressor. Submitted data is copied into an internal buffer, and data from this internal buffer is used to fulfil decompress calls. This has the advantage of not endlessly allocating and destroying unconsumed_tail objects.
  • Read 128 KB at once from the _PaddedFile object, and only read this data when the decompressor's needs_input attribute is True. This prevents querying the _PaddedFile object too much.
  • When decompress is called and a maxsize is given (say 128 KB), allocate maxsize immediately instead of allocating only 16 KB and resizing later.

This prevents a lot of calls to the Python memory allocator. A rough sketch of the resulting read loop follows.
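The sketch below assumes a BZ2Decompressor-style object: `decompressor` is a stand-in that buffers unconsumed compressed data internally and exposes `needs_input` and `eof`, as _bz2.BZ2Decompressor does. The 128 KB constant is spelled out for illustration.

```python
READ_BUFFER_SIZE = 128 * 1024  # read compressed input in 128 KB chunks


def read_proposed(fileobj, decompressor, size):
    """Rough sketch of the restructured _GzipReader.read() loop."""
    while not decompressor.eof:
        if decompressor.needs_input:
            # Only touch the file when the decompressor's internal buffer of
            # compressed data is exhausted; no unconsumed_tail, no rewinding
            # of the _PaddedFile object.
            buf = fileobj.read(READ_BUFFER_SIZE)
            if not buf:
                break
        else:
            buf = b""
        # Passing `size` allows the output buffer to be allocated at the
        # requested size up front instead of growing from 16 KB.
        data = decompressor.decompress(buf, size)
        if data:
            return data  # the caller updates the length and crc32 afterwards
    return b""
```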

This restructuring has already been implemented in python-isal. That project is a modification of zlibmodule.c to use the ISA-L optimizations. While this did improve speed, I also looked at other ways to improve the performance. By restructuring the gzip module and the zlib code the Python overhead was significantly reduced.

Relevant code:

Most of this code can be seamlessly copied back into CPython, which I will do when I have the time. This can best be done after the 3.11 release, I think.

Previous discussion

N/A. This is a performance enhancement, so not necessarily a new feature, but also not a bug.

@rhpvorderman rhpvorderman added the type-feature A feature request or enhancement label Aug 1, 2022
@gpshead
Member

gpshead commented Aug 1, 2022

PRs welcome!

@corona10
Member

corona10 commented Aug 1, 2022

PRs welcome!

Plus, would you like to provide a microbenchmark by using pyperf?

@mdboom mdboom added the performance Performance or resource usage label Aug 1, 2022
@rhpvorderman
Contributor Author

Thank you for your enthusiasm! I will put it on my todo list right away.

Sure, I can look and see if I can provide a microbenchmark. I have to say though: these changes really manifest when decompressing gzip files with bioinformatics data. Normal sizes for us are 10-20 GB for Whole Exome Sequencing or RNA Sequencing, and ~100 GB for Whole Genome Sequencing. So that is "real-world" gzip data for me (you can see why this topic has my interest). That is not really "micro" though. I will look around for a suitable real-world case. I am currently thinking of decompressing tar.gz files.

@iritkatriel iritkatriel added 3.12 bugs and security fixes stdlib Python modules in the Lib dir labels Sep 24, 2022
rhpvorderman added a commit to rhpvorderman/cpython that referenced this issue Sep 30, 2022
@rhpvorderman
Contributor Author

I made a PR. Microbenchmarks and results here for the interested:

Before (main):

./python -m pyperf timeit -s "import gzip; g=gzip.open('cpython-3.10.7.tar.gz', 'rb'); it=iter(lambda:g.read(128*1024), b'');" "for _ in it: pass"
.....................
Mean +- std dev: 301 ms +- 2 ms

After (with the PR applied):

./python -m pyperf timeit -s "import gzip; g=gzip.open('cpython-3.10.7.tar.gz', 'rb'); it=iter(lambda:g.read(128*1024), b'');" "for _ in it: pass"
.....................
Mean +- std dev: 270 ms +- 1 ms

So a 10% performance improvement. Given that most of the work is done in the zlib library, this is a substantial reduction of overhead costs.

gpshead pushed a commit that referenced this issue Oct 17, 2022
Change summary:
+ There is now a `gzip.READ_BUFFER_SIZE` constant that is 128 KB. Other programs that read in 128 KB chunks: pigz and cat. So this seems best practice among good programs. It is also faster than 8 KB chunks. (A short usage sketch follows below.)
+ A zlib._ZlibDecompressor was added. This is the _bz2.BZ2Decompressor ported to zlib. Since the zlib.Decompress object is better for in-memory decompression, the _ZlibDecompressor is kept private. It only makes sense for file decompression, and that is already implemented now in the gzip library, so there is no need to bother users with it.
+ The _ZlibDecompressor uses the older CPython arrange_output_buffer functions, as those are faster and more appropriate for this use case.
+ GzipFile.read has been optimized. There is no longer an `unconsumed_tail` member to write back to the padded file; this is instead handled by the _ZlibDecompressor itself, which has an internal buffer. `_add_read_data` has been inlined, as it was just two calls.

EDIT: While I am adding improvements anyway, I figured I could add another one-liner optimization to the `python -m gzip` application. It previously read chunks of io.DEFAULT_BUFFER_SIZE, but has been updated to use READ_BUFFER_SIZE chunks.
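A short usage sketch of the new constant mentioned in the first bullet above (the archive name is taken from the benchmark earlier in this thread; `gzip.READ_BUFFER_SIZE` is only available once this change has landed):

```python
import gzip

# Read a gzip stream in the new 128 KB chunk size instead of 8 KB chunks.
with gzip.open("cpython-3.10.7.tar.gz", "rb") as fh:
    while chunk := fh.read(gzip.READ_BUFFER_SIZE):
        pass  # process `chunk`
```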
@gpshead
Member

gpshead commented Oct 17, 2022

Thanks for doing this!

@gpshead gpshead closed this as completed Oct 17, 2022
carljm added a commit to carljm/cpython that referenced this issue Oct 17, 2022
* main: (31 commits), including:
  pythongh-95534: Improve gzip reading speed by 10% (python#97664)
  ...