Skip to content

Commit

Permalink
bpo-43613: Faster implementation of gzip.compress and gzip.decompress (
Browse files Browse the repository at this point in the history
…pythonGH-27941)

Co-authored-by: Łukasz Langa <[email protected]>
  • Loading branch information
rhpvorderman and ambv authored Sep 2, 2021
1 parent a7ef15a commit ea23e78
Show file tree
Hide file tree
Showing 8 changed files with 193 additions and 87 deletions.
17 changes: 14 additions & 3 deletions Doc/library/gzip.rst
Original file line number Diff line number Diff line change
Expand Up @@ -174,19 +174,30 @@ The module defines the following items:

Compress the *data*, returning a :class:`bytes` object containing
the compressed data. *compresslevel* and *mtime* have the same meaning as in
the :class:`GzipFile` constructor above.
the :class:`GzipFile` constructor above. When *mtime* is set to ``0``, this
function is equivalent to :func:`zlib.compress` with *wbits* set to ``31``.
The zlib function is faster.

.. versionadded:: 3.2
.. versionchanged:: 3.8
Added the *mtime* parameter for reproducible output.
.. versionchanged:: 3.11
Speed is improved by compressing all data at once instead of in a
streamed fashion. Calls with *mtime* set to ``0`` are delegated to
:func:`zlib.compress` for better speed.

.. function:: decompress(data)

Decompress the *data*, returning a :class:`bytes` object containing the
uncompressed data.
uncompressed data. This function is capable of decompressing multi-member
gzip data (multiple gzip blocks concatenated together). When the data is
certain to contain only one member the :func:`zlib.decompress` function with
*wbits* set to 31 is faster.

.. versionadded:: 3.2

.. versionchanged:: 3.11
Speed is improved by decompressing members at once in memory instead of in
a streamed fashion.

.. _gzip-usage-examples:

Expand Down
46 changes: 28 additions & 18 deletions Doc/library/zlib.rst
Original file line number Diff line number Diff line change
Expand Up @@ -47,19 +47,43 @@ The available exception and functions in this module are:
platforms, use ``adler32(data) & 0xffffffff``.


.. function:: compress(data, /, level=-1)
.. function:: compress(data, /, level=-1, wbits=MAX_WBITS)

Compresses the bytes in *data*, returning a bytes object containing compressed data.
*level* is an integer from ``0`` to ``9`` or ``-1`` controlling the level of compression;
``1`` (Z_BEST_SPEED) is fastest and produces the least compression, ``9`` (Z_BEST_COMPRESSION)
is slowest and produces the most. ``0`` (Z_NO_COMPRESSION) is no compression.
The default value is ``-1`` (Z_DEFAULT_COMPRESSION). Z_DEFAULT_COMPRESSION represents a default
compromise between speed and compression (currently equivalent to level 6).

.. _compress-wbits:

The *wbits* argument controls the size of the history buffer (or the
"window size") used when compressing data, and whether a header and
trailer is included in the output. It can take several ranges of values,
defaulting to ``15`` (MAX_WBITS):

* +9 to +15: The base-two logarithm of the window size, which
therefore ranges between 512 and 32768. Larger values produce
better compression at the expense of greater memory usage. The
resulting output will include a zlib-specific header and trailer.

* −9 to −15: Uses the absolute value of *wbits* as the
window size logarithm, while producing a raw output stream with no
header or trailing checksum.

* +25 to +31 = 16 + (9 to 15): Uses the low 4 bits of the value as the
window size logarithm, while including a basic :program:`gzip` header
and trailing checksum in the output.

Raises the :exc:`error` exception if any error occurs.

.. versionchanged:: 3.6
*level* can now be used as a keyword parameter.

.. versionchanged:: 3.11
The *wbits* parameter is now available to set window bits and
compression type.

.. function:: compressobj(level=-1, method=DEFLATED, wbits=MAX_WBITS, memLevel=DEF_MEM_LEVEL, strategy=Z_DEFAULT_STRATEGY[, zdict])

Expand All @@ -76,23 +100,9 @@ The available exception and functions in this module are:
*method* is the compression algorithm. Currently, the only supported value is
:const:`DEFLATED`.

The *wbits* argument controls the size of the history buffer (or the
"window size") used when compressing data, and whether a header and
trailer is included in the output. It can take several ranges of values,
defaulting to ``15`` (MAX_WBITS):

* +9 to +15: The base-two logarithm of the window size, which
therefore ranges between 512 and 32768. Larger values produce
better compression at the expense of greater memory usage. The
resulting output will include a zlib-specific header and trailer.

* −9 to −15: Uses the absolute value of *wbits* as the
window size logarithm, while producing a raw output stream with no
header or trailing checksum.

* +25 to +31 = 16 + (9 to 15): Uses the low 4 bits of the value as the
window size logarithm, while including a basic :program:`gzip` header
and trailing checksum in the output.
The *wbits* parameter controls the size of the history buffer (or the
"window size"), and what header and trailer format will be used. It has
the same meaning as `described for compress() <#compress-wbits>`__.

The *memLevel* argument controls the amount of memory used for the
internal compression state. Valid values range from ``1`` to ``9``.
Expand Down
161 changes: 108 additions & 53 deletions Lib/gzip.py
Original file line number Diff line number Diff line change
Expand Up @@ -403,6 +403,59 @@ def __iter__(self):
return self._buffer.__iter__()


def _read_exact(fp, n):
'''Read exactly *n* bytes from `fp`
This method is required because fp may be unbuffered,
i.e. return short reads.
'''
data = fp.read(n)
while len(data) < n:
b = fp.read(n - len(data))
if not b:
raise EOFError("Compressed file ended before the "
"end-of-stream marker was reached")
data += b
return data


def _read_gzip_header(fp):
'''Read a gzip header from `fp` and progress to the end of the header.
Returns last mtime if header was present or None otherwise.
'''
magic = fp.read(2)
if magic == b'':
return None

if magic != b'\037\213':
raise BadGzipFile('Not a gzipped file (%r)' % magic)

(method, flag, last_mtime) = struct.unpack("<BBIxx", _read_exact(fp, 8))
if method != 8:
raise BadGzipFile('Unknown compression method')

if flag & FEXTRA:
# Read & discard the extra field, if present
extra_len, = struct.unpack("<H", _read_exact(fp, 2))
_read_exact(fp, extra_len)
if flag & FNAME:
# Read and discard a null-terminated string containing the filename
while True:
s = fp.read(1)
if not s or s==b'\000':
break
if flag & FCOMMENT:
# Read and discard a null-terminated string containing a comment
while True:
s = fp.read(1)
if not s or s==b'\000':
break
if flag & FHCRC:
_read_exact(fp, 2) # Read & discard the 16-bit header CRC
return last_mtime


class _GzipReader(_compression.DecompressReader):
def __init__(self, fp):
super().__init__(_PaddedFile(fp), zlib.decompressobj,
Expand All @@ -415,53 +468,11 @@ def _init_read(self):
self._crc = zlib.crc32(b"")
self._stream_size = 0 # Decompressed size of unconcatenated stream

def _read_exact(self, n):
'''Read exactly *n* bytes from `self._fp`
This method is required because self._fp may be unbuffered,
i.e. return short reads.
'''

data = self._fp.read(n)
while len(data) < n:
b = self._fp.read(n - len(data))
if not b:
raise EOFError("Compressed file ended before the "
"end-of-stream marker was reached")
data += b
return data

def _read_gzip_header(self):
magic = self._fp.read(2)
if magic == b'':
last_mtime = _read_gzip_header(self._fp)
if last_mtime is None:
return False

if magic != b'\037\213':
raise BadGzipFile('Not a gzipped file (%r)' % magic)

(method, flag,
self._last_mtime) = struct.unpack("<BBIxx", self._read_exact(8))
if method != 8:
raise BadGzipFile('Unknown compression method')

if flag & FEXTRA:
# Read & discard the extra field, if present
extra_len, = struct.unpack("<H", self._read_exact(2))
self._read_exact(extra_len)
if flag & FNAME:
# Read and discard a null-terminated string containing the filename
while True:
s = self._fp.read(1)
if not s or s==b'\000':
break
if flag & FCOMMENT:
# Read and discard a null-terminated string containing a comment
while True:
s = self._fp.read(1)
if not s or s==b'\000':
break
if flag & FHCRC:
self._read_exact(2) # Read & discard the 16-bit header CRC
self._last_mtime = last_mtime
return True

def read(self, size=-1):
Expand Down Expand Up @@ -524,7 +535,7 @@ def _read_eof(self):
# We check that the computed CRC and size of the
# uncompressed data matches the stored values. Note that the size
# stored is the true file size mod 2**32.
crc32, isize = struct.unpack("<II", self._read_exact(8))
crc32, isize = struct.unpack("<II", _read_exact(self._fp, 8))
if crc32 != self._crc:
raise BadGzipFile("CRC check failed %s != %s" % (hex(crc32),
hex(self._crc)))
Expand All @@ -544,21 +555,65 @@ def _rewind(self):
super()._rewind()
self._new_member = True


def _create_simple_gzip_header(compresslevel: int,
mtime = None) -> bytes:
"""
Write a simple gzip header with no extra fields.
:param compresslevel: Compresslevel used to determine the xfl bytes.
:param mtime: The mtime (must support conversion to a 32-bit integer).
:return: A bytes object representing the gzip header.
"""
if mtime is None:
mtime = time.time()
if compresslevel == _COMPRESS_LEVEL_BEST:
xfl = 2
elif compresslevel == _COMPRESS_LEVEL_FAST:
xfl = 4
else:
xfl = 0
# Pack ID1 and ID2 magic bytes, method (8=deflate), header flags (no extra
# fields added to header), mtime, xfl and os (255 for unknown OS).
return struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, int(mtime), xfl, 255)


def compress(data, compresslevel=_COMPRESS_LEVEL_BEST, *, mtime=None):
"""Compress data in one shot and return the compressed string.
Optional argument is the compression level, in range of 0-9.
compresslevel sets the compression level in range of 0-9.
mtime can be used to set the modification time. The modification time is
set to the current time by default.
"""
buf = io.BytesIO()
with GzipFile(fileobj=buf, mode='wb', compresslevel=compresslevel, mtime=mtime) as f:
f.write(data)
return buf.getvalue()
if mtime == 0:
# Use zlib as it creates the header with 0 mtime by default.
# This is faster and with less overhead.
return zlib.compress(data, level=compresslevel, wbits=31)
header = _create_simple_gzip_header(compresslevel, mtime)
trailer = struct.pack("<LL", zlib.crc32(data), (len(data) & 0xffffffff))
# Wbits=-15 creates a raw deflate block.
return header + zlib.compress(data, wbits=-15) + trailer


def decompress(data):
"""Decompress a gzip compressed string in one shot.
Return the decompressed string.
"""
with GzipFile(fileobj=io.BytesIO(data)) as f:
return f.read()
decompressed_members = []
while True:
fp = io.BytesIO(data)
if _read_gzip_header(fp) is None:
return b"".join(decompressed_members)
# Use a zlib raw deflate compressor
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
# Read all the data except the header
decompressed = do.decompress(data[fp.tell():])
crc, length = struct.unpack("<II", do.unused_data[:8])
if crc != zlib.crc32(decompressed):
raise BadGzipFile("CRC check failed")
if length != (len(decompressed) & 0xffffffff):
raise BadGzipFile("Incorrect length of data produced")
decompressed_members.append(decompressed)
data = do.unused_data[8:].lstrip(b"\x00")


def main():
Expand Down
7 changes: 7 additions & 0 deletions Lib/test/test_zlib.py
Original file line number Diff line number Diff line change
Expand Up @@ -831,6 +831,13 @@ def test_wbits(self):
dco = zlib.decompressobj(32 + 15)
self.assertEqual(dco.decompress(gzip), HAMLET_SCENE)

for wbits in (-15, 15, 31):
with self.subTest(wbits=wbits):
expected = HAMLET_SCENE
actual = zlib.decompress(
zlib.compress(HAMLET_SCENE, wbits=wbits), wbits=wbits
)
self.assertEqual(expected, actual)

def choose_lines(source, number, seed=None, generator=random):
"""Return a list of number lines randomly chosen from the source"""
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
:func:`zlib.compress` now accepts a wbits parameter which allows users to
compress data as a raw deflate block without zlib headers and trailers in
one go. Previously this required instantiating a ``zlib.compressobj``. It
also provides a faster alternative to ``gzip.compress`` when wbits=31 is
used.
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Improve the speed of :func:`gzip.compress` and :func:`gzip.decompress` by
compressing and decompressing at once in memory instead of in a streamed
fashion.
Loading

0 comments on commit ea23e78

Please sign in to comment.