PEP 784 – Adding Zstandard to the standard library
Author: Emma Harper Smith <emma at python.org>
Sponsor: Gregory P. Smith <greg at krypto.org>
Discussions-To: Discourse thread
Status: Draft
Type: Standards Track
Created: 06-Apr-2025
Python-Version: 3.14
Post-History: 07-Apr-2025
Abstract
Zstandard is a widely adopted, mature, and highly efficient compression
standard. This PEP proposes adding a new module to the Python standard library
containing a Python wrapper around Meta’s zstd library, the default
implementation. Additionally, to avoid name collisions with packages on PyPI
and to present a unified interface to Python users, compression modules in the
standard library will be moved under a compression.* package.
Motivation
CPython has modules for several different compression formats, such as
zlib (DEFLATE), bzip2, and lzma, each widely used. Including popular
compression algorithms matches Python’s “batteries included” philosophy of
incorporating widely useful standards and utilities. lzma is the most recent
such module, added in Python 3.3.
Since then, Zstandard has become the de facto preferred compression library for high-performance compression and decompression, attaining high compression ratios at reasonable CPU and memory cost. Zstandard achieves a much higher compression ratio than bzip2 or zlib (DEFLATE) while decompressing significantly faster than LZMA.
Zstandard has seen widespread adoption in many different areas of computing. The numerous hardware implementations demonstrate long-term commitment to Zstandard and an expectation that Zstandard will stay the de facto choice for compression for years to come. This is further evidenced by Zstandard’s IETF standardization in RFC 8478. Zstandard compression is also implemented in both the ZFS and Btrfs filesystems.
Zstandard has supplanted other modern compression formats, such as brotli,
lzo, and ucl, due to its highly efficient compression. While LZ4 is still used
in very high throughput scenarios, Zstandard can also be used in some of these
contexts. While inclusion of LZ4 is out of scope, it would be a compelling
future addition to the compression namespace introduced by this PEP.
There are several bindings to Zstandard for Python available on PyPI, each with
different APIs and choices of how to bind the zstd library. One goal with
introducing an official module in the standard library is to reduce confusion
for Python users who want simple compression/decompression APIs for Zstandard.
The existing packages can continue providing extended APIs or integrate
features from newer Zstandard versions.
Another reason to add Zstandard support to the standard library is to resolve
a long-standing open issue (python/cpython#81276) requesting Zstandard
support in the tarfile module. This issue has the fifth most “thumbs up”
of open issues on the CPython tracker, and has garnered a significant amount of
discussion and interest. Additionally, the ZIP format standardizes a
Zstandard compression format ID, and integration with the zipfile module
would allow opening ZIP archives using Zstandard compression. The
reference implementation for this PEP contains integration with the
zipfile, tarfile, and shutil modules.
Zstandard compression could also be used to make Python wheel packages smaller and significantly faster to install. Anaconda found a sizeable speedup when adopting Zstandard for the conda package format:
Conda’s download sizes are reduced ~30-40%, and extraction is dramatically faster. […] We see approximately a 2.5x overall speedup, almost all thanks to the dramatically faster extraction speed of the zstd compression used in the new file format.
Zstandard has a significantly higher compression ratio compared to wheel’s existing zlib-based compression, according to lzbench, a comprehensive benchmark of many different compression libraries and formats. While this PEP does not prescribe any changes to the wheel format or other packaging standards, having Zstandard bindings in the standard library would enable a future PEP to improve the user experience for Python wheel packages.
Rationale
Introduction of a compression package
Both the zstd and zstandard import names are claimed by projects on
PyPI. To avoid breaking users of one of the existing bindings, this PEP
proposes introducing a new namespace for compression libraries,
compression. This name is already reserved on PyPI for use in the
standard library. The new Zstandard module will be compression.zstd.
Other compression modules will be re-exported to the compression namespace
and their current import names will be deprecated.
Providing a common namespace for compression modules has several advantages.
First, it reduces user confusion about where to find compression modules.
Second, the top level compression module could provide information on which
compression formats are available, similar to hashlib’s
algorithms_available. If PEP 775 is accepted, a
compression.algorithms_guaranteed could be provided as well, listing
zlib. Finally, a compression namespace prevents future issues with
merging other compression formats into the standard library. New compression
formats will likely be published to PyPI prior to integration into
CPython. Therefore, any new compression format import name will likely already
be claimed by the time a module would be considered for inclusion in CPython.
Putting compression modules under a package prefix prevents issues with
potential future name clashes.
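As an illustration of the availability information mentioned above, the following is a minimal sketch of how code could discover which formats can be imported today. The helper name probe_available_formats and the _CANDIDATES table are hypothetical and not part of this PEP; the PEP only suggests that something similar to hashlib's algorithms_available could exist in the future.

```python
import importlib

# Hypothetical lookup table: format name -> import names to try, new name first.
# The import names themselves are the ones discussed in this PEP.
_CANDIDATES = {
    "zlib": ("compression.zlib", "zlib"),
    "bz2": ("compression.bz2", "bz2"),
    "lzma": ("compression.lzma", "lzma"),
    "zstd": ("compression.zstd",),
}


def probe_available_formats():
    """Return the set of formats for which at least one module imports."""
    available = set()
    for fmt, import_names in _CANDIDATES.items():
        for import_name in import_names:
            try:
                importlib.import_module(import_name)
            except ImportError:
                # Covers a missing package, a missing submodule, and a
                # missing underlying C library.
                continue
            available.add(fmt)
            break
    return available


formats = probe_available_formats()
print(sorted(formats))
```

Importing the module, rather than only locating it, is a deliberate choice in this sketch: a module file can be present while the C library it wraps is missing from the build.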
Code that would like to remain compatible across Python versions may use the following pattern to ensure compatibility:
    try:
        from compression.lzma import LZMAFile
    except ImportError:
        from lzma import LZMAFile
This will use the newer import name when available and fall back to the old name otherwise.
Implementation based on pyzstd
The implementation for this PEP is based on the pyzstd project.
This project was chosen as the code was originally written to be upstreamed
to CPython by Ma Lin, who also wrote the output buffer implementation used in
the standard library today.
The project has since been taken over by Rogdham and is published to PyPI. The
APIs in pyzstd are similar to the APIs for other compression modules in the
standard library such as bz2 and lzma.
Minimum supported Zstandard version
The minimum supported Zstandard version is v1.4.5, released in May 2020. This minimum was chosen after reviewing the versions of Zstandard available in a number of Linux distribution package repositories, including LTS releases. The choice is rather conservative to maximize compatibility with existing LTS Linux distributions, but a newer Zstandard version could likely be chosen given that newer Python releases are generally packaged as part of newer distribution releases.
Specification
The compression namespace
A new namespace for compression modules will be introduced named
compression. The top-level module for this package will be empty to begin
with, but a standard API for interacting with compression routines may be
added to the top level in the future.
The compression.zstd module
A new module, compression.zstd, will be introduced with Zstandard
compression APIs that match other compression modules in the standard library,
namely
- compress() / decompress(): APIs for one-shot compression or decompression
- ZstdFile / open(): APIs for interacting with streams and file-like objects
- ZstdCompressor / ZstdDecompressor: APIs for incremental compression or decompression
It will also contain some Zstandard-specific functionality:
- ZstdDict / train_dict() / finalize_dict(): APIs for interacting with Zstandard dictionaries, which are useful for compressing many small chunks of similar data
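To show what matching the other compression modules means in practice, here is a small sketch that runs the same one-shot round trip through every available module. It assumes only the module-level compress() and decompress() functions named above; the roundtrip helper and the payload are illustrative, and compression.zstd is included only when the import succeeds (a Python build with the new module and libzstd).

```python
import bz2
import lzma
import zlib

try:
    from compression import zstd  # only present where this PEP is implemented
except ImportError:
    zstd = None


def roundtrip(codec, payload):
    """Compress and decompress with a module's one-shot API; return the size."""
    compressed = codec.compress(payload)
    if codec.decompress(compressed) != payload:
        raise ValueError(f"{codec.__name__} round trip did not restore the data")
    return len(compressed)


payload = b"batteries included " * 500
codecs = [zlib, bz2, lzma]
if zstd is not None:
    codecs.append(zstd)

# Map each module name to the compressed size of the same payload.
sizes = {codec.__name__: roundtrip(codec, payload) for codec in codecs}
for name, size in sizes.items():
    print(f"{name}: {len(payload)} -> {size} bytes")
```

Because the helper only relies on the shared function names, adding the Zstandard module requires no change to the calling code beyond the guarded import.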
libzstd optional dependency
The libzstd library will become an optional dependency of CPython. If the
library is not available, the compression.zstd module will be unavailable.
This is handled automatically on Unix platforms as part of the normal build
environment detection. On Windows, libzstd will be added to the source
dependencies used to build libraries CPython depends on for Windows.
Other compression modules
New import names compression.lzma, compression.bz2, and
compression.zlib will be introduced in Python 3.14 re-exporting the
contents of the existing lzma, bz2, and zlib modules respectively.
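The key property of a re-export is that both import names refer to the same objects, so classes and exceptions compare identical whichever name was used. The sketch below builds a stand-in module object in memory to demonstrate this; the make_reexport helper and the compression_sketch.lzma name are hypothetical and are not how the reference implementation is necessarily written.

```python
import lzma
import types


def make_reexport(original, new_name):
    """Build a module object exposing the public names of ``original``."""
    module = types.ModuleType(new_name, original.__doc__)
    public = getattr(original, "__all__", None)
    if public is None:
        # Fall back to every non-underscore attribute when __all__ is absent.
        public = [name for name in dir(original) if not name.startswith("_")]
    for name in public:
        # Bind the very same object, not a copy.
        setattr(module, name, getattr(original, name))
    module.__all__ = list(public)
    return module


sketch_lzma = make_reexport(lzma, "compression_sketch.lzma")

# Objects are shared, so isinstance checks and except clauses keep working
# regardless of which import name a caller used.
print(sketch_lzma.LZMAFile is lzma.LZMAFile)
print(sketch_lzma.decompress(lzma.compress(b"same objects")))
```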
The _compression module, given that it is marked private, will be
immediately renamed to compression._common.streams. The new name was
selected due to the current contents of the module being I/O related helpers
for stream APIs (e.g. LZMAFile) in standard library compression modules.
Compression module migration timeline
Existing modules will emit a DeprecationWarning in the first Python release
after the last Python version without the compression namespace reaches end
of life. For example, if the compression namespace is introduced in 3.14,
then the DeprecationWarnings would be emitted in 3.19, the next release
after 3.13 reaches end of life. These warnings would begin five years after the
introduction of the compression namespace. In accordance with PEP 387, in
Python 3.24, five years after the DeprecationWarnings are added and ten
years after the new compression namespace is introduced, the existing
modules will be removed and code must use the compression sub-modules. The
documentation for these modules will be updated to discuss the planned
deprecation and removal timelines.
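The version arithmetic in this timeline can be checked with a short sketch. It assumes one feature release per year and five years of support per release, which is what makes the five-year and ten-year figures line up with 3.19 and 3.24; the function and key names are illustrative only.

```python
def migration_timeline(intro_minor, support_years=5, warning_years=5):
    """Return the 3.x minor version for each milestone of the migration.

    Assumes one feature release per year, so adding N years to a release
    means adding N to its minor version number.
    """
    warnings_start = intro_minor + support_years
    removal = warnings_start + warning_years
    return {
        "namespace_introduced": intro_minor,
        "last_release_without_namespace": intro_minor - 1,
        "deprecation_warnings_start": warnings_start,
        "old_names_removed": removal,
    }


timeline = migration_timeline(14)
for milestone, minor in timeline.items():
    print(f"{milestone}: 3.{minor}")
```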
Backwards Compatibility
The main compatibility concern is usage of existing standard library compression APIs with the existing import names. These names will be deprecated in 3.19 and will be removed in 3.24. Given the long coexistence of the modules and a five-year deprecation period, most users will likely migrate to the new import names well before then. Additionally, a libCST codemod can be provided to automatically rewrite imports, easing the migration.
Security Implications
As with any new C code, especially code operating on potentially untrusted user
input, there are risks of memory safety issues. The author plans on
contributing integration with libFuzzer to enable fuzzing the _zstd code
and ensure it is robust. Furthermore, there are a number of tests that exercise
the compression and decompression routines. These tests pass without error when
compiled with AddressSanitizer.
Taking on a new dependency also always has security risks, but the zstd
library is mature, fuzzed on each commit, and participates in Meta’s bug bounty
program.
How to Teach This
Documentation for the new module is in the reference implementation branch. The documentation for other modules will be updated to discuss the deprecation of their existing import names, and how to migrate.
Reference Implementation
The reference implementation contains the _zstd C code, the
compression.zstd code, modifications to tarfile, shutil, and zipfile,
and tests for each new API and integration added. It also contains the
re-exports of other compression modules. Deprecations for the existing import
names will be added once a decision is reached regarding the open issues.
Rejected Ideas
Name the module libzstd and do not make a new compression namespace
One option instead of making a new compression namespace would be to find
a different name, such as libzstd, as the import name. However, the issue
of existing import names is likely to persist for future compression formats
added to the standard library. LZ4, a common high speed compression format,
has a package on PyPI, lz4, with the import name lz4. Instead of solving this
issue for each compression format, it is better to solve it once and for all
by using the already-reserved compression namespace.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python/peps/blob/main/peps/pep-0784.rst
Last modified: 2025-04-08 04:15:03 GMT