PEP 784 – Adding Zstandard to the standard library

Author:
Emma Harper Smith <emma at python.org>
Sponsor:
Gregory P. Smith <greg at krypto.org>
Discussions-To:
Discourse thread
Status:
Draft
Type:
Standards Track
Created:
06-Apr-2025
Python-Version:
3.14
Post-History:
07-Apr-2025

Abstract

Zstandard is a widely adopted, mature, and highly efficient compression standard. This PEP proposes adding a new module to the Python standard library containing a Python wrapper around Meta’s zstd library, the reference implementation of Zstandard. Additionally, to avoid name collisions with packages on PyPI and to present a unified interface to Python users, compression modules in the standard library will be moved under a compression.* package.

Motivation

CPython has modules for several different compression formats, such as zlib (DEFLATE), bzip2, and lzma, each widely used. Including popular compression algorithms matches Python’s “batteries included” philosophy of incorporating widely useful standards and utilities. lzma is the most recent such module, added in Python 3.3.

Since then, Zstandard has become the de facto preferred format for high-performance compression and decompression, attaining high compression ratios at reasonable CPU and memory cost. Zstandard achieves a much higher compression ratio than bzip2 or zlib (DEFLATE) while decompressing significantly faster than LZMA.

Zstandard has seen widespread adoption in many different areas of computing. The numerous hardware implementations demonstrate long-term commitment to Zstandard and an expectation that Zstandard will stay the de facto choice for compression for years to come. This is further evidenced by Zstandard’s IETF standardization in RFC 8478. Zstandard compression is also implemented in both the ZFS and Btrfs filesystems.

Owing to its highly efficient compression, Zstandard has supplanted other modern compression formats such as brotli, lzo, and ucl. While LZ4 is still used in very high throughput scenarios, Zstandard can also be used in some of these contexts. Inclusion of LZ4 is out of scope for this PEP, but it would be a compelling future addition to the compression namespace introduced here.

There are several bindings to Zstandard for Python available on PyPI, each with different APIs and choices of how to bind the zstd library. One goal with introducing an official module in the standard library is to reduce confusion for Python users who want simple compression/decompression APIs for Zstandard. The existing packages can continue providing extended APIs or integrate features from newer Zstandard versions.

Another reason to add Zstandard support to the standard library is to resolve a long-standing open issue (python/cpython#81276) requesting Zstandard support in the tarfile module. This issue has the fifth-most “thumbs up” reactions of any open issue on the CPython tracker, and has garnered a significant amount of discussion and interest. Additionally, the ZIP format standardizes a Zstandard compression format ID, and integration with the zipfile module would allow opening ZIP archives that use Zstandard compression. The reference implementation for this PEP contains integration with the zipfile, tarfile, and shutil modules.

Zstandard compression could also be used to make Python wheel packages smaller and significantly faster to install. Anaconda found a sizeable speedup when adopting Zstandard for the conda package format:

Conda’s download sizes are reduced ~30-40%, and extraction is dramatically faster. […] We see approximately a 2.5x overall speedup, almost all thanks to the dramatically faster extraction speed of the zstd compression used in the new file format.

Anaconda blog on Zstandard adoption

Zstandard has a significantly higher compression ratio compared to wheel’s existing zlib-based compression, according to lzbench, a comprehensive benchmark of many different compression libraries and formats. While this PEP does not prescribe any changes to the wheel format or other packaging standards, having Zstandard bindings in the standard library would enable a future PEP to improve the user experience for Python wheel packages.

Rationale

Introduction of a compression package

Both the zstd and zstandard import names are claimed by projects on PyPI. To avoid breaking users of one of the existing bindings, this PEP proposes introducing a new namespace for compression libraries, compression. This name is already reserved on PyPI for use in the standard library. The new Zstandard module will be compression.zstd. Other compression modules will be re-exported to the compression namespace and their current import names will be deprecated.

Providing a common namespace for compression modules has several advantages. First, it reduces user confusion about where to find compression modules. Second, the top level compression module could provide information on which compression formats are available, similar to hashlib’s algorithms_available. If PEP 775 is accepted, a compression.algorithms_guaranteed could be provided as well, listing zlib. Finally, a compression namespace prevents future issues with merging other compression formats into the standard library. New compression formats will likely be published to PyPI prior to integration into CPython. Therefore, any new compression format import name will likely already be claimed by the time a module would be considered for inclusion in CPython. Putting compression modules under a package prefix prevents issues with potential future name clashes.
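
As a rough illustration of the availability idea, the sketch below probes which compression modules can be imported in the running interpreter. The function name probe_available_formats and the candidate table are hypothetical and are not part of this PEP or of any proposed compression API; the sketch only shows the kind of information such an attribute could expose.

```python
import importlib

# Candidate import names per format, new namespace first. "compression.zstd"
# is the module proposed by this PEP; the table itself is illustrative only.
_CANDIDATES = {
    "zlib": ("compression.zlib", "zlib"),
    "bz2": ("compression.bz2", "bz2"),
    "lzma": ("compression.lzma", "lzma"),
    "zstd": ("compression.zstd",),
}


def probe_available_formats():
    """Return the set of format names whose module can be imported here."""
    available = set()
    for name, import_names in _CANDIDATES.items():
        for import_name in import_names:
            try:
                importlib.import_module(import_name)
            except ImportError:
                # Covers both a missing "compression" package and a module
                # whose underlying C library was not available at build time.
                continue
            available.add(name)
            break
    return frozenset(available)


print(sorted(probe_available_formats()))
```

The result depends on how the interpreter was built, so no fixed output is shown.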

Code that would like to remain compatible across Python versions may use the following pattern to ensure compatibility:

try:
    from compression.lzma import LZMAFile
except ImportError:
    from lzma import LZMAFile

This will use the newer import name when available and fall back to the old name otherwise.

Implementation based on pyzstd

The implementation for this PEP is based on the pyzstd project. This project was chosen as the code was originally written to be upstreamed to CPython by Ma Lin, who also wrote the output buffer implementation used in the standard library today. The project has since been taken over by Rogdham and is published to PyPI. The APIs in pyzstd are similar to the APIs for other compression modules in the standard library such as bz2 and lzma.

Minimum supported Zstandard version

The minimum supported Zstandard version is v1.4.5, released in May 2020. This minimum was chosen after reviewing the versions of Zstandard available in a number of Linux distribution package repositories, including LTS releases. The choice is rather conservative, to maximize compatibility with existing LTS Linux distributions; a newer Zstandard version could likely be chosen, given that newer Python releases are generally packaged as part of newer distribution releases.

Specification

The compression namespace

A new namespace for compression modules, named compression, will be introduced. The top-level module for this package will be empty to begin with, but a standard API for interacting with compression routines may be added to it in the future.

The compression.zstd module

A new module, compression.zstd, will be introduced with Zstandard compression APIs that match those of other compression modules in the standard library, namely:

  • compress() / decompress() - APIs for one-shot compression or decompression
  • ZstdFile / open() - APIs for interacting with streams and file-like objects
  • ZstdCompressor / ZstdDecompressor - APIs for incremental compression or decompression

It will also contain some Zstandard-specific functionality:

  • ZstdDict / train_dict() / finalize_dict() - APIs for interacting with Zstandard dictionaries, which are useful for compressing many small chunks of similar data
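
To illustrate what matching the other compression modules means in practice, the following sketch runs the same one-shot and incremental round trips against lzma and bz2, and against compression.zstd when it can be imported. The helper names one_shot_roundtrip and incremental_roundtrip are hypothetical, and the sketch assumes that ZstdCompressor and ZstdDecompressor accept the same compress()/flush()/decompress() calls as their lzma and bz2 counterparts.

```python
import bz2
import lzma


def one_shot_roundtrip(codec, data):
    """Compress and decompress with one call each; return the restored bytes."""
    return codec.decompress(codec.compress(data))


def incremental_roundtrip(compressor, decompressor, chunks):
    """Feed chunks through a compressor object, then decompress the result."""
    compressed = b"".join(compressor.compress(chunk) for chunk in chunks)
    compressed += compressor.flush()
    return decompressor.decompress(compressed)


payload = b"batteries included " * 100
chunks = [payload[i:i + 256] for i in range(0, len(payload), 256)]

codecs = [
    (lzma, lzma.LZMACompressor, lzma.LZMADecompressor),
    (bz2, bz2.BZ2Compressor, bz2.BZ2Decompressor),
]
try:
    from compression import zstd  # only present where this PEP is implemented
    codecs.append((zstd, zstd.ZstdCompressor, zstd.ZstdDecompressor))
except ImportError:
    pass

for codec, compressor_cls, decompressor_cls in codecs:
    assert one_shot_roundtrip(codec, payload) == payload
    assert incremental_roundtrip(compressor_cls(), decompressor_cls(), chunks) == payload
```

Because the helpers never name a specific format, the same calling code works for each module, which is the practical benefit of keeping the APIs aligned.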

libzstd optional dependency

The libzstd library will become an optional dependency of CPython. If the library is not available, the compression.zstd module will be unavailable. This is handled automatically on Unix platforms as part of the normal build environment detection.
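
Because the module may be missing from a given build, calling code may want to degrade gracefully. The sketch below is one possible pattern, not something this PEP prescribes: it prefers Zstandard when compression.zstd imports and otherwise falls back to zlib, recording which format was used. The function names compress_preferring_zstd and decompress_tagged are hypothetical.

```python
import zlib

try:
    from compression import zstd  # unavailable if CPython was built without libzstd
except ImportError:
    zstd = None


def compress_preferring_zstd(data):
    """Return (format_name, compressed_bytes), using Zstandard when possible."""
    if zstd is not None:
        return "zstd", zstd.compress(data)
    return "zlib", zlib.compress(data)


def decompress_tagged(format_name, blob):
    """Invert compress_preferring_zstd() using the recorded format name."""
    if format_name == "zstd":
        if zstd is None:
            raise RuntimeError("this Python build lacks compression.zstd")
        return zstd.decompress(blob)
    return zlib.decompress(blob)


fmt, blob = compress_preferring_zstd(b"example payload " * 50)
assert decompress_tagged(fmt, blob) == b"example payload " * 50
```

Storing the format name alongside the data matters here: a payload written on a build with libzstd cannot be read back on a build without it, and the error should say so explicitly.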

On Windows, libzstd will be added to the source dependencies used to build libraries CPython depends on for Windows.

Other compression modules

New import names compression.lzma, compression.bz2, and compression.zlib will be introduced in Python 3.14 re-exporting the contents of the existing lzma, bz2, and zlib modules respectively.
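
The sketch below illustrates what re-exporting means: the new name exposes the same objects as the old module rather than copies of them. It builds a stand-in module object at runtime instead of adding files to a package, so it is a simplified illustration under that assumption and not the reference implementation. The helper make_reexport_module and the module name compression_shim.lzma are hypothetical.

```python
import importlib
import lzma
import types


def make_reexport_module(new_name, old_name):
    """Build a module object exposing the public names of an existing module."""
    old = importlib.import_module(old_name)
    new = types.ModuleType(new_name, old.__doc__)
    public = getattr(old, "__all__", None)
    if public is None:
        # Modules without __all__: treat non-underscore names as public.
        public = [name for name in dir(old) if not name.startswith("_")]
    for name in public:
        setattr(new, name, getattr(old, name))
    new.__all__ = list(public)
    return new


shim = make_reexport_module("compression_shim.lzma", "lzma")
# Re-exported names are the very same objects, so isinstance checks and
# exception handling behave identically under either import name.
assert shim.LZMAFile is lzma.LZMAFile
assert shim.LZMAError is lzma.LZMAError
```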

Because the _compression module is marked private, it will be renamed immediately to compression._common.streams. The new name was selected because the module currently contains I/O-related helpers for stream APIs (e.g. LZMAFile) in standard library compression modules.

Compression module migration timeline

Existing modules will begin emitting a DeprecationWarning in the first Python release after the last version without the compression namespace reaches end of life. For example, if the compression namespace is introduced in 3.14, the DeprecationWarnings would first be emitted in 3.19, the next release after 3.13 reaches end of life. These warnings would therefore begin five years after the introduction of the compression namespace. In accordance with PEP 387, the existing modules will be removed in Python 3.24, five years after the DeprecationWarnings are added and ten years after the compression namespace is introduced; from then on, code must use the compression submodules. The documentation for these modules will be updated to discuss the planned deprecation and removal timelines.
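
The version numbers in this timeline can be reproduced with a short calculation. The sketch below assumes one feature release per year with 3.14 released in 2025, five years of support per release, and year-level granularity when deciding which release follows an end-of-life date; these assumptions are illustrative and are not stated in this section.

```python
# Assumptions for illustration: annual feature releases, 3.14 released in
# 2025, and five years of support for each release.
INTRO_MINOR = 14
INTRO_YEAR = 2025
SUPPORT_YEARS = 5
DEPRECATION_YEARS = 5  # deprecation period used in this PEP


def release_year(minor):
    """Release year of Python 3.<minor> under the annual-release assumption."""
    return INTRO_YEAR + (minor - INTRO_MINOR)


last_without_namespace = INTRO_MINOR - 1                          # 3.13
eol_year = release_year(last_without_namespace) + SUPPORT_YEARS   # 2029
# First feature release in a calendar year after that end of life.
warn_minor = INTRO_MINOR + (eol_year + 1 - INTRO_YEAR)            # 3.19
remove_minor = warn_minor + DEPRECATION_YEARS                     # 3.24

print(f"warnings start in 3.{warn_minor} ({release_year(warn_minor)})")
print(f"removal in 3.{remove_minor} ({release_year(remove_minor)})")
# prints:
# warnings start in 3.19 (2030)
# removal in 3.24 (2035)
```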

Backwards Compatibility

The main compatibility concern is usage of existing standard library compression APIs with the existing import names. These names will be deprecated in 3.19 and will be removed in 3.24. Given the long coexistence of the modules and a five-year deprecation period, most users will likely migrate to the new import names well before then. Additionally, a libCST codemod can be provided to automatically rewrite imports, easing the migration.
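
As a first step toward such a migration, a project can locate the imports that would need rewriting. The sketch below uses the standard library ast module as a stand-in for libCST, so it only reports affected imports and does not rewrite them. The function name find_old_compression_imports is hypothetical.

```python
import ast

# Top-level import names whose deprecation is proposed by this PEP.
OLD_NAMES = {"lzma", "bz2", "zlib"}


def find_old_compression_imports(source):
    """Return sorted (line, module) pairs for imports of the old names."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [(node.lineno, alias.name) for alias in node.names
                     if alias.name in OLD_NAMES]
        elif isinstance(node, ast.ImportFrom):
            # level == 0 excludes relative imports such as "from . import zlib".
            if node.level == 0 and node.module in OLD_NAMES:
                hits.append((node.lineno, node.module))
    return sorted(hits)


example = "import os, zlib\nfrom lzma import LZMAFile\nfrom compression import bz2\n"
print(find_old_compression_imports(example))
# prints [(1, 'zlib'), (2, 'lzma')]
```

Imports that already use the compression namespace, like the third line of the example, are left out of the report.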

Security Implications

As with any new C code, especially code operating on potentially untrusted user input, there are risks of memory safety issues. The author plans to contribute libFuzzer integration to enable fuzzing the _zstd code and ensure it is robust. Furthermore, there are a number of tests that exercise the compression and decompression routines. These tests pass without error when compiled with AddressSanitizer.

Taking on a new dependency also always has security risks, but the zstd library is mature, fuzzed on each commit, and participates in Meta’s bug bounty program.

How to Teach This

Documentation for the new module is in the reference implementation branch. The documentation for other modules will be updated to discuss the deprecation of their existing import names, and how to migrate.

Reference Implementation

The reference implementation contains the _zstd C code, the compression.zstd code, modifications to tarfile, shutil, and zipfile, and tests for each new API and integration added. It also contains the re-exports of other compression modules. Deprecations for the existing import names will be added once a decision is reached regarding the open issues.

Rejected Ideas

Name the module libzstd and do not make a new compression namespace

One option instead of making a new compression namespace would be to find a different name, such as libzstd, as the import name. However, the issue of existing import names is likely to persist for future compression formats added to the standard library. LZ4, a common high speed compression format, has a package on PyPI, lz4, with the import name lz4. Instead of solving this issue for each compression format, it is better to solve it once and for all by using the already-reserved compression namespace.


Source: https://github.com/python/peps/blob/main/peps/pep-0784.rst

Last modified: 2025-04-08 04:15:03 GMT