PEP: 552 Title: Deterministic pycs Version: $Revision$ Last-Modified:
$Date$ Author: Benjamin Peterson <benjamin@python.org> Status: Final
Type: Standards Track Content-Type: text/x-rst Created: 04-Sep-2017
Python-Version: 3.7 Post-History: 07-Sep-2017 Resolution:
https://mail.python.org/pipermail/python-dev/2017-September/149649.html

Abstract

This PEP proposes an extension to the pyc format to make it more
deterministic.

Rationale

A reproducible build is one where the same byte-for-byte output is
generated every time the same sources are built—even across different
machines (naturally subject to the requirement that they have rather
similar environments set up). Reproducibility is important for security.
It is also a key concept in content-based build systems such as Bazel,
which are most effective when the output files’ contents are a
deterministic function of the input files’ contents.

The current Python pyc format is the marshaled code object of the module
prefixed by a magic number, the source timestamp, and the source file
size. The presence of a source timestamp means that a pyc is not a
deterministic function of the input file’s contents—it also depends on
volatile metadata, the mtime of the source. Thus, pycs are a barrier to
proper reproducibility.

Distributors of Python code are currently stuck with the options of

1.  not distributing pycs and losing the caching advantages
2.  distributing pycs and losing reproducibility
3.  carefully giving all Python source files a deterministic timestamp
    (see, for example, https://github.com/python/cpython/pull/296)
4.  doing a complicated mixture of 1. and 2. like generating pycs at
    installation time

None of these options are very attractive. This PEP proposes allowing
the timestamp to be replaced with a deterministic hash. The current
timestamp invalidation method will remain the default, though. Despite
its nondeterminism, timestamp invalidation works well for many workflows
and usecases. The hash-based pyc format can impose the cost of reading
and hashing every source file, which is more expensive than simply
checking timestamps. Thus, for now, we expect it to be used mainly by
distributors and power use cases.

(Note there are other problems[1][2] we do not address here that can
make pycs non-deterministic.)

Specification

The pyc header currently consists of 3 32-bit words. We will expand it
to 4. The first word will continue to be the magic number, versioning
the bytecode and pyc format. The second word, conceptually the new word,
will be a bit field. The interpretation of the rest of the header and
invalidation behavior of the pyc depends on the contents of the bit
field.

If the bit field is 0, the pyc is a traditional timestamp-based pyc.
I.e., the third and forth words will be the timestamp and file size
respectively, and invalidation will be done by comparing the metadata of
the source file with that in the header.

If the lowest bit of the bit field is set, the pyc is a hash-based pyc.
We call the second lowest bit the check_source flag. Following the bit
field is a 64-bit hash of the source file. We will use a SipHash with a
hardcoded key of the contents of the source file. Another fast hash like
MD5 or BLAKE2 would also work. We choose SipHash because Python already
has a builtin implementation of it from PEP 456, although an interface
that allows picking the SipHash key must be exposed to Python. Security
of the hash is not a concern, though we pass over completely-broken
hashes like MD5 to ease auditing of Python in controlled environments.

When Python encounters a hash-based pyc, its behavior depends on the
setting of the check_source flag. If the check_source flag is set,
Python will determine the validity of the pyc by hashing the source file
and comparing the hash with the expected hash in the pyc. If the pyc
needs to be regenerated, it will be regenerated as a hash-based pyc
again with the check_source flag set.

For hash-based pycs with the check_source unset, Python will simply load
the pyc without checking the hash of the source file. The expectation in
this case is that some external system (e.g., the local Linux
distribution’s package manager) is responsible for keeping pycs up to
date, so Python itself doesn’t have to check. Even when validation is
disabled, the hash field should be set correctly, so out-of-band
consistency checkers can verify the up-to-dateness of the pyc. Note also
that the PEP 3147 edict that pycs without corresponding source files not
be loaded will still be enforced for hash-based pycs.

The programmatic APIs of py_compile and compileall will support
generation of hash-based pycs. Principally, py_compile will define a new
enumeration corresponding to all the available pyc invalidation modules:

    class PycInvalidationMode(Enum):
        TIMESTAMP
        CHECKED_HASH
        UNCHECKED_HASH

py_compile.compile, compileall.compile_dir, and compileall.compile_file
will all gain an invalidation_mode parameter, which accepts a value of
the PycInvalidationMode enumeration.

The compileall tool will be extended with a command new option,
--invalidation-mode to generate hash-based pycs with and without the
check_source bit set. --invalidation-mode will be a tristate option
taking values timestamp (the default), checked-hash, and unchecked-hash
corresponding to the values of PycInvalidationMode.

importlib.util will be extended with a source_hash(source) function that
computes the hash used by the pyc writing code for a bytestring source.

Runtime configuration of hash-based pyc invalidation will be facilitated
by a new --check-hash-based-pycs interpreter option. This is a tristate
option, which may take 3 values: default, always, and never. The default
value, default, means the check_source flag in hash-based pycs
determines invalidation as described above. always causes the
interpreter to hash the source file for invalidation regardless of value
of check_source bit. never causes the interpreter to always assume
hash-based pycs are valid. When --check-hash-based-pycs=never is in
effect, unchecked hash-based pycs will be regenerated as unchecked
hash-based pycs. Timestamp-based pycs are unaffected by
--check-hash-based-pycs.

References

Credits

The author would like to thank Gregory P. Smith, Christian Heimes, and
Steve Dower for useful conversations on the topic of this PEP.

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 80 coding: utf-8 End:

[1] http://benno.id.au/blog/2013/01/15/python-determinism

[2] http://bugzilla.opensuse.org/show_bug.cgi?id=1049186