PEP: 488 Title: Elimination of PYO files Version: $Revision$
Last-Modified: $Date$ Author: Brett Cannon <brett@python.org> Status:
Final Type: Standards Track Content-Type: text/x-rst Created:
20-Feb-2015 Python-Version: 3.5 Post-History: 06-Mar-2015, 13-Mar-2015,
20-Mar-2015

Abstract

This PEP proposes eliminating the concept of PYO files from Python. To
continue the support of the separation of bytecode files based on their
optimization level, this PEP proposes extending the PYC file name to
include the optimization level in the bytecode repository directory when
there are optimizations applied.

Rationale

As of today, bytecode files come in two flavours: PYC and PYO. A PYC
file is the bytecode file generated and read from when no optimization
level is specified at interpreter startup (i.e., -O is not specified). A
PYO file represents the bytecode file that is read/written when any
optimization level is specified (i.e., when -O or -OO is specified).
This means that while PYC files clearly delineate the optimization level
used when they were generated -- namely no optimizations beyond the
peepholer -- the same is not true for PYO files. To put this in terms of
optimization levels and the file extension:

-   0: .pyc
-   1 (-O): .pyo
-   2 (-OO): .pyo

The reuse of the .pyo file extension for both level 1 and 2
optimizations means that there is no clear way to tell what optimization
level was used to generate the bytecode file. In terms of reading PYO
files, this can lead to an interpreter using a mixture of optimization
levels with its code if the user was not careful to make sure all PYO
files were generated using the same optimization level (typically done
by blindly deleting all PYO files and then using the compileall module
to compile all-new PYO files[1]). This issue is only compounded when
people optimize Python code beyond what the interpreter natively
supports, e.g., using the astoptimizer project[2].

In terms of writing PYO files, the need to delete all PYO files every
time one either changes the optimization level they want to use or are
unsure of what optimization was used the last time PYO files were
generated leads to unnecessary file churn. The change proposed by this
PEP also allows for all optimization levels to be pre-compiled for
bytecode files ahead of time, something that is currently impossible
thanks to the reuse of the .pyo file extension for multiple optimization
levels.

As for distributing bytecode-only modules, having to distribute both
.pyc and .pyo files is unnecessary for the common use-case of code
obfuscation and smaller file deployments. This means that bytecode-only
modules will only load from their non-optimized .pyc file name.

Proposal

To eliminate the ambiguity that PYO files present, this PEP proposes
eliminating the concept of PYO files and their accompanying .pyo file
extension. To allow for the optimization level to be unambiguous as well
as to avoid having to regenerate optimized bytecode files needlessly in
the __pycache__ directory, the optimization level used to generate the
bytecode file will be incorporated into the bytecode file name. When no
optimization level is specified, the pre-PEP .pyc file name will be used
(i.e., no optimization level will be specified in the file name). For
example, a source file named foo.py in CPython 3.5 could have the
following bytecode files based on the interpreter's optimization level
(none, -O, and -OO):

-   0: foo.cpython-35.pyc (i.e., no change)
-   1: foo.cpython-35.opt-1.pyc
-   2: foo.cpython-35.opt-2.pyc

Currently bytecode file names are created by
importlib.util.cache_from_source(), approximately using the following
expression defined by PEP 3147[3],[4]:

    '{name}.{cache_tag}.pyc'.format(name=module_name,
                                    cache_tag=sys.implementation.cache_tag)

This PEP proposes to change the expression when an optimization level is
specified to:

    '{name}.{cache_tag}.opt-{optimization}.pyc'.format(
            name=module_name,
            cache_tag=sys.implementation.cache_tag,
            optimization=str(sys.flags.optimize))

The "opt-" prefix was chosen so as to provide a visual separator from
the cache tag. The placement of the optimization level after the cache
tag was chosen to preserve lexicographic sort order of bytecode file
names based on module name and cache tag which will not vary for a
single interpreter. The "opt-" prefix was chosen over "o" so as to be
somewhat self-documenting. The "opt-" prefix was chosen over "O" so as
to not have any confusion in case "0" was the leading prefix of the
optimization level.

A period was chosen over a hyphen as a separator so as to distinguish
clearly that the optimization level is not part of the interpreter
version as specified by the cache tag. It also lends to the use of the
period in the file name to delineate semantically different concepts.

For example, if -OO had been passed to the interpreter then instead of
importlib.cpython-35.pyo the file name would be
importlib.cpython-35.opt-2.pyc.

Leaving out the new opt- tag when no optimization level is applied
should increase backwards-compatibility. This is also more understanding
of Python implementations which have no use for optimization levels
(e.g., PyPy[5]).

It should be noted that this change in no way affects the performance of
import. Since the import system looks for a single bytecode file based
on the optimization level of the interpreter already and generates a new
bytecode file if it doesn't exist, the introduction of potentially more
bytecode files in the __pycache__ directory has no effect in terms of
stat calls. The interpreter will continue to look for only a single
bytecode file based on the optimization level and thus no increase in
stat calls will occur.

The only potentially negative result of this PEP is the probable
increase in the number of .pyc files and thus increase in storage use.
But for platforms where this is an issue, sys.dont_write_bytecode exists
to turn off bytecode generation so that it can be controlled offline.

Implementation

An implementation of this PEP is available[6].

importlib

As importlib.util.cache_from_source() is the API that exposes bytecode
file paths as well as being directly used by importlib, it requires the
most critical change. As of Python 3.4, the function's signature is:

    importlib.util.cache_from_source(path, debug_override=None)

This PEP proposes changing the signature in Python 3.5 to:

    importlib.util.cache_from_source(path, debug_override=None, *, optimization=None)

The introduced optimization keyword-only parameter will control what
optimization level is specified in the file name. If the argument is
None then the current optimization level of the interpreter will be
assumed (including no optimization). Any argument given for optimization
will be passed to str() and must have str.isalnum() be true, else
ValueError will be raised (this prevents invalid characters being used
in the file name). If the empty string is passed in for optimization
then the addition of the optimization will be suppressed, reverting to
the file name format which predates this PEP.

It is expected that beyond Python's own two optimization levels,
third-party code will use a hash of optimization names to specify the
optimization level, e.g.
hashlib.sha256(','.join(['no dead code', 'const folding'])).hexdigest().
While this might lead to long file names, it is assumed that most users
never look at the contents of the __pycache__ directory and so this
won't be an issue.

The debug_override parameter will be deprecated. A False value will be
equivalent to optimization=1 while a True value will represent
optimization='' (a None argument will continue to mean the same as for
optimization). A deprecation warning will be raised when debug_override
is given a value other than None, but there are no plans for the
complete removal of the parameter at this time (but removal will be no
later than Python 4).

The various module attributes for importlib.machinery which relate to
bytecode file suffixes will be updated[7]. The DEBUG_BYTECODE_SUFFIXES
and OPTIMIZED_BYTECODE_SUFFIXES will both be documented as deprecated
and set to the same value as BYTECODE_SUFFIXES (removal of
DEBUG_BYTECODE_SUFFIXES and OPTIMIZED_BYTECODE_SUFFIXES is not currently
planned, but will be not later than Python 4).

All various finders and loaders will also be updated as necessary, but
updating the previous mentioned parts of importlib should be all that is
required.

Rest of the standard library

The various functions exposed by the py_compile and compileall functions
will be updated as necessary to make sure they follow the new bytecode
file name semantics[8],[9]. The CLI for the compileall module will not
be directly affected (the -b flag will be implicit as it will no longer
generate .pyo files when -O is specified).

Compatibility Considerations

Any code directly manipulating bytecode files from Python 3.2 on will
need to consider the impact of this change on their code (prior to
Python 3.2 -- including all of Python 2 -- there was no __pycache__
which already necessitates bifurcating bytecode file handling support).
If code was setting the debug_override argument to
importlib.util.cache_from_source() then care will be needed if they want
the path to a bytecode file with an optimization level of 2. Otherwise
only code not using importlib.util.cache_from_source() will need
updating.

As for people who distribute bytecode-only modules (i.e., use a bytecode
file instead of a source file), they will have to choose which
optimization level they want their bytecode files to be since
distributing a .pyo file with a .pyc file will no longer be of any use.
Since people typically only distribute bytecode files for code
obfuscation purposes or smaller distribution size then only having to
distribute a single .pyc should actually be beneficial to these
use-cases. And since the magic number for bytecode files changed in
Python 3.5 to support PEP 465 there is no need to support pre-existing
.pyo files[10].

Rejected Ideas

Completely dropping optimization levels from CPython

Some have suggested that instead of accommodating the various
optimization levels in CPython, we should instead drop them entirely.
The argument is that significant performance gains would occur from
runtime optimizations through something like a JIT and not through
pre-execution bytecode optimizations.

This idea is rejected for this PEP as that ignores the fact that there
are people who do find the pre-existing optimization levels for CPython
useful. It also assumes that no other Python interpreter would find what
this PEP proposes useful.

Alternative formatting of the optimization level in the file name

Using the "opt-" prefix and placing the optimization level between the
cache tag and file extension is not critical. All options which have
been considered are:

-   importlib.cpython-35.opt-1.pyc
-   importlib.cpython-35.opt1.pyc
-   importlib.cpython-35.o1.pyc
-   importlib.cpython-35.O1.pyc
-   importlib.cpython-35.1.pyc
-   importlib.cpython-35-O1.pyc
-   importlib.O1.cpython-35.pyc
-   importlib.o1.cpython-35.pyc
-   importlib.1.cpython-35.pyc

These were initially rejected either because they would change the sort
order of bytecode files, possible ambiguity with the cache tag, or were
not self-documenting enough. An informal poll was taken and people
clearly preferred the formatting proposed by the PEP[11]. Since this
topic is non-technical and of personal choice, the issue is considered
solved.

Embedding the optimization level in the bytecode metadata

Some have suggested that rather than embedding the optimization level of
bytecode in the file name that it be included in the file's metadata
instead. This would mean every interpreter had a single copy of bytecode
at any time. Changing the optimization level would thus require
rewriting the bytecode, but there would also only be a single file to
care about.

This has been rejected due to the fact that Python is often installed as
a root-level application and thus modifying the bytecode file for
modules in the standard library are always possible. In this situation
integrators would need to guess at what a reasonable optimization level
was for users for any/all situations. By allowing multiple optimization
levels to co-exist simultaneously it frees integrators from having to
guess what users want and allows users to utilize the optimization level
they want.

References

Copyright

This document has been placed in the public domain.

[1] The compileall module
(https://docs.python.org/3.5/library/compileall.html)

[2] The astoptimizer project
(https://web.archive.org/web/20150909225454/https://pypi.python.org/pypi/astoptimizer)

[3] importlib.util.cache_from_source()
(https://docs.python.org/3.5/library/importlib.html#importlib.util.cache_from_source)

[4] Implementation of importlib.util.cache_from_source() from CPython
3.4.3rc1
(https://github.com/python/cpython/blob/e55181f517bbfc875065ce86ed3e05cf0e0246fa/Lib/importlib/_bootstrap.py#L437)

[5] The PyPy Project (https://www.pypy.org/)

[6] Implementation of PEP 488
(https://github.com/python/cpython/issues/67919)

[7] The importlib.machinery module
(https://docs.python.org/3.5/library/importlib.html#module-importlib.machinery)

[8] The py_compile module
(https://docs.python.org/3.5/library/compileall.html)

[9] The compileall module
(https://docs.python.org/3.5/library/compileall.html)

[10] importlib.util.MAGIC_NUMBER
(https://docs.python.org/3.5/library/importlib.html#importlib.util.MAGIC_NUMBER)

[11] Informal poll of file name format options on Google+
(https://web.archive.org/web/20160925163500/https://plus.google.com/+BrettCannon/posts/fZynLNwHWGm)