PEP: 330 Title: Python Bytecode Verification Version: $Revision$
Last-Modified: $Date$ Author: Michel Pelletier
<michel@users.sourceforge.net> Status: Rejected Type: Standards Track
Content-Type: text/x-rst Created: 17-Jun-2004 Python-Version: 2.6
Post-History:

Abstract

If Python Virtual Machine (PVM) bytecode is not "well-formed" it is
possible to crash or exploit the PVM by causing various errors such as
under/overflowing the value stack or reading/writing into arbitrary
areas of the PVM program space. Most of these kinds of errors can be
eliminated by verifying that PVM bytecode does not violate a set of
simple constraints before execution.

This PEP proposes a set of constraints on the format and structure of
Python Virtual Machine (PVM) bytecode and provides an implementation in
Python of this verification process.

Pronouncement

Guido believes that a verification tool has some value. If someone wants
to add it to Tools/scripts, no PEP is required.

Such a tool may have value for validating the output from
"bytecodehacks" or from direct edits of PYC files. As security measure,
its value is somewhat limited because perfectly valid bytecode can still
do horrible things. That situation could change if the concept of
restricted execution were to be successfully resurrected.

Motivation

The Python Virtual Machine executes Python programs that have been
compiled from the Python language into a bytecode representation. The
PVM assumes that any bytecode being executed is "well-formed" with
regard to a number implicit constraints. Some of these constraints are
checked at run-time, but most of them are not due to the overhead they
would create.

When running in debug mode the PVM does do several run-time checks to
ensure that any particular bytecode cannot violate these constraints
that, to a degree, prevent bytecode from crashing or exploiting the
interpreter. These checks add a measurable overhead to the interpreter,
and are typically turned off in common use.

Bytecode that is not well-formed and executed by a PVM not running in
debug mode may create a variety of fatal and non-fatal errors.
Typically, ill-formed code will cause the PVM to seg-fault and cause the
OS to immediately and abruptly terminate the interpreter.

Conceivably, ill-formed bytecode could exploit the interpreter and allow
Python bytecode to execute arbitrary C-level machine instructions or to
modify private, internal data structures in the interpreter. If used
cleverly this could subvert any form of security policy an application
may want to apply to its objects.

Practically, it would be difficult for a malicious user to "inject"
invalid bytecode into a PVM for the purposes of exploitation, but not
impossible. Buffer overflow and memory overwrite attacks are commonly
understood, particularly when the exploit payload is transmitted
unencrypted over a network or when a file or network security permission
weakness is used as a foothold for further attacks.

Ideally, no bytecode should ever be allowed to read or write underlying
C-level data structures to subvert the operation of the PVM, whether the
bytecode was maliciously crafted or not. A simple pre-execution
verification step could ensure that bytecode cannot over/underflow the
value stack or access other sensitive areas of PVM program space at
run-time.

This PEP proposes several validation steps that should be taken on
Python bytecode before it is executed by the PVM so that it compiles
with static and structure constraints on its instructions and their
operands. These steps are simple and catch a large class of invalid
bytecode that can cause crashes. There is also some possibility that
some run-time checks can be eliminated up front by a verification pass.

There is, of course, no way to verify that bytecode is "completely
safe", for every definition of complete and safe. Even with bytecode
verification, Python programs can and most likely in the future will
seg-fault for a variety of reasons and continue to cause many different
classes of run-time errors, fatal or not. The verification step proposed
here simply plugs an easy hole that can cause a large class of fatal and
subtle errors at the bytecode level.

Currently, the Java Virtual Machine (JVM) verifies Java bytecode in a
way very similar to what is proposed here. The JVM Specification version
2[1], Sections 4.8 and 4.9 were therefore used as a basis for some of
the constraints explained below. Any Python bytecode verification
implementation at a minimum must enforce these constraints, but may not
be limited to them.

Static Constraints on Bytecode Instructions

1.  The bytecode string must not be empty. (len(co_code) > 0).
2.  The bytecode string cannot exceed a maximum size
    (len(co_code) < sizeof(unsigned char) - 1).
3.  The first instruction in the bytecode string begins at index 0.
4.  Only valid byte-codes with the correct number of operands can be in
    the bytecode string.

Static Constraints on Bytecode Instruction Operands

1.  The target of a jump instruction must be within the code boundaries
    and must fall on an instruction, never between an instruction and
    its operands.
2.  The operand of a LOAD_* instruction must be a valid index into its
    corresponding data structure.
3.  The operand of a STORE_* instruction must be a valid index into its
    corresponding data structure.

Structural Constraints between Bytecode Instructions

1.  Each instruction must only be executed with the appropriate number
    of arguments in the value stack, regardless of the execution path
    that leads to its invocation.
2.  If an instruction can be executed along several different execution
    paths, the value stack must have the same depth prior to the
    execution of the instruction, regardless of the path taken.
3.  At no point during execution can the value stack grow to a depth
    greater than that implied by co_stacksize.
4.  Execution never falls off the bottom of co_code.

Implementation

This PEP is the working document for a Python bytecode verification
implementation written in Python. This implementation is not used
implicitly by the PVM before executing any bytecode, but is to be used
explicitly by users concerned about possibly invalid bytecode with the
following snippet:

    import verify
    verify.verify(object)

The verify module provides a verify function which accepts the same kind
of arguments as dis.dis: classes, methods, functions, or code objects.
It verifies that the object's bytecode is well-formed according to the
specifications of this PEP.

If the code is well-formed the call to verify returns silently without
error. If an error is encountered, it throws a VerificationError whose
argument indicates the cause of the failure. It is up to the programmer
whether or not to handle the error in some way or execute the invalid
code regardless.

Phillip Eby has proposed a pseudo-code algorithm for bytecode stack
depth verification used by the reference implementation.

Verification Issues

This PEP describes only a small number of verifications. While
discussion and analysis will lead to many more, it is highly possible
that future verification may need to be done or custom, project-specific
verifications. For this reason, it might be desirable to add a
verification registration interface to the test implementation to
register future verifiers. The need for this is minimal since custom
verifiers can subclass and extend the current implementation for added
behavior.

Required Changes

Armin Rigo noted that several byte-codes will need modification in order
for their stack effect to be statically analyzed. These are END_FINALLY,
POP_BLOCK, and MAKE_CLOSURE. Armin and Guido have already agreed on how
to correct the instructions. Currently the Python implementation punts
on these instructions.

This PEP does not propose to add the verification step to the
interpreter, but only to provide the Python implementation in the
standard library for optional use. Whether or not this verification
procedure is translated into C, included with the PVM or enforced in any
way is left for future discussion.

References

Copyright

This document has been placed in the public domain.

[1] The Java Virtual Machine Specification 2nd Edition
http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html