PEP: 524 Title: Make os.urandom() blocking on Linux Version: $Revision$
Last-Modified: $Date$ Author: Victor Stinner <vstinner@python.org>
Status: Final Type: Standards Track Content-Type: text/x-rst Created:
20-Jun-2016 Python-Version: 3.6

Abstract

Modify os.urandom() to block on Linux 3.17 and newer until the OS
urandom is initialized to increase the security.

Add also a new os.getrandom() function (for Linux and Solaris) to be
able to choose how to handle when os.urandom() is going to block on
Linux.

The bug

Original bug

Python 3.5.0 was enhanced to use the new getrandom() syscall introduced
in Linux 3.17 and Solaris 11.3. The problem is that users started to
complain that Python 3.5 blocks at startup on Linux in virtual machines
and embedded devices: see issues #25420 and #26839.

On Linux, getrandom(0) blocks until the kernel initialized urandom with
128 bits of entropy. The issue #25420 describes a Linux build platform
blocking at import random. The issue #26839 describes a short Python
script used to compute a MD5 hash, systemd-cron, script called very
early in the init process. The system initialization blocks on this
script which blocks on getrandom(0) to initialize Python.

The Python initialization requires random bytes to implement a
counter-measure against the hash denial-of-service (hash DoS), see:

-   Issue #13703: Hash collision security issue
-   PEP 456: Secure and interchangeable hash algorithm
    <456>

Importing the random module creates an instance of random.Random:
random._inst. On Python 3.5, random.Random constructor reads 2500 bytes
from os.urandom() to seed a Mersenne Twister RNG (random number
generator).

Other platforms may be affected by this bug, but in practice, only Linux
systems use Python scripts to initialize the system.

Status in Python 3.5.2

Python 3.5.2 behaves like Python 2.7 and Python 3.4. If the system
urandom is not initialized, the startup does not block, but os.urandom()
can return low-quality entropy (even it is not easily guessable).

Use Cases

The following use cases are used to help to choose the right compromise
between security and practicability.

Use Case 1: init script

Use a Python 3 script to initialize the system, like systemd-cron. If
the script blocks, the system initialize is stuck too. The issue #26839
is a good example of this use case.

Use case 1.1: No secret needed

If the init script doesn't have to generate any secure secret, this use
case is already handled correctly in Python 3.5.2: Python startup
doesn't block on system urandom anymore.

Use case 1.2: Secure secret required

If the init script has to generate a secure secret, there is no safe
solution.

Falling back to weak entropy is not acceptable, it would reduce the
security of the program.

Python cannot produce itself secure entropy, it can only wait until
system urandom is initialized. But in this use case, the whole system
initialization is blocked by this script, so the system fails to boot.

The real answer is that the system initialization must not be blocked by
such script. It is ok to start the script very early at system
initialization, but the script may blocked a few seconds until it is
able to generate the secret.

Reminder: in some cases, the initialization of the system urandom never
occurs and so programs waiting for system urandom blocks forever.

Use Case 2: Web server

Run a Python 3 web server serving web pages using HTTP and HTTPS
protocols. The server is started as soon as possible.

The first target of the hash DoS attack was web server: it's important
that the hash secret cannot be easily guessed by an attacker.

If serving a web page needs a secret to create a cookie, create an
encryption key, ..., the secret must be created with good entropy:
again, it must be hard to guess the secret.

A web server requires security. If a choice must be made between
security and running the server with weak entropy, security is more
important. If there is no good entropy: the server must block or fail
with an error.

The question is if it makes sense to start a web server on a host before
system urandom is initialized.

The issues #25420 and #26839 are restricted to the Python startup, not
to generate a secret before the system urandom is initialized.

Fix system urandom

Load entropy from disk at boot

Collecting entropy can take up to several minutes. To accelerate the
system initialization, operating systems store entropy on disk at
shutdown, and then reload entropy from disk at the boot.

If a system collects enough entropy at least once, the system urandom
will be initialized quickly, as soon as the entropy is reloaded from
disk.

Virtual machines

Virtual machines don't have a direct access to the hardware and so have
less sources of entropy than bare metal. A solution is to add a
virtio-rng device to pass entropy from the host to the virtual machine.

Embedded devices

A solution for embedded devices is to plug an hardware RNG.

For example, Raspberry Pi have an hardware RNG but it's not used by
default. See: Hardware RNG on Raspberry Pi.

Denial-of-service when reading random

Don't use /dev/random but /dev/urandom

The /dev/random device should only used for very specific use cases.
Reading from /dev/random on Linux is likely to block. Users don't like
when an application blocks longer than 5 seconds to generate a secret.
It is only expected for specific cases like generating explicitly an
encryption key.

When the system has no available entropy, choosing between blocking
until entropy is available or falling back on lower quality entropy is a
matter of compromise between security and practicability. The choice
depends on the use case.

On Linux, /dev/urandom is secure, it should be used instead of
/dev/random. See Myths about /dev/urandom by Thomas Hühn: "Fact:
/dev/urandom is the preferred source of cryptographic randomness on
UNIX-like systems"

getrandom(size, 0) can block forever on Linux

The origin of the Python issue #26839 is the Debian bug report #822431:
in fact, getrandom(size, 0) blocks forever on the virtual machine. The
system succeeded to boot because systemd killed the blocked process
after 90 seconds.

Solutions like Load entropy from disk at boot reduces the risk of this
bug.

Rationale

On Linux, reading the /dev/urandom can return "weak" entropy before
urandom is fully initialized, before the kernel collected 128 bits of
entropy. Linux 3.17 adds a new getrandom() syscall which allows to block
until urandom is initialized.

On Python 3.5.2, os.urandom() uses the getrandom(size, GRND_NONBLOCK),
but falls back on reading the non-blocking /dev/urandom if
getrandom(size, GRND_NONBLOCK) fails with EAGAIN.

Security experts promotes os.urandom() to generate cryptographic keys
because it is implemented with a Cryptographically secure pseudo-random
number generator (CSPRNG). By the way, os.urandom() is preferred over
ssl.RAND_bytes() for different reasons.

This PEP proposes to modify os.urandom() to use getrandom() in blocking
mode to not return weak entropy, but also ensure that Python will not
block at startup.

Changes

Make os.urandom() blocking on Linux

All changes described in this section are specific to the Linux
platform.

Changes:

-   Modify os.urandom() to block until system urandom is initialized:
    os.urandom() (C function _PyOS_URandom()) is modified to always call
    getrandom(size, 0) (blocking mode) on Linux and Solaris.
-   Add a new private _PyOS_URandom_Nonblocking() function: try to call
    getrandom(size, GRND_NONBLOCK) on Linux and Solaris, but falls back
    on reading /dev/urandom if it fails with EAGAIN.
-   Initialize hash secret from non-blocking system urandom:
    _PyRandom_Init() is modified to call _PyOS_URandom_Nonblocking().
-   random.Random constructor now uses non-blocking system urandom: it
    is modified to use internally the new _PyOS_URandom_Nonblocking()
    function to seed the RNG.

Add a new os.getrandom() function

A new os.getrandom(size, flags=0) function is added: use getrandom()
syscall on Linux and getrandom() C function on Solaris.

The function comes with 2 new flags:

-   os.GRND_RANDOM: read bytes from /dev/random rather than reading
    /dev/urandom
-   os.GRND_NONBLOCK: raise a BlockingIOError if os.getrandom() would
    block

The os.getrandom() is a thin wrapper on the getrandom() syscall/C
function and so inherit of its behaviour. For example, on Linux, it can
return less bytes than requested if the syscall is interrupted by a
signal.

Examples using os.getrandom()

Best-effort RNG

Example of a portable non-blocking RNG function: try to get random bytes
from the OS urandom, or fallback on the random module.

    def best_effort_rng(size):
        # getrandom() is only available on Linux and Solaris
        if not hasattr(os, 'getrandom'):
            return os.urandom(size)

        result = bytearray()
        try:
            # need a loop because getrandom() can return less bytes than
            # requested for different reasons
            while size:
                data = os.getrandom(size, os.GRND_NONBLOCK)
                result += data
                size -= len(data)
        except BlockingIOError:
            # OS urandom is not initialized yet:
            # fallback on the Python random module
            data = bytes(random.randrange(256) for byte in range(size))
            result += data
        return bytes(result)

This function can block in theory on a platform where os.getrandom() is
not available but os.urandom() can block.

wait_for_system_rng()

Example of function waiting timeout seconds until the OS urandom is
initialized on Linux or Solaris:

    def wait_for_system_rng(timeout, interval=1.0):
        if not hasattr(os, 'getrandom'):
            return

        deadline = time.monotonic() + timeout
        while True:
            try:
                os.getrandom(1, os.GRND_NONBLOCK)
            except BlockingIOError:
                pass
            else:
                return

            if time.monotonic() > deadline:
                raise Exception('OS urandom not initialized after %s seconds'
                                % timeout)

            time.sleep(interval)

This function is not portable. For example, os.urandom() can block on
FreeBSD in theory, at the early stage of the system initialization.

Create a best-effort RNG

Simpler example to create a non-blocking RNG on Linux: choose between
Random.SystemRandom and Random.Random depending if getrandom(size) would
block.

    def create_nonblocking_random():
        if not hasattr(os, 'getrandom'):
            return random.Random()

        try:
            os.getrandom(1, os.GRND_NONBLOCK)
        except BlockingIOError:
            return random.Random()
        else:
            return random.SystemRandom()

This function is not portable. For example, random.SystemRandom can
block on FreeBSD in theory, at the early stage of the system
initialization.

Alternative

Leave os.urandom() unchanged, add os.getrandom()

os.urandom() remains unchanged: never block, but it can return weak
entropy if system urandom is not initialized yet.

Only add the new os.getrandom() function (wrapper to the getrandom()
syscall/C function).

The secrets.token_bytes() function should be used to write portable
code.

The problem with this change is that it expects that users understand
well security and know well each platforms. Python has the tradition of
hiding "implementation details". For example, os.urandom() is not a thin
wrapper to the /dev/urandom device: it uses CryptGenRandom() on Windows,
it uses getentropy() on OpenBSD, it tries getrandom() on Linux and
Solaris or falls back on reading /dev/urandom. Python already uses the
best available system RNG depending on the platform.

This PEP does not change the API:

-   os.urandom(), random.SystemRandom and secrets for security
-   random module (except random.SystemRandom) for all other usages

Raise BlockingIOError in os.urandom()

Proposition

PEP 522: Allow BlockingIOError in security sensitive APIs on Linux
<522>.

Python should not decide for the developer how to handle The bug:
raising immediately a BlockingIOError if os.urandom() is going to block
allows developers to choose how to handle this case:

-   catch the exception and falls back to a non-secure entropy source:
    read /dev/urandom on Linux, use the Python random module (which is
    not secure at all), use time, use process identifier, etc.
-   don't catch the error, the whole program fails with this fatal
    exception

More generally, the exception helps to notify when sometimes goes wrong.
The application can emit a warning when it starts to wait for
os.urandom().

Criticism

For the use case 2 (web server), falling back on non-secure entropy is
not acceptable. The application must handle BlockingIOError: poll
os.urandom() until it completes. Example:

    def secret(n=16):
        try:
            return os.urandom(n)
        except BlockingIOError:
            pass

        print("Wait for system urandom initialization: move your "
              "mouse, use your keyboard, use your disk, ...")
        while 1:
            # Avoid busy-loop: sleep 1 ms
            time.sleep(0.001)
            try:
                return os.urandom(n)
            except BlockingIOError:
                pass

For correctness, all applications which must generate a secure secret
must be modified to handle BlockingIOError even if The bug is unlikely.

The case of applications using os.urandom() but don't really require
security is not well defined. Maybe these applications should not use
os.urandom() at the first place, but always the non-blocking random
module. If os.urandom() is used for security, we are back to the use
case 2 described above: Use Case 2: Web server. If a developer doesn't
want to drop os.urandom(), the code should be modified. Example:

    def almost_secret(n=16):
        try:
            return os.urandom(n)
        except BlockingIOError:
            return bytes(random.randrange(256) for byte in range(n))

The question is if The bug is common enough to require that so many
applications have to be modified.

Another simpler choice is to refuse to start before the system urandom
is initialized:

    def secret(n=16):
        try:
            return os.urandom(n)
        except BlockingIOError:
            print("Fatal error: the system urandom is not initialized")
            print("Wait a bit, and rerun the program later.")
            sys.exit(1)

Compared to Python 2.7, Python 3.4 and Python 3.5.2 where os.urandom()
never blocks nor raise an exception on Linux, such behaviour change can
be seen as a major regression.

Add an optional block parameter to os.urandom()

See the issue #27250: Add os.urandom_block().

Add an optional block parameter to os.urandom(). The default value may
be True (block by default) or False (non-blocking).

The first technical issue is to implement os.urandom(block=False) on all
platforms. Only Linux 3.17 (and newer) and Solaris 11.3 (and newer) have
a well defined non-blocking API (getrandom(size, GRND_NONBLOCK)).

As Raise BlockingIOError in os.urandom(), it doesn't seem worth it to
make the API more complex for a theoretical (or at least very rare) use
case.

As Leave os.urandom() unchanged, add os.getrandom(), the problem is that
it makes the API more complex and so more error-prone.

Acceptance

The PEP was accepted on 2016-08-08 by Guido van Rossum.

Annexes

Operating system random functions

os.urandom() uses the following functions:

-   OpenBSD: getentropy() (OpenBSD 5.6)
-   Linux: getrandom() (Linux 3.17) -- see also A system call for random
    numbers: getrandom()
-   Solaris: getentropy(), getrandom() (both need Solaris 11.3)
-   UNIX, BSD: /dev/urandom, /dev/random
-   Windows: CryptGenRandom() (Windows XP)

On Linux, commands to get the status of /dev/random (results are number
of bytes):

    $ cat /proc/sys/kernel/random/entropy_avail
    2850
    $ cat /proc/sys/kernel/random/poolsize
    4096

Why using os.urandom()?

Since os.urandom() is implemented in the kernel, it doesn't have issues
of user-space RNG. For example, it is much harder to get its state. It
is usually built on a CSPRNG, so even if its state is "stolen", it is
hard to compute previously generated numbers. The kernel has a good
knowledge of entropy sources and feed regularly the entropy pool.

That's also why os.urandom() is preferred over ssl.RAND_bytes().

Copyright

This document has been placed in the public domain.