PEP: 456 Title: Secure and interchangeable hash algorithm Version:
$Revision$ Last-Modified: $Date$ Author: Christian Heimes
<christian@python.org> BDFL-Delegate: Alyssa Coghlan Status: Final Type:
Standards Track Content-Type: text/x-rst Created: 27-Sep-2013
Python-Version: 3.4 Post-History: 06-Oct-2013, 14-Nov-2013, 20-Nov-2013
Resolution:
https://mail.python.org/pipermail/python-dev/2013-November/130400.html

Abstract

This PEP proposes SipHash as default string and bytes hash algorithm to
properly fix hash randomization once and for all. It also proposes
modifications to Python's C code in order to unify the hash code and to
make it easily interchangeable.

Rationale

Despite the last attempt [issue13703] CPython is still vulnerable to
hash collision DoS attacks [29c3] [issue14621]. The current hash
algorithm and its randomization is not resilient against attacks. Only a
proper cryptographic hash function prevents the extraction of secret
randomization keys. Although no practical attack against a Python-based
service has been seen yet, the weakness has to be fixed. Jean-Philippe
Aumasson and Daniel J. Bernstein have already shown how the seed for the
current implementation can be recovered [poc].

Furthermore, the current hash algorithm is hard-coded and implemented
multiple times for bytes and three different Unicode representations
UCS1, UCS2 and UCS4. This makes it impossible for embedders to replace
it with a different implementation without patching and recompiling
large parts of the interpreter. Embedders may want to choose a more
suitable hash function.

Finally the current implementation code does not perform well. In the
common case it only processes one or two bytes per cycle. On a modern
64-bit processor the code can easily be adjusted to deal with eight
bytes at once.

This PEP proposes three major changes to the hash code for strings and
bytes:

-   SipHash [sip] is introduced as default hash algorithm. It is fast
    and small despite its cryptographic properties. Due to the fact that
    it was designed by well known security and crypto experts, it is
    safe to assume that its secure for the near future.
-   The existing FNV code is kept for platforms without a 64-bit data
    type. The algorithm is optimized to process larger chunks per cycle.
-   Calculation of the hash of strings and bytes is moved into a single
    API function instead of multiple specialized implementations in
    Objects/object.c and Objects/unicodeobject.c. The function takes a
    void pointer plus length and returns the hash for it.
-   The algorithm can be selected at compile time. FNV is guaranteed to
    exist on all platforms. SipHash is available on the majority of
    modern systems.

Requirements for a hash function

-   It MUST be able to hash arbitrarily large blocks of memory from 1
    byte up to the maximum ssize_t value.
-   It MUST produce at least 32 bits on 32-bit platforms and at least 64
    bits on 64-bit platforms. (Note: Larger outputs can be compressed
    with e.g. v ^ (v >> 32).)
-   It MUST support hashing of unaligned memory in order to support
    hash(memoryview).
-   It is highly RECOMMENDED that the length of the input influences the
    outcome, so that hash(b'\00') != hash(b'\x00\x00').

The internal interface code between the hash function and the tp_hash
slots implements special cases for zero length input and a return value
of -1. An input of length 0 is mapped to hash value 0. The output -1 is
mapped to -2.

Current implementation with modified FNV

CPython currently uses a variant of the Fowler-Noll-Vo hash function
[fnv]. The variant is has been modified to reduce the amount and cost of
hash collisions for common strings. The first character of the string is
added twice, the first time with a bit shift of 7. The length of the
input string is XOR-ed to the final value. Both deviations from the
original FNV algorithm reduce the amount of hash collisions for short
strings.

Recently [issue13703] a random prefix and suffix were added as an
attempt to randomize the hash values. In order to protect the hash
secret the code still returns 0 for zero length input.

C code:

    Py_uhash_t x;
    Py_ssize_t len;
    /* p is either 1, 2 or 4 byte type */
    unsigned char *p;
    Py_UCS2 *p;
    Py_UCS4 *p;

    if (len == 0)
        return 0;
    x = (Py_uhash_t) _Py_HashSecret.prefix;
    x ^= (Py_uhash_t) *p << 7;
    for (i = 0; i < len; i++)
        x = (1000003 * x) ^ (Py_uhash_t) *p++;
    x ^= (Py_uhash_t) len;
    x ^= (Py_uhash_t) _Py_HashSecret.suffix;
    return x;

Which roughly translates to Python:

    def fnv(p):
        if len(p) == 0:
            return 0

        # bit mask, 2**32-1 or 2**64-1
        mask = 2 * sys.maxsize + 1

        x = hashsecret.prefix
        x = (x ^ (ord(p[0]) << 7)) & mask
        for c in p:
            x = ((1000003 * x) ^ ord(c)) & mask
        x = (x ^ len(p)) & mask
        x = (x ^ hashsecret.suffix) & mask

        if x == -1:
            x = -2

        return x

FNV is a simple multiply and XOR algorithm with no cryptographic
properties. The randomization was not part of the initial hash code, but
was added as counter measure against hash collision attacks as explained
in oCERT-2011-003 [ocert]. Because FNV is not a cryptographic hash
algorithm and the dict implementation is not fortified against side
channel analysis, the randomization secrets can be calculated by a
remote attacker. The author of this PEP strongly believes that the
nature of a non-cryptographic hash function makes it impossible to
conceal the secrets.

Examined hashing algorithms

The author of this PEP has researched several hashing algorithms that
are considered modern, fast and state-of-the-art.

SipHash

SipHash [sip] is a cryptographic pseudo random function with a 128-bit
seed and 64-bit output. It was designed by Jean-Philippe Aumasson and
Daniel J. Bernstein as a fast and secure keyed hash algorithm. It's used
by Ruby, Perl, OpenDNS, Rust, Redis, FreeBSD and more. The C reference
implementation has been released under CC0 license (public domain).

Quote from SipHash's site:

  SipHash is a family of pseudorandom functions (a.k.a. keyed hash
  functions) optimized for speed on short messages. Target applications
  include network traffic authentication and defense against
  hash-flooding DoS attacks.

siphash24 is the recommend variant with best performance. It uses 2
rounds per message block and 4 finalization rounds. Besides the
reference implementation several other implementations are available.
Some are single-shot functions, others use a Merkle–Damgård
construction-like approach with init, update and finalize functions.
Marek Majkowski C implementation csiphash [csiphash] defines the
prototype of the function. (Note: k is split up into two uint64_t):

    uint64_t siphash24(const void *src, unsigned long src_sz, const char k[16])

SipHash requires a 64-bit data type and is not compatible with pure C89
platforms.

MurmurHash

MurmurHash [murmur] is a family of non-cryptographic keyed hash function
developed by Austin Appleby. Murmur3 is the latest and fast variant of
MurmurHash. The C++ reference implementation has been released into
public domain. It features 32- or 128-bit output with a 32-bit seed.
(Note: The out parameter is a buffer with either 1 or 4 bytes.)

Murmur3's function prototypes are:

    void MurmurHash3_x86_32(const void *key, int len, uint32_t seed, void *out)

    void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out)

    void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out)

The 128-bit variants requires a 64-bit data type and are not compatible
with pure C89 platforms. The 32-bit variant is fully C89-compatible.

Aumasson, Bernstein and Boßlet have shown [sip] [ocert-2012-001] that
Murmur3 is not resilient against hash collision attacks. Therefore,
Murmur3 can no longer be considered as secure algorithm. It still may be
an alternative if hash collision attacks are of no concern.

CityHash

CityHash [city] is a family of non-cryptographic hash function developed
by Geoff Pike and Jyrki Alakuijala for Google. The C++ reference
implementation has been released under MIT license. The algorithm is
partly based on MurmurHash and claims to be faster. It supports 64- and
128-bit output with a 128-bit seed as well as 32-bit output without
seed.

The relevant function prototype for 64-bit CityHash with 128-bit seed
is:

    uint64 CityHash64WithSeeds(const char *buf, size_t len, uint64 seed0,
                               uint64 seed1)

CityHash also offers SSE 4.2 optimizations with CRC32 intrinsic for long
inputs. All variants except CityHash32 require 64-bit data types.
CityHash32 uses only 32-bit data types but it doesn't support seeding.

Like MurmurHash Aumasson, Bernstein and Boßlet have shown [sip] a
similar weakness in CityHash.

DJBX33A

DJBX33A is a very simple multiplication and addition algorithm by Daniel
J. Bernstein. It is fast and has low setup costs but it's not secure
against hash collision attacks. Its properties make it a viable choice
for small string hashing optimization.

Other

Crypto algorithms such as HMAC, MD5, SHA-1 or SHA-2 are too slow and
have high setup and finalization costs. For these reasons they are not
considered fit for this purpose. Modern AMD and Intel CPUs have AES-NI
(AES instruction set) [aes-ni] to speed up AES encryption. CMAC with
AES-NI might be a viable option but it's probably too slow for daily
operation. (testing required)

Conclusion

SipHash provides the best combination of speed and security. Developers
of other prominent projects have came to the same conclusion.

Small string optimization

Hash functions like SipHash24 have a costly initialization and
finalization code that can dominate speed of the algorithm for very
short strings. On the other hand, Python calculates the hash value of
short strings quite often. A simple and fast function for especially for
hashing of small strings can make a measurable impact on performance.
For example, these measurements were taken during a run of Python's
regression tests. Additional measurements of other code have shown a
similar distribution.

+-------+--------------+---------+
| bytes | hash() calls | portion |
+=======+==============+=========+
| 1     |   18709      |   0.2%  |
+-------+--------------+---------+
| 2     |   737480     |   9.5%  |
+-------+--------------+---------+
| 3     |   636178     |   17.6% |
+-------+--------------+---------+
| 4     |   1518313    |   36.7% |
+-------+--------------+---------+
| 5     |   643022     |   44.9% |
+-------+--------------+---------+
| 6     |   770478     |   54.6% |
+-------+--------------+---------+
| 7     |   525150     |   61.2% |
+-------+--------------+---------+
| 8     |   304873     |   65.1% |
+-------+--------------+---------+
| 9     |   297272     |   68.8% |
+-------+--------------+---------+
| 10    |   68191      |   69.7% |
+-------+--------------+---------+
| 11    |   1388484    |   87.2% |
+-------+--------------+---------+
| 12    |   480786     |   93.3% |
+-------+--------------+---------+
| 13    |   52730      |   93.9% |
+-------+--------------+---------+
| 14    |   65309      |   94.8% |
+-------+--------------+---------+
| 15    |   44245      |   95.3% |
+-------+--------------+---------+
| 16    |   85643      |   96.4% |
+-------+--------------+---------+
| Total |   7921678    |         |
+-------+--------------+---------+

However a fast function like DJBX33A is not as secure as SipHash24. A
cutoff at about 5 to 7 bytes should provide a decent safety margin and
speed up at the same time. The PEP's reference implementation provides
such a cutoff with Py_HASH_CUTOFF. The optimization is disabled by
default for several reasons. For one the security implications are
unclear yet and should be thoroughly studied before the optimization is
enabled by default. Secondly the performance benefits vary. On 64 bit
Linux system with Intel Core i7 multiple runs of Python's benchmark
suite [pybench] show an average speedups between 3% and 5% for
benchmarks such as django_v2, mako and etree with a cutoff of 7.
Benchmarks with X86 binaries and Windows X86_64 builds on the same
machine are a bit slower with small string optimization.

The state of small string optimization will be assessed during the beta
phase of Python 3.4. The feature will either be enabled with appropriate
values or the code will be removed before beta 2 is released.

C API additions

All C API extension modifications are not part of the stable API.

hash secret

The _Py_HashSecret_t type of Python 2.6 to 3.3 has two members with
either 32- or 64-bit length each. SipHash requires two 64-bit unsigned
integers as keys. The typedef will be changed to a union with a
guaranteed size of 24 bytes on all architectures. The union provides a
128 bit random key for SipHash24 and FNV as well as an additional value
of 64 bit for the optional small string optimization and pyexpat seed.
The additional 64 bit seed ensures that pyexpat or small string
optimization cannot reveal bits of the SipHash24 seed.

memory layout on 64 bit systems:

    cccccccc cccccccc cccccccc  uc -- unsigned char[24]
    pppppppp ssssssss ........  fnv -- two Py_hash_t
    k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T
    ........ ........ ssssssss  djbx33a -- 16 bytes padding + one Py_hash_t
    ........ ........ eeeeeeee  pyexpat XML hash salt

memory layout on 32 bit systems:

    cccccccc cccccccc cccccccc  uc -- unsigned char[24]
    ppppssss ........ ........  fnv -- two Py_hash_t
    k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T (if available)
    ........ ........ ssss....  djbx33a -- 16 bytes padding + one Py_hash_t
    ........ ........ eeee....  pyexpat XML hash salt

new type definition:

    typedef union {
        /* ensure 24 bytes */
        unsigned char uc[24];
        /* two Py_hash_t for FNV */
        struct {
            Py_hash_t prefix;
            Py_hash_t suffix;
        } fnv;
    #ifdef PY_UINT64_T
        /* two uint64 for SipHash24 */
        struct {
            PY_UINT64_T k0;
            PY_UINT64_T k1;
        } siphash;
    #endif
        /* a different (!) Py_hash_t for small string optimization */
        struct {
            unsigned char padding[16];
            Py_hash_t suffix;
        } djbx33a;
        struct {
            unsigned char padding[16];
            Py_hash_t hashsalt;
        } expat;
    } _Py_HashSecret_t;
    PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;

_Py_HashSecret_t is initialized in Python/random.c:_PyRandom_Init()
exactly once at startup.

hash function definition

Implementation:

    typedef struct {
        /* function pointer to hash function, e.g. fnv or siphash24 */
        Py_hash_t (*const hash)(const void *, Py_ssize_t);
        const char *name;       /* name of the hash algorithm and variant */
        const int hash_bits;    /* internal size of hash value */
        const int seed_bits;    /* size of seed input */
    } PyHash_FuncDef;

    PyAPI_FUNC(PyHash_FuncDef*) PyHash_GetFuncDef(void);

autoconf

A new test is added to the configure script. The test sets
HAVE_ALIGNED_REQUIRED, when it detects a platform, that requires aligned
memory access for integers. Must current platforms such as X86, X86_64
and modern ARM don't need aligned data.

A new option --with-hash-algorithm enables the user to select a hash
algorithm in the configure step.

hash function selection

The value of the macro Py_HASH_ALGORITHM defines which hash algorithm is
used internally. It may be set to any of the three values
Py_HASH_SIPHASH24, Py_HASH_FNV or Py_HASH_EXTERNAL. If Py_HASH_ALGORITHM
is not defined at all, then the best available algorithm is selected. On
platforms which don't require aligned memory access
(HAVE_ALIGNED_REQUIRED not defined) and an unsigned 64 bit integer type
PY_UINT64_T, SipHash24 is used. On strict C89 platforms without a 64 bit
data type, or architectures such as SPARC, FNV is selected as fallback.
A hash algorithm can be selected with an autoconf option, for example
./configure --with-hash-algorithm=fnv.

The value Py_HASH_EXTERNAL allows 3rd parties to provide their own
implementation at compile time.

Implementation:

    #if Py_HASH_ALGORITHM == Py_HASH_EXTERNAL
    extern PyHash_FuncDef PyHash_Func;
    #elif Py_HASH_ALGORITHM == Py_HASH_SIPHASH24
    static PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
    #elif Py_HASH_ALGORITHM == Py_HASH_FNV
    static PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
                                         16 * sizeof(Py_hash_t)};
    #endif

Python API addition

sys module

The sys module already has a hash_info struct sequence. More fields are
added to the object to reflect the active hash algorithm and its
properties.

    sys.hash_info(width=64,
                  modulus=2305843009213693951,
                  inf=314159,
                  nan=0,
                  imag=1000003,
                  # new fields:
                  algorithm='siphash24',
                  hash_bits=64,
                  seed_bits=128,
                  cutoff=0)

Necessary modifications to C code

_Py_HashBytes() (Objects/object.c)

_Py_HashBytes is an internal helper function that provides the hashing
code for bytes, memoryview and datetime classes. It currently implements
FNV for unsigned char *.

The function is moved to Python/pyhash.c and modified to use the hash
function through PyHash_Func.hash(). The function signature is altered
to take a const void * as first argument. _Py_HashBytes also takes care
of special cases: it maps zero length input to 0 and return value of -1
to -2.

bytes_hash() (Objects/bytesobject.c)

bytes_hash uses _Py_HashBytes to provide the tp_hash slot function for
bytes objects. The function will continue to use _Py_HashBytes but
without a type cast.

memory_hash() (Objects/memoryobject.c)

memory_hash provides the tp_hash slot function for read-only memory
views if the original object is hashable, too. It's the only function
that has to support hashing of unaligned memory segments in the future.
The function will continue to use _Py_HashBytes but without a type cast.

unicode_hash() (Objects/unicodeobject.c)

unicode_hash provides the tp_hash slot function for unicode. Right now
it implements the FNV algorithm three times for unsigned char*, Py_UCS2
and Py_UCS4. A reimplementation of the function must take care to use
the correct length. Since the macro PyUnicode_GET_LENGTH returns the
length of the unicode string and not its size in octets, the length must
be multiplied with the size of the internal unicode kind:

    if (PyUnicode_READY(u) == -1)
        return -1;
    x = _Py_HashBytes(PyUnicode_DATA(u),
                      PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));

generic_hash() (Modules/_datetimemodule.c)

generic_hash acts as a wrapper around _Py_HashBytes for the tp_hash
slots of date, time and datetime types. timedelta objects are hashed by
their state (days, seconds, microseconds) and tzinfo objects are not
hashable. The data members of date, time and datetime types' struct are
not void* aligned. This can easily by fixed with memcpy()ing four to ten
bytes to an aligned buffer.

Performance

In general the PEP 456 code with SipHash24 is about as fast as the old
code with FNV. SipHash24 seems to make better use of modern compilers,
CPUs and large L1 cache. Several benchmarks show a small speed
improvement on 64 bit CPUs such as Intel Core i5 and Intel Core i7
processes. 32 bit builds and benchmarks on older CPUs such as an AMD
Athlon X2 are slightly slower with SipHash24. The performance increase
or decrease are so small that they should not affect any application
code.

The benchmarks were conducted on CPython default branch revision
b08868fd5994 and the PEP repository [pep-456-repos]. All upstream
changes were merged into the pep-456 branch. The "performance" CPU
governor was configured and almost all programs were stopped so the
benchmarks were able to utilize TurboBoost and the CPU caches as much as
possible. The raw benchmark results of multiple machines and platforms
are made available at [benchmarks].

Hash value distribution

A good distribution of hash values is important for dict and set
performance. Both SipHash24 and FNV take the length of the input into
account, so that strings made up entirely of NULL bytes don't have the
same hash value. The last bytes of the input tend to affect the least
significant bits of the hash value, too. That attribute reduces the
amount of hash collisions for strings with a common prefix.

Typical length

Serhiy Storchaka has shown in [issue16427] that a modified FNV
implementation with 64 bits per cycle is able to process long strings
several times faster than the current FNV implementation.

However, according to statistics [issue19183] a typical Python program
as well as the Python test suite have a hash ratio of about 50% small
strings between 1 and 6 bytes. Only 5% of the strings are larger than 16
bytes.

Grand Unified Python Benchmark Suite

Initial tests with an experimental implementation and the Grand Unified
Python Benchmark Suite have shown minimal deviations. The summarized
total runtime of the benchmark is within 1% of the runtime of an
unmodified Python 3.4 binary. The tests were run on an Intel i7-2860QM
machine with a 64-bit Linux installation. The interpreter was compiled
with GCC 4.7 for 64- and 32-bit.

More benchmarks will be conducted.

Backwards Compatibility

The modifications don't alter any existing API.

The output of hash() for strings and bytes are going to be different.
The hash values for ASCII Unicode and ASCII bytes will stay equal.

Alternative counter measures against hash collision DoS

Three alternative countermeasures against hash collisions were discussed
in the past, but are not subject of this PEP.

1.  Marc-Andre Lemburg has suggested that dicts shall count hash
    collisions. In case an insert operation causes too many collisions
    an exception shall be raised.
2.  Some applications (e.g. PHP) limit the amount of keys for GET and
    POST HTTP requests. The approach effectively leverages the impact of
    a hash collision attack. (XXX citation needed)
3.  Hash maps have a worst case of O(n) for insertion and lookup of
    keys. This results in a quadratic runtime during a hash collision
    attack. The introduction of a new and additional data structure with
    O(log n) worst case behavior would eliminate the root cause. A data
    structures like red-black-tree or prefix trees (trie [trie]) would
    have other benefits, too. Prefix trees with stringed keyed can
    reduce memory usage as common prefixes are stored within the tree
    structure.

Discussion

Pluggable

The first draft of this PEP made the hash algorithm pluggable at
runtime. It supported multiple hash algorithms in one binary to give the
user the possibility to select a hash algorithm at startup. The approach
was considered an unnecessary complication by several core committers
[pluggable]. Subsequent versions of the PEP aim for compile time
configuration.

Non-aligned memory access

The implementation of SipHash24 were criticized because it ignores the
issue of non-aligned memory and therefore doesn't work on architectures
that requires alignment of integer types. The PEP deliberately neglects
this special case and doesn't support SipHash24 on such platforms. It's
simply not considered worth the trouble until proven otherwise. All
major platforms like X86, X86_64 and ARMv6+ can handle unaligned memory
with minimal or even no speed impact. [alignmentmyth]

Almost every block is properly aligned anyway. At present bytes' and
str's data are always aligned. Only memoryviews can point to unaligned
blocks under rare circumstances. The PEP implementation is optimized and
simplified for the common case.

ASCII str / bytes hash collision

Since the implementation of PEP 393, bytes and ASCII text have the same
memory layout. Because of this the new hashing API will keep the
invariant:

    hash("ascii string") == hash(b"ascii string")

for ASCII string and ASCII bytes. Equal hash values result in a hash
collision and therefore cause a minor speed penalty for dicts and sets
with mixed keys. The cause of the collision could be removed by e.g.
subtracting 2 from the hash value of bytes. -2 because hash(b"") == 0
and -1 is reserved. The PEP doesn't change the hash value.

References

-   Issue 19183 [issue19183] contains a reference implementation.

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 70 coding: utf-8 End:

29c3

    http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html

aes-ni

    http://en.wikipedia.org/wiki/AES_instruction_set

alignmentmyth

    http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/

benchmarks

    https://bitbucket.org/tiran/pep-456-benchmarks/src

city

    http://code.google.com/p/cityhash/

csiphash

    https://github.com/majek/csiphash/

fnv

    http://en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function

issue13703

    http://bugs.python.org/issue13703

issue14621

    http://bugs.python.org/issue14621

issue16427

    http://bugs.python.org/issue16427

issue19183

    http://bugs.python.org/issue19183

murmur

    http://code.google.com/p/smhasher/

ocert

    http://www.nruns.com/_downloads/advisory28122011.pdf

ocert-2012-001

    http://www.ocert.org/advisories/ocert-2012-001.html

pep-456-repos

    http://hg.python.org/features/pep-456

pluggable

    https://mail.python.org/pipermail/python-dev/2013-October/129138.html

poc

    https://131002.net/siphash/poc.py

pybench

    http://hg.python.org/benchmarks/

sip

    https://131002.net/siphash/

trie

    http://en.wikipedia.org/wiki/Trie