PEP: 3127
Title: Integer Literal Support and Syntax
Version: $Revision$
Last-Modified: $Date$
Author: Patrick Maupin <pmaupin@gmail.com>
Discussions-To: python-3000@python.org
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 14-Mar-2007
Python-Version: 3.0
Post-History: 18-Mar-2007

Abstract

This PEP proposes changes to the Python core to rationalize the
treatment of string literal representations of integers in different
radices (bases). These changes are targeted at Python 3.0, but the
backward-compatible parts of the changes should be added to Python 2.6,
so that all valid 3.0 integer literals will also be valid in 2.6.

The proposal is that:

a)  octal literals must now be specified with a leading "0o" or "0O"
    instead of "0";
b)  binary literals are now supported via a leading "0b" or "0B"; and
c)  provision will be made for binary numbers in string formatting.
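As a quick illustration of the three points, here is a minimal sketch. It assumes a Python 3 interpreter in which the proposal has been implemented; the variable names are chosen only for this example.

```python
# (a) octal literals need an explicit "0o" / "0O" prefix
perms = 0o755
assert perms == 493
assert 0O755 == perms          # the uppercase prefix is accepted too

# (b) binary literals use a "0b" / "0B" prefix
flags = 0b1010
assert flags == 10
assert 0B1010 == flags

# (c) binary output through PEP 3101 style formatting
binary_text = format(flags, "b")
```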

Motivation

This PEP was motivated by two different issues:

-   The default octal representation of integers is silently confusing
    to people unfamiliar with C-like languages. It is extremely easy to
    inadvertently create an integer object with the wrong value, because
    '013' means 'decimal 11', not 'decimal 13', to the Python language
    itself, which is not the meaning that most humans would assign to
    this literal.
-   Some Python users have a strong desire for binary support in the
    language.

Specification

Grammar specification

The grammar will be changed. For Python 2.6, the changed and new token
definitions will be:

    integer        ::=     decimalinteger | octinteger | hexinteger |
                           bininteger | oldoctinteger

    octinteger     ::=     "0" ("o" | "O") octdigit+

    bininteger     ::=     "0" ("b" | "B") bindigit+

    oldoctinteger  ::=     "0" octdigit+

    bindigit       ::=     "0" | "1"

For Python 3.0, "oldoctinteger" will not be supported, and an exception
will be raised if a literal has a leading "0" and a second character
which is a digit.
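The token definitions can be approximated with regular expressions. This is only a sketch: the hexinteger and Python 3.0 decimalinteger patterns are not quoted in this PEP and are assumptions taken from the language reference, and is_integer_literal_3_0 is a hypothetical helper name.

```python
import re

# Changed and new tokens, transcribed from the grammar above.
OCTINTEGER = r"0[oO][0-7]+"
BININTEGER = r"0[bB][01]+"
OLDOCTINTEGER = r"0[0-7]+"            # accepted by Python 2.6 only

# Assumed (not quoted in this PEP): the unchanged tokens.
HEXINTEGER = r"0[xX][0-9a-fA-F]+"
DECIMALINTEGER_3_0 = r"(?:[1-9][0-9]*|0+)"

INTEGER_3_0 = re.compile(
    "(?:%s)" % "|".join(
        [OCTINTEGER, BININTEGER, HEXINTEGER, DECIMALINTEGER_3_0]
    )
)


def is_integer_literal_3_0(text):
    """Return True if text matches the sketched Python 3.0 integer grammar."""
    return INTEGER_3_0.fullmatch(text) is not None
```

Under these assumptions, "0o17", "0B101", and "0" match, while "010" matches only the oldoctinteger pattern and is therefore rejected by the 3.0 grammar.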

For both versions, this will require changes to PyLong_FromString as
well as the grammar.

The documentation will also have to be changed: grammar.txt and the
integer literal section of the reference manual.

PEP 306 should be checked for other issues, and that PEP should be
updated if the procedure described therein is insufficient.

int() specification

int(s, 0) will also match the new grammar definition.

This should happen automatically with the changes to PyLong_FromString
required for the grammar change.

Also the documentation for int() should be changed to explain that
int(s) operates identically to int(s, 10), and the word "guess" should
be removed from the description of int(s, 0).
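The difference between the two call forms can be sketched as follows, assuming a Python 3 interpreter; parse_literal is a hypothetical helper name used only here.

```python
def parse_literal(text):
    """Parse text the way an integer literal in source code is parsed."""
    return int(text, 0)


# Base 0 follows the literal grammar, including the new prefixes.
assert parse_literal("0o17") == 15
assert parse_literal("0b101") == 5
assert parse_literal("0x1f") == 31

# With no base (equivalent to base 10), leading zeros are plain decimal.
assert int("010") == 10
assert int("010", 10) == 10

# With base 0, an old-style octal string is rejected.
try:
    parse_literal("010")
except ValueError:
    old_octal_rejected = True
else:
    old_octal_rejected = False
```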

long() specification

For Python 2.6, the long() implementation and documentation should be
changed to reflect the new grammar.

Tokenizer exception handling

If an invalid token contains a leading "0", the exception error message
should be more informative than the current "SyntaxError: invalid
token". It should explain that decimal numbers may not have a leading
zero, and that octal numbers require an "o" after the leading zero.

int() exception handling

The ValueError raised for any call to int() with a string should at
least explicitly contain the base in the error message, e.g.:

    ValueError: invalid literal for base 8 int(): 09
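A small check of this behavior, assuming a Python 3 interpreter. The exact wording of the message differs between releases, so the sketch only looks for the base and the offending text.

```python
try:
    int("09", 8)          # "9" is not a valid octal digit
except ValueError as exc:
    message = str(exc)
else:
    message = ""
```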

oct() function

oct() should be updated to output '0o' in front of the octal digits (for
3.0, and 2.6 compatibility mode).
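For example, assuming a Python 3 interpreter, the output of oct() can be fed straight back through int(s, 0):

```python
octal_text = oct(8)
round_trip = int(octal_text, 0)   # the "0o" prefix is understood by base 0
```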

Output formatting

In 3.0, the string % operator alternate syntax for the 'o' option will
need to be updated to add '0o' in front, instead of '0'. In 2.6,
alternate octal formatting will continue to add only '0'. In neither 2.6
nor 3.0 will the % operator support binary output. This is because
binary output is already supported by PEP 3101 (str.format), which is
the preferred string formatting method.
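The following sketch shows how this looks in current CPython 3 releases, which is an assumption about the interpreter rather than something this PEP guarantees for every version.

```python
# Alternate octal form with the % operator adds "0o" in Python 3.
alt_octal = "%#o" % 8

# Binary output is available through PEP 3101 formatting.
plain_binary = format(5, "b")
prefixed_binary = "{0:#b}".format(5)

# The % operator on str has no binary conversion.
try:
    "%b" % 5
except ValueError:
    percent_b_supported = False
else:
    percent_b_supported = True
```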

Transition from 2.6 to 3.0

The 2to3 translator will have to insert 'o' into any octal string
literal.
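The rewrite itself is mechanical. The sketch below is not how 2to3 is implemented (2to3 works on a parse tree); it is a hypothetical regex-based helper, add_octal_prefix, that ignores string literals and comments and is meant only to show the transformation.

```python
import re

# Hypothetical pattern: a "0" followed by octal digits, not embedded in a
# longer name or number (so "1.07", "0x07", "10" and "0755L" are left alone).
_OLD_OCTAL = re.compile(r"(?<![\w.])0([0-7]+)(?![\w.])")


def add_octal_prefix(source):
    """Rewrite old-style octal literals such as 0755 to 0o755.

    Limitation: text inside string literals and comments is rewritten too.
    """
    return _OLD_OCTAL.sub(r"0o\1", source)


converted = add_octal_prefix("os.chmod(path, 0755)")
```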

The Py3K compatible option to Python 2.6 should cause attempts to use
oldoctinteger literals to raise an exception.

Rationale

Most of the discussion on these issues occurred on the Python-3000
mailing list starting 14-Mar-2007, prompted by an observation that the
average human being would be completely mystified upon finding that
prepending a "0" to a string of digits changes the meaning of that digit
string entirely.

It was pointed out during this discussion that a similar, but shorter,
discussion on the subject occurred in January 2006, prompted by a
discovery of the same issue.

Background

For historical reasons, Python's string representation of integers in
different bases (radices), for string formatting and token literals,
borrows heavily from C.[1][2] Usage has shown that the historical method
of specifying an octal number is confusing, and also that it would be
nice to have additional support for binary literals.

Throughout this document, unless otherwise noted, discussions about the
string representation of integers relate to these features:

-   Literal integer tokens, as used by normal module compilation, by
    eval(), and by int(token, 0). (int(token) and int(token, 2-36) are
    not modified by this proposal.)
    -   Under 2.6, long() is treated the same as int()
-   Formatting of integers into strings, either via the % string
    operator or the new PEP 3101 advanced string formatting method.

It is presumed that:

-   All of these features should have an identical set of supported
    radices, for consistency.
-   Python source code syntax and int(mystring, 0) should continue to
    share identical behavior.

Removal of old octal syntax

This PEP proposes that the ability to specify an octal number by using a
leading zero will be removed from the language in Python 3.0 (and the
Python 3.0 preview mode of 2.6), and that a SyntaxError will be raised
whenever a leading "0" is immediately followed by another digit.

During the present discussion, it was almost universally agreed that:

    eval('010') == 8

should no longer be true, because that is confusing to new users. It was
also proposed that:

    eval('0010') == 10

should become true, but that is much more contentious, because it is so
inconsistent with usage in other computer languages that mistakes are
likely to be made.
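Under the rule this PEP adopts, both expressions are rejected rather than evaluated. A minimal sketch, assuming a Python 3 interpreter; evaluates is a hypothetical helper that returns None when the source is not valid syntax.

```python
def evaluates(source):
    """Return eval(source), or None if the source is a SyntaxError."""
    try:
        return eval(source)
    except SyntaxError:
        return None


new_octal = evaluates("0o10")    # explicit octal
old_octal = evaluates("010")     # old octal syntax
padded = evaluates("0010")       # zero-padded "decimal"
```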

Almost all currently popular computer languages, including C/C++, Java,
Perl, and JavaScript, treat a sequence of digits with a leading zero as
an octal number. Proponents of treating these numbers as decimal instead
have a very valid point -- as discussed in Supported radices, below, the
entire non-computer world uses decimal numbers almost exclusively. There
is ample anecdotal evidence that many people are dismayed and confused
if they are confronted with non-decimal radices.

However, in most situations, most people do not write gratuitous zeros
in front of their decimal numbers. The primary exception is when an
attempt is being made to line up columns of numbers. But since PEP 8
specifically discourages the use of spaces to try to align Python code,
one would suspect the same argument should apply to the use of leading
zeros for the same purpose.

Finally, although the email discussion often focused on whether anybody
actually uses octal any more, and whether we should cater to those
old-timers in any case, that is almost entirely beside the point.

Assume the rare complete newcomer to computing who does, either
occasionally or as a matter of habit, use leading zeros for decimal
numbers. Python could either:

a)  silently do the wrong thing with their numbers, as it does now;
b)  immediately disabuse them of the notion that this is viable syntax
    (and yes, the SyntaxWarning should be more gentle than it currently
    is, but that is a subject for a different PEP); or
c)  let them continue to think that computers are happy with multi-digit
    decimal integers which start with "0".

Some people passionately believe that (c) is the correct answer, and
they would be absolutely right if we could be sure that new users will
never blossom and grow and start writing AJAX applications.

So while a new Python user may (currently) be mystified at the delayed
discovery that their numbers don't work properly, we can fix it by
explaining to them immediately that Python doesn't like leading zeros
(hopefully with a reasonable message!), or we can delegate this teaching
experience to the JavaScript interpreter in the browser, and let them
try to debug their issue there.

Supported radices

This PEP proposes that the supported radices for the Python language
will be 2, 8, 10, and 16.

Once it is agreed that the old syntax for octal (radix 8) representation
of integers must be removed from the language, the next obvious question
is "Do we actually need a way to specify (and display) numbers in
octal?"

This question is quickly followed by "What radices does the language
need to support?" Because computers are so adept at doing what you tell
them to, a tempting answer in the discussion was "all of them." This
answer has obviously been given before -- the int() constructor will
accept an explicit radix with a value between 2 and 36, inclusive, with
the latter number bearing a suspicious arithmetic similarity to the sum
of the number of numeric digits and the number of same-case letters in
the ASCII alphabet.
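The arithmetic behind that limit can be checked directly. This sketch assumes a Python 3 interpreter and uses only the standard string module.

```python
import string

# Ten numeric digits plus twenty-six same-case letters.
alphabet = string.digits + string.ascii_lowercase
max_radix = len(alphabet)

largest_digit = int("z", 36)     # the last symbol of the base 36 alphabet
mixed_case = int("Zz", 36)       # letters are accepted in either case

# A radix above 36 is refused.
try:
    int("10", 37)
except ValueError:
    radix_37_allowed = False
else:
    radix_37_allowed = True
```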

But the best argument for inclusion will have a use-case to back it up,
so the idea of supporting all radices was quickly rejected, and the only
radices left with any real support were decimal, hexadecimal, octal, and
binary.

Just because a particular radix has a vocal supporter on the mailing
list does not mean that it really should be in the language, so the rest
of this section is a treatise on the utility of these particular
radices, vs. other possible choices.

Humans use other numeric bases constantly. If I tell you that it is
12:30 PM, I have communicated quantitative information arguably composed
of three separate bases (12, 60, and 2), only one of which is in the
"agreed" list above. But the communication of that information used two
decimal digits each for the base 12 and base 60 information, and,
perversely, two letters for information which could have fit in a single
decimal digit.

So, in general, humans communicate "normal" (non-computer) numerical
information either via names (AM, PM, January, ...) or via use of
decimal notation. Obviously, names are seldom used for large sets of
items, so decimal is used for everything else. There are studies which
attempt to explain why this is so, typically reaching the expected
conclusion that the Arabic numeral system is well-suited to human
cognition.[3]

There is even support in the history of the design of computers to
indicate that decimal notation is the correct way for computers to
communicate with humans. One of the first modern computers, ENIAC[4],
computed in decimal, even though there were already existing computers
which operated in binary.

Decimal computer operation was important enough that many computers,
including the ubiquitous PC, have instructions designed to operate on
"binary coded decimal" (BCD)[5], a representation which devotes 4 bits
to each decimal digit. These instructions date from a time when the most
strenuous calculations ever performed on many numbers were the
calculations actually required to perform textual I/O with them. It is
possible to display BCD without having to perform a divide/remainder
operation on every displayed digit, and this was a huge computational
win when most hardware didn't have fast divide capability. Another
factor contributing to the use of BCD is that, with BCD calculations,
rounding will happen exactly the same way that a human would do it, so
BCD is still sometimes used in fields like finance, despite the
computational and storage superiority of binary.
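The display advantage of BCD can be sketched in a few lines. The helper names to_bcd and bcd_to_str are hypothetical; the point is that the conversion to text needs only shifts and masks, while packing a binary integer into BCD is where the division happens.

```python
def to_bcd(n):
    """Pack a non-negative int into BCD, one decimal digit per 4 bits."""
    bcd = 0
    shift = 0
    while True:
        n, digit = divmod(n, 10)      # division is needed when packing
        bcd |= digit << shift
        shift += 4
        if n == 0:
            return bcd


def bcd_to_str(bcd):
    """Render a BCD value as decimal text using only shifts and masks."""
    digits = []
    while True:
        digits.append(chr(ord("0") + (bcd & 0xF)))   # low nibble is a digit
        bcd >>= 4
        if bcd == 0:
            break
    return "".join(reversed(digits))
```

Note that the hexadecimal rendering of a BCD value reads the same as the decimal number it encodes: to_bcd(1234) is 0x1234.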

So, if it weren't for the fact that computers themselves normally use
binary for efficient computation and data storage, string
representations of integers would probably always be in decimal.

Unfortunately, computer hardware doesn't think like humans, so
programmers and hardware engineers must often resort to thinking like
the computer, which means that it is important for Python to have the
ability to communicate binary data in a form that is understandable to
humans.

The requirement that the binary data notation must be cognitively easy
for humans to process means that it should contain an integral number of
binary digits (bits) per symbol, while otherwise conforming quite
closely to the standard tried-and-true decimal notation (position
indicates power, larger magnitude on the left, not too many symbols in
the alphabet, etc.).

The obvious "sweet spot" for this binary data notation is thus octal,
which packs the largest integral number of bits possible into a single
symbol chosen from the Arabic numeral alphabet.

In fact, some computer architectures, such as the PDP8 and the 8080/Z80,
were defined in terms of octal, in the sense of arranging the bitfields
of instructions in groups of three, and using octal representations to
describe the instruction set.

Even today, octal is important because of bit-packed structures which
consist of 3 bits per field, such as Unix file permission masks.

But octal has a drawback when used for larger numbers. The number of
bits per symbol, while integral, is not itself a power of two. This
limitation (given that the word size of most computers these days is a
power of two) has resulted in hexadecimal, which is more popular than
octal despite the fact that it requires a 60% larger alphabet than
decimal, because each symbol contains 4 bits.

Some numbers, such as Unix file permission masks, are easily decoded by
humans when represented in octal, but difficult to decode in
hexadecimal, while other numbers are much easier for humans to handle in
hexadecimal.
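A permission mask makes the contrast concrete. In this sketch (Python 3 assumed; rwx is a hypothetical helper), each octal digit maps to exactly one 3-bit field, whereas the hexadecimal digits straddle field boundaries.

```python
def rwx(bits):
    """Render one 3-bit permission field as an 'rwx' style string."""
    return "".join(
        letter if bits & mask else "-"
        for letter, mask in (("r", 0b100), ("w", 0b010), ("x", 0b001))
    )


mode = 0o754

# One octal digit per field: user, group, other.
fields = [(mode >> shift) & 0o7 for shift in (6, 3, 0)]
symbolic = "".join(rwx(field) for field in fields)

as_octal = format(mode, "o")   # digits line up with the three fields
as_hex = format(mode, "x")     # same value, but the fields are not visible
```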

Unfortunately, there are also binary numbers used in computers which are
not very well communicated in either hexadecimal or octal. Thankfully,
fewer people have to deal with these on a regular basis, but on the
other hand, this means that several people on the discussion list
questioned the wisdom of adding a straight binary representation to
Python.

One example of where these numbers are very useful is in reading and
writing hardware registers. Sometimes hardware designers will eschew
human readability and opt for address space efficiency, by packing
multiple bit fields into a single hardware register at unaligned bit
locations, and it is tedious and error-prone for a human to reconstruct
a 5 bit field which consists of the upper 3 bits of one hex digit, and
the lower 2 bits of the next hex digit.
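A sketch of that situation, with an entirely hypothetical 8-bit register layout: a 5-bit field occupying bits 1 through 5, so that it spans the upper 3 bits of the low hex digit and the lower 2 bits of the high hex digit.

```python
REGISTER = 0b00101010       # hypothetical register value, 0x2a in hex
FIELD_SHIFT = 1             # hypothetical: the field starts at bit 1
FIELD_MASK = 0b11111        # and is five bits wide

field = (REGISTER >> FIELD_SHIFT) & FIELD_MASK

as_hex = format(REGISTER, "02x")      # the field is hard to pick out here
as_binary = format(REGISTER, "08b")   # here it is characters 2 through 6
field_bits = format(field, "05b")
```

In the binary rendering the field can be read off directly, while the hex rendering requires mentally splitting both digits.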

Even if the ability of Python to communicate binary information to
humans is only useful for a small technical subset of the population, it
is exactly that population subset which contains most, if not all,
members of the Python core team, so even straight binary, the least
useful of these notations, has several enthusiastic supporters and few,
if any, staunch opponents, among the Python community.

Syntax for supported radices

This proposal is to use a "0o" prefix with either uppercase or lowercase
"o" for octal, and a "0b" prefix with either uppercase or lowercase "b"
for binary.

There was strong support for not supporting uppercase, but this is a
separate subject for a different PEP, as 'j' for complex numbers, 'e'
for exponent, and 'r' for raw string (to name a few) already support
uppercase.

The syntax for delimiting the different radices received a lot of
attention in the discussion on Python-3000. There are several (sometimes
conflicting) requirements and "nice-to-haves" for this syntax:

-   It should be as compatible with other languages and previous
    versions of Python as is reasonable, both for the input syntax and
    for the output (e.g. string % operator) syntax.
-   It should be as obvious to the casual observer as possible.
-   It should be easy to visually distinguish integers formatted in the
    different bases.

Proposed syntaxes included things like arbitrary radix prefixes, such as
16r100 (256 in hexadecimal), and radix suffixes, similar to the 100h
assembler-style suffix. The debate on whether the letter "O" could be
used for octal was intense -- an uppercase "O" looks suspiciously
similar to a zero in some fonts. Suggestions were made to use a "c" (the
second letter of "oCtal"), or even to use a "t" for "ocTal" and an "n"
for "biNary" to go along with the "x" for "heXadecimal".

For the string % operator, "o" was already being used to denote octal.
Binary formatting is not being added to the % operator because PEP 3101
(Advanced String Formatting) already supports binary, and because %
formatting will be deprecated in the future.

At the end of the day, since uppercase "O" can look like a zero and
uppercase "B" can look like an 8, it was decided that these prefixes
should be lowercase only, but, like 'r' for raw string, that can be a
preference or style-guide issue.

Open Issues

It was suggested in the discussion that lowercase should be used for all
numeric and string special modifiers, such as 'x' for hexadecimal, 'r'
for raw strings, 'e' for exponentiation, and 'j' for complex numbers.
This is an issue for a separate PEP.

This PEP takes no position on uppercase or lowercase for input, just
noting that, for consistency, if uppercase is not to be removed from
input parsing for other letters, it should be added for octal and
binary, and documenting the changes under this assumption, as there is
not yet a PEP about the case issue.

Output formatting may be a different story -- there is already ample
precedent for case sensitivity in the output format string, and there
would need to be a consensus that there is a valid use-case for the
"alternate form" of the string % operator to support uppercase 'B' or
'O' characters for binary or octal output. Currently, PEP 3101 does not
even support this alternate capability, and the hex() function does not
allow the programmer to specify the case of the 'x' character.
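For comparison, this is how the case question plays out in current CPython 3 releases. It describes behavior released after this PEP was written, not something the PEP specifies, and should be read as an observation about the interpreter assumed here.

```python
value = 255

builtin = hex(value)              # always a lowercase prefix and digits
lower = format(value, "#x")
upper = format(value, "#X")       # uppercase prefix and digits
percent_upper = "%#X" % value

# There is no uppercase counterpart for binary output.
try:
    format(5, "#B")
except ValueError:
    upper_b_supported = False
else:
    upper_b_supported = True
```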

There are still some strong feelings that '0123' should be allowed as a
literal decimal in Python 3.0. If this is the right thing to do, this
can easily be covered in an additional PEP. This proposal only takes the
first step of making '0123' not be a valid octal number, for reasons
covered in the rationale.

Is there (or should there be) an option for the 2to3 translator which
only makes the 2.6 compatible changes? Should this be run on 2.6 library
code before the 2.6 release?

Should a bin() function which matches hex() and oct() be added?

Is hex() really that useful once we have advanced string formatting?
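For readers on a released Python 3, where a bin() builtin does exist alongside oct() and hex(), the overlap with advanced string formatting raised by these two questions can be seen in a short sketch:

```python
value = 10

# The three builtins always include the radix prefix.
builtins_output = (bin(value), oct(value), hex(value))

# The "#" alternate form of format() produces the same strings.
format_output = tuple(format(value, spec) for spec in ("#b", "#o", "#x"))
```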

References

[1] GNU libc manual printf integer format conversions
(http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html)

[2] Python string formatting operations
(http://docs.python.org/library/stdtypes.html#string-formatting-operations)

[3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman
(http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf)

[4] ENIAC page at Wikipedia (http://en.wikipedia.org/wiki/ENIAC)

[5] BCD page at Wikipedia
(http://en.wikipedia.org/wiki/Binary-coded_decimal)

Copyright

This document has been placed in the public domain.