PEP: 3116 Title: New I/O Version: $Revision$ Last-Modified: $Date$
Author: Daniel Stutzbach <daniel@stutzbachenterprises.com>, Guido van
Rossum <guido@python.org>, Mike Verdone <mike.verdone@gmail.com> Status:
Final Type: Standards Track Content-Type: text/x-rst Created:
26-Feb-2007 Python-Version: 3.0 Post-History: 26-Feb-2007

Rationale and Goals

Python allows for a variety of stream-like (a.k.a. file-like) objects
that can be used via read() and write() calls. Anything that provides
read() and write() is stream-like. However, more exotic and extremely
useful functions like readline() or seek() may or may not be available
on every stream-like object. Python needs a specification for basic
byte-based I/O streams to which we can add buffering and text-handling
features.

Once we have a defined raw byte-based I/O interface, we can add
buffering and text handling layers on top of any byte-based I/O class.
The same buffering and text handling logic can be used for files,
sockets, byte arrays, or custom I/O classes developed by Python
programmers. Developing a standard definition of a stream lets us
separate stream-based operations like read() and write() from
implementation specific operations like fileno() and isatty(). It
encourages programmers to write code that uses streams as streams and
not require that all streams support file-specific or socket-specific
operations.

The new I/O spec is intended to be similar to the Java I/O libraries,
but generally less confusing. Programmers who don't want to muck about
in the new I/O world can expect that the open() factory method will
produce an object backwards-compatible with old-style file objects.

Specification

The Python I/O Library will consist of three layers: a raw I/O layer, a
buffered I/O layer, and a text I/O layer. Each layer is defined by an
abstract base class, which may have multiple implementations. The raw
I/O and buffered I/O layers deal with units of bytes, while the text I/O
layer deals with units of characters.

Raw I/O

The abstract base class for raw I/O is RawIOBase. It has several methods
which are wrappers around the appropriate operating system calls. If one
of these functions would not make sense on the object, the
implementation must raise an IOError exception. For example, if a file
is opened read-only, the .write() method will raise an IOError. As
another example, if the object represents a socket, then .seek(),
.tell(), and .truncate() will raise an IOError. Generally, a call to one
of these functions maps to exactly one operating system call.

  .read(n: int) -> bytes

    Read up to n bytes from the object and return them. Fewer than n
    bytes may be returned if the operating system call returns fewer
    than n bytes. If 0 bytes are returned, this indicates end of file.
    If the object is in non-blocking mode and no bytes are available,
    the call returns None.

  .readinto(b: bytes) -> int

    Read up to len(b) bytes from the object and stores them in b,
    returning the number of bytes read. Like .read, fewer than len(b)
    bytes may be read, and 0 indicates end of file. None is returned if
    a non-blocking object has no bytes available. The length of b is
    never changed.

  .write(b: bytes) -> int

    Returns number of bytes written, which may be < len(b).

  .seek(pos: int, whence: int = 0) -> int

  .tell() -> int

  .truncate(n: int = None) -> int

  .close() -> None

Additionally, it defines a few other methods:

  .readable() -> bool

    Returns True if the object was opened for reading, False otherwise.
    If False, .read() will raise an IOError if called.

  .writable() -> bool

    Returns True if the object was opened for writing, False otherwise.
    If False, .write() and .truncate() will raise an IOError if called.

  .seekable() -> bool

    Returns True if the object supports random access (such as disk
    files), or False if the object only supports sequential access (such
    as sockets, pipes, and ttys). If False, .seek(), .tell(), and
    .truncate() will raise an IOError if called.

  .__enter__() -> ContextManager

    Context management protocol. Returns self.

  .__exit__(...) -> None

    Context management protocol. Same as .close().

If and only if a RawIOBase implementation operates on an underlying file
descriptor, it must additionally provide a .fileno() member function.
This could be defined specifically by the implementation, or a mix-in
class could be used (need to decide about this).

  .fileno() -> int

    Returns the underlying file descriptor (an integer)

Initially, three implementations will be provided that implement the
RawIOBase interface: FileIO, SocketIO (in the socket module), and
ByteIO. Each implementation must determine whether the object supports
random access as the information provided by the user may not be
sufficient (consider open("/dev/tty", "rw") or
open("/tmp/named-pipe", "rw")). As an example, FileIO can determine this
by calling the seek() system call; if it returns an error, the object
does not support random access. Each implementation may provided
additional methods appropriate to its type. The ByteIO object is
analogous to Python 2's cStringIO library, but operating on the new
bytes type instead of strings.

Buffered I/O

The next layer is the Buffered I/O layer which provides more efficient
access to file-like objects. The abstract base class for all Buffered
I/O implementations is BufferedIOBase, which provides similar methods to
RawIOBase:

  .read(n: int = -1) -> bytes

    Returns the next n bytes from the object. It may return fewer than n
    bytes if end-of-file is reached or the object is non-blocking. 0
    bytes indicates end-of-file. This method may make multiple calls to
    RawIOBase.read() to gather the bytes, or may make no calls to
    RawIOBase.read() if all of the needed bytes are already buffered.

  .readinto(b: bytes) -> int

  .write(b: bytes) -> int

    Write b bytes to the buffer. The bytes are not guaranteed to be
    written to the Raw I/O object immediately; they may be buffered.
    Returns len(b).

  .seek(pos: int, whence: int = 0) -> int

  .tell() -> int

  .truncate(pos: int = None) -> int

  .flush() -> None

  .close() -> None

  .readable() -> bool

  .writable() -> bool

  .seekable() -> bool

  .__enter__() -> ContextManager

  .__exit__(...) -> None

Additionally, the abstract base class provides one member variable:

  .raw

    A reference to the underlying RawIOBase object.

The BufferedIOBase methods signatures are mostly identical to that of
RawIOBase (exceptions: write() returns None, read()'s argument is
optional), but may have different semantics. In particular,
BufferedIOBase implementations may read more data than requested or
delay writing data using buffers. For the most part, this will be
transparent to the user (unless, for example, they open the same file
through a different descriptor). Also, raw reads may return a short read
without any particular reason; buffered reads will only return a short
read if EOF is reached; and raw writes may return a short count (even
when non-blocking I/O is not enabled!), while buffered writes will raise
IOError when not all bytes could be written or buffered.

There are four implementations of the BufferedIOBase abstract base
class, described below.

BufferedReader

The BufferedReader implementation is for sequential-access read-only
objects. Its .flush() method is a no-op.

BufferedWriter

The BufferedWriter implementation is for sequential-access write-only
objects. Its .flush() method forces all cached data to be written to the
underlying RawIOBase object.

BufferedRWPair

The BufferedRWPair implementation is for sequential-access read-write
objects such as sockets and ttys. As the read and write streams of these
objects are completely independent, it could be implemented by simply
incorporating a BufferedReader and BufferedWriter instance. It provides
a .flush() method that has the same semantics as a BufferedWriter's
.flush() method.

BufferedRandom

The BufferedRandom implementation is for all random-access objects,
whether they are read-only, write-only, or read-write. Compared to the
previous classes that operate on sequential-access objects, the
BufferedRandom class must contend with the user calling .seek() to
reposition the stream. Therefore, an instance of BufferedRandom must
keep track of both the logical and true position within the object. It
provides a .flush() method that forces all cached write data to be
written to the underlying RawIOBase object and all cached read data to
be forgotten (so that future reads are forced to go back to the disk).

Q: Do we want to mandate in the specification that switching between
reading and writing on a read-write object implies a .flush()? Or is
that an implementation convenience that users should not rely on?

For a read-only BufferedRandom object, .writable() returns False and the
.write() and .truncate() methods throw IOError.

For a write-only BufferedRandom object, .readable() returns False and
the .read() method throws IOError.

Text I/O

The text I/O layer provides functions to read and write strings from
streams. Some new features include universal newlines and character set
encoding and decoding. The Text I/O layer is defined by a TextIOBase
abstract base class. It provides several methods that are similar to the
BufferedIOBase methods, but operate on a per-character basis instead of
a per-byte basis. These methods are:

  .read(n: int = -1) -> str

  .write(s: str) -> int

  .tell() -> object

    Return a cookie describing the current file position. The only
    supported use for the cookie is with .seek() with whence set to 0
    (i.e. absolute seek).

  .seek(pos: object, whence: int = 0) -> int

    Seek to position pos. If pos is non-zero, it must be a cookie
    returned from .tell() and whence must be zero.

  .truncate(pos: object = None) -> int

    Like BufferedIOBase.truncate(), except that pos (if not None) must
    be a cookie previously returned by .tell().

Unlike with raw I/O, the units for .seek() are not specified - some
implementations (e.g. StringIO) use characters and others (e.g.
TextIOWrapper) use bytes. The special case for zero is to allow going to
the start or end of a stream without a prior .tell(). An implementation
could include stream encoder state in the cookie returned from .tell().

TextIOBase implementations also provide several methods that are
pass-throughs to the underlying BufferedIOBase objects:

  .flush() -> None

  .close() -> None

  .readable() -> bool

  .writable() -> bool

  .seekable() -> bool

TextIOBase class implementations additionally provide the following
methods:

  .readline() -> str

    Read until newline or EOF and return the line, or "" if EOF hit
    immediately.

  .__iter__() -> Iterator

    Returns an iterator that returns lines from the file (which happens
    to be self).

  .next() -> str

    Same as readline() except raises StopIteration if EOF hit
    immediately.

Two implementations will be provided by the Python library. The primary
implementation, TextIOWrapper, wraps a Buffered I/O object. Each
TextIOWrapper object has a property named ".buffer" that provides a
reference to the underlying BufferedIOBase object. Its initializer has
the following signature:

  .__init__(self, buffer, encoding=None, errors=None, newline=None, line_buffering=False)

    buffer is a reference to the BufferedIOBase object to be wrapped
    with the TextIOWrapper.

    encoding refers to an encoding to be used for translating between
    the byte-representation and character-representation. If it is None,
    then the system's locale setting will be used as the default.

    errors is an optional string indicating error handling. It may be
    set whenever encoding may be set. It defaults to 'strict'.

    newline can be None, '', '\n', '\r', or '\r\n'; all other values are
    illegal. It controls the handling of line endings. It works as
    follows:

    -   On input, if newline is None, universal newlines mode is
        enabled. Lines in the input can end in '\n', '\r', or '\r\n',
        and these are translated into '\n' before being returned to the
        caller. If it is '', universal newline mode is enabled, but line
        endings are returned to the caller untranslated. If it has any
        of the other legal values, input lines are only terminated by
        the given string, and the line ending is returned to the caller
        untranslated. (In other words, translation to '\n' only occurs
        if newline is None.)
    -   On output, if newline is None, any '\n' characters written are
        translated to the system default line separator, os.linesep. If
        newline is '', no translation takes place. If newline is any of
        the other legal values, any '\n' characters written are
        translated to the given string. (Note that the rules guiding
        translation are different for output than for input.)

    line_buffering, if True, causes write() calls to imply a flush() if
    the string written contains at least one '\n' or '\r' character.
    This is set by open() when it detects that the underlying stream is
    a TTY device, or when a buffering argument of 1 is passed.

    Further notes on the newline parameter:

    -   '\r' support is still needed for some OSX applications that
        produce files using '\r' line endings; Excel (when exporting to
        text) and Adobe Illustrator EPS files are the most common
        examples.
    -   If translation is enabled, it happens regardless of which method
        is called for reading or writing. For example, f.read() will
        always produce the same result as ''.join(f.readlines()).
    -   If universal newlines without translation are requested on input
        (i.e. newline=''), if a system read operation returns a buffer
        ending in '\r', another system read operation is done to
        determine whether it is followed by '\n' or not. In universal
        newlines mode with translation, the second system read operation
        may be postponed until the next read request, and if the
        following system read operation returns a buffer starting with
        '\n', that character is simply discarded.

Another implementation, StringIO, creates a file-like TextIO
implementation without an underlying Buffered I/O object. While similar
functionality could be provided by wrapping a BytesIO object in a
TextIOWrapper, the StringIO object allows for much greater efficiency as
it does not need to actually performing encoding and decoding. A String
I/O object can just store the encoded string as-is. The StringIO
object's __init__ signature takes an optional string specifying the
initial value; the initial position is always 0. It does not support
encodings or newline translations; you always read back exactly the
characters you wrote.

Unicode encoding/decoding Issues

We should allow changing the encoding and error-handling setting later.
The behavior of Text I/O operations in the face of Unicode problems and
ambiguities (e.g. diacritics, surrogates, invalid bytes in an encoding)
should be the same as that of the unicode encode()/decode() methods.
UnicodeError may be raised.

Implementation note: we should be able to reuse much of the
infrastructure provided by the codecs module. If it doesn't provide the
exact APIs we need, we should refactor it to avoid reinventing the
wheel.

Non-blocking I/O

Non-blocking I/O is fully supported on the Raw I/O level only. If a raw
object is in non-blocking mode and an operation would block, then
.read() and .readinto() return None, while .write() returns 0. In order
to put an object in non-blocking mode, the user must extract the fileno
and do it by hand.

At the Buffered I/O and Text I/O layers, if a read or write fails due a
non-blocking condition, they raise an IOError with errno set to EAGAIN.

Originally, we considered propagating up the Raw I/O behavior, but many
corner cases and problems were raised. To address these issues,
significant changes would need to have been made to the Buffered I/O and
Text I/O layers. For example, what should .flush() do on a Buffered
non-blocking object? How would the user instruct the object to "Write as
much as you can from your buffer, but don't block"? A non-blocking
.flush() that doesn't necessarily flush all available data is
counter-intuitive. Since non-blocking and blocking objects would have
such different semantics at these layers, it was agreed to abandon
efforts to combine them into a single type.

The open() Built-in Function

The open() built-in function is specified by the following pseudo-code:

    def open(filename, mode="r", buffering=None, *,
             encoding=None, errors=None, newline=None):
        assert isinstance(filename, (str, int))
        assert isinstance(mode, str)
        assert buffering is None or isinstance(buffering, int)
        assert encoding is None or isinstance(encoding, str)
        assert newline in (None, "", "\n", "\r", "\r\n")
        modes = set(mode)
        if modes - set("arwb+t") or len(mode) > len(modes):
            raise ValueError("invalid mode: %r" % mode)
        reading = "r" in modes
        writing = "w" in modes
        binary = "b" in modes
        appending = "a" in modes
        updating = "+" in modes
        text = "t" in modes or not binary
        if text and binary:
            raise ValueError("can't have text and binary mode at once")
        if reading + writing + appending > 1:
            raise ValueError("can't have read/write/append mode at once")
        if not (reading or writing or appending):
            raise ValueError("must have exactly one of read/write/append mode")
        if binary and encoding is not None:
            raise ValueError("binary modes doesn't take an encoding arg")
        if binary and errors is not None:
            raise ValueError("binary modes doesn't take an errors arg")
        if binary and newline is not None:
            raise ValueError("binary modes doesn't take a newline arg")
        # XXX Need to spec the signature for FileIO()
        raw = FileIO(filename, mode)
        line_buffering = (buffering == 1 or buffering is None and raw.isatty())
        if line_buffering or buffering is None:
            buffering = 8*1024  # International standard buffer size
            # XXX Try setting it to fstat().st_blksize
        if buffering < 0:
            raise ValueError("invalid buffering size")
        if buffering == 0:
            if binary:
                return raw
            raise ValueError("can't have unbuffered text I/O")
        if updating:
            buffer = BufferedRandom(raw, buffering)
        elif writing or appending:
            buffer = BufferedWriter(raw, buffering)
        else:
            assert reading
            buffer = BufferedReader(raw, buffering)
        if binary:
            return buffer
        assert text
        return TextIOWrapper(buffer, encoding, errors, newline, line_buffering)

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 70 coding: utf-8 End: