PEP: 400 Title: Deprecate codecs.StreamReader and codecs.StreamWriter
Version: $Revision$ Last-Modified: $Date$ Author: Victor Stinner
<vstinner@python.org> Status: Deferred Type: Standards Track
Content-Type: text/x-rst Created: 28-May-2011 Python-Version: 3.3

Abstract

io.TextIOWrapper and codecs.StreamReaderWriter offer the same API [1].
TextIOWrapper has more features and is faster than StreamReaderWriter.
Duplicate code means that bugs should be fixed twice and that we may
have subtle differences between the two implementations.

The codecs module was introduced in Python 2.0 (see the PEP 100). The io
module was introduced in Python 2.6 and 3.0 (see the PEP 3116), and
reimplemented in C in Python 2.7 and 3.1.

PEP Deferral

Further exploration of the concepts covered in this PEP has been
deferred for lack of a current champion interested in promoting the
goals of the PEP and collecting and incorporating feedback, and with
sufficient available time to do so effectively.

Motivation

When the Python I/O model was updated for 3.0, the concept of a
"stream-with-known-encoding" was introduced in the form of
io.TextIOWrapper. As this class is critical to the performance of
text-based I/O in Python 3, this module has an optimised C version which
is used by CPython by default. Many corner cases in handling buffering,
stateful codecs and universal newlines have been dealt with since the
release of Python 3.0.

This new interface overlaps heavily with the legacy codecs.StreamReader,
codecs.StreamWriter and codecs.StreamReaderWriter interfaces that were
part of the original codec interface design in PEP 100. These interfaces
are organised around the principle of an encoding with an associated
stream (i.e. the reverse of arrangement in the io module), so the
original PEP 100 design required that codec writers provide appropriate
StreamReader and StreamWriter implementations in addition to the core
codec encode() and decode() methods. This places a heavy burden on codec
authors providing these specialised implementations to correctly handle
many of the corner cases (see Appendix A) that have now been dealt with
by io.TextIOWrapper. While deeper integration between the codec and the
stream allows for additional optimisations in theory, these
optimisations have in practice either not been carried out and else the
associated code duplication means that the corner cases that have been
fixed in io.TextIOWrapper are still not handled correctly in the various
StreamReader and StreamWriter implementations.

Accordingly, this PEP proposes that:

-   codecs.open() be updated to delegate to the builtin open() in Python
    3.3;
-   the legacy codecs.Stream* interfaces, including the streamreader and
    streamwriter attributes of codecs.CodecInfo be deprecated in Python
    3.3.

Rationale

StreamReader and StreamWriter issues

-   StreamReader is unable to translate newlines.
-   StreamWriter doesn't support "line buffering" (flush if the input
    text contains a newline).
-   StreamReader classes of the CJK encodings (e.g. GB18030) only
    supports UNIX newlines ('\n').
-   StreamReader and StreamWriter are stateful codecs but don't expose
    functions to control their state (getstate() or setstate()). Each
    codec has to handle corner cases, see Appendix A.
-   StreamReader and StreamWriter are very similar to IncrementalReader
    and IncrementalEncoder, some code is duplicated for stateful codecs
    (e.g. UTF-16).
-   Each codec has to reimplement its own StreamReader and StreamWriter
    class, even if it's trivial (just call the encoder/decoder).
-   codecs.open(filename, "r") creates an io.TextIOWrapper object.
-   No codec implements an optimized method in StreamReader or
    StreamWriter based on the specificities of the codec.

Issues in the bug tracker:

-   Issue #5445 (2009-03-08): codecs.StreamWriter.writelines problem
    when passed generator
-   Issue #7262: (2009-11-04): codecs.open() + eol (windows)
-   Issue #8260 (2010-03-29): When I use codecs.open(...) and
    f.readline() follow up by f.read() return bad result
-   Issue #8630 (2010-05-05): Keepends param in codec readline(s)
-   Issue #10344 (2010-11-06): codecs.readline doesn't care buffering
-   Issue #11461 (2011-03-10): Reading UTF-16 with codecs.readline()
    breaks on surrogate pairs
-   Issue #12446 (2011-06-30): StreamReader Readlines behavior odd
-   Issue #12508 (2011-07-06): Codecs Anomaly
-   Issue #12512 (2011-07-07): codecs: StreamWriter issues with stateful
    codecs after a seek or with append mode
-   Issue #12513 (2011-07-07): codec.StreamReaderWriter: issues with
    interlaced read-write

TextIOWrapper features

-   TextIOWrapper supports any kind of newline, including translating
    newlines (to UNIX newlines), to read and write.
-   TextIOWrapper reuses codecs incremental encoders and decoders (no
    duplication of code).
-   The io module (TextIOWrapper) is faster than the codecs module
    (StreamReader). It is implemented in C, whereas codecs is
    implemented in Python.
-   TextIOWrapper has a readahead algorithm which speeds up small reads:
    read character by character or line by line (io is 10x through 25x
    faster than codecs on these operations).
-   TextIOWrapper has a write buffer.
-   TextIOWrapper.tell() is optimized.
-   TextIOWrapper supports random access (read+write) using a single
    class which permit to optimize interlaced read-write (but no such
    optimization is implemented).

TextIOWrapper issues

-   Issue #12215 (2011-05-30): TextIOWrapper: issues with interlaced
    read-write

Possible improvements of StreamReader and StreamWriter

By adding codec state read/write functions to the StreamReader and
StreamWriter classes, it will become possible to fix issues with
stateful codecs in a base class instead of in each stateful StreamReader
and StreamWriter classes.

It would be possible to change StreamReader and StreamWriter to make
them use IncrementalDecoder and IncrementalEncoder.

A codec can implement variants which are optimized for the specific
encoding or intercept certain stream methods to add functionality or
improve the encoding/decoding performance. TextIOWrapper cannot
implement such optimization, but TextIOWrapper uses incremental encoders
and decoders and uses read and write buffers, so the overhead of
incomplete inputs is low or nul.

A lot more could be done for other variable length encoding codecs, e.g.
UTF-8, since these often have problems near the end of a read due to
missing bytes. The UTF-32-BE/LE codecs could simply multiply the
character position by 4 to get the byte position.

Usage of StreamReader and StreamWriter

These classes are rarely used directly, but indirectly using
codecs.open(). They are not used in Python 3 standard library (except in
the codecs module).

Some projects implement their own codec with StreamReader and
StreamWriter, but don't use these classes.

Backwards Compatibility

Keep the public API, codecs.open

codecs.open() can be replaced by the builtin open() function. open() has
a similar API but has also more options. Both functions return file-like
objects (same API).

codecs.open() was the only way to open a text file in Unicode mode until
Python 2.6. Many Python 2 programs uses this function. Removing
codecs.open() implies more work to port programs from Python 2 to Python
3, especially projects using the same code base for the two Python
versions (without using 2to3 program).

codecs.open() is kept for backward compatibility with Python 2.

Deprecate StreamReader and StreamWriter

Instantiating StreamReader or StreamWriter must emit a
DeprecationWarning in Python 3.3. Defining a subclass doesn't emit a
DeprecationWarning.

codecs.open() will be changed to reuse the builtin open() function
(TextIOWrapper) to read-write text files.

Alternative Approach

An alternative to the deprecation of the codecs.Stream* classes is to
rename codecs.open() to codecs.open_stream(), and to create a new
codecs.open() function reusing open() and so io.TextIOWrapper.

Appendix A: Issues with stateful codecs

It is difficult to use correctly a stateful codec with a stream. Some
cases are supported by the codecs module, while io has no more known bug
related to stateful codecs. The main difference between the codecs and
the io module is that bugs have to be fixed in StreamReader and/or
StreamWriter classes of each codec for the codecs module, whereas bugs
can be fixed only once in io.TextIOWrapper. Here are some examples of
issues with stateful codecs.

Stateful codecs

Python supports the following stateful codecs:

-   cp932
-   cp949
-   cp950
-   euc_jis_2004
-   euc_jisx2003
-   euc_jp
-   euc_kr
-   gb18030
-   gbk
-   hz
-   iso2022_jp
-   iso2022_jp_1
-   iso2022_jp_2
-   iso2022_jp_2004
-   iso2022_jp_3
-   iso2022_jp_ext
-   iso2022_kr
-   shift_jis
-   shift_jis_2004
-   shift_jisx0213
-   utf_8_sig
-   utf_16
-   utf_32

Read and seek(0)

    with open(filename, 'w', encoding='utf-16') as f:
        f.write('abc')
        f.write('def')
        f.seek(0)
        assert f.read() == 'abcdef'
        f.seek(0)
        assert f.read() == 'abcdef'

The io and codecs modules support this usecase correctly.

seek(n)

    with open(filename, 'w', encoding='utf-16') as f:
        f.write('abc')
        pos = f.tell()
    with open(filename, 'w', encoding='utf-16') as f:
        f.seek(pos)
        f.write('def')
        f.seek(0)
        f.write('###')
    with open(filename, 'r', encoding='utf-16') as f:
        assert f.read() == '###def'

The io module supports this usecase, whereas codecs fails because it
writes a new BOM on the second write (issue #12512).

Append mode

    with open(filename, 'w', encoding='utf-16') as f:
        f.write('abc')
    with open(filename, 'a', encoding='utf-16') as f:
        f.write('def')
    with open(filename, 'r', encoding='utf-16') as f:
        assert f.read() == 'abcdef'

The io module supports this usecase, whereas codecs fails because it
writes a new BOM on the second write (issue #12512).

Links

-   PEP 100: Python Unicode Integration <100>
-   PEP 3116: New I/O <3116>
-   Issue #8796: Deprecate codecs.open()
-   [python-dev] Deprecate codecs.open() and StreamWriter/StreamReader

Copyright

This document has been placed in the public domain.

Footnotes

[1] StreamReaderWriter has two more attributes than TextIOWrapper,
reader and writer.