PEP: 293 Title: Codec Error Handling Callbacks Version: $Revision$
Last-Modified: $Date$ Author: Walter Dörwald <walter@livinglogic.de>
Status: Final Type: Standards Track Content-Type: text/x-rst Created:
18-Jun-2002 Python-Version: 2.3 Post-History: 19-Jun-2002

Abstract

This PEP aims at extending Python's fixed codec error handling schemes
with a more flexible callback based approach.

Python currently uses a fixed error handling for codec error handlers.
This PEP describes a mechanism which allows Python to use function
callbacks as error handlers. With these more flexible error handlers it
is possible to add new functionality to existing codecs by e.g.
providing fallback solutions or different encodings for cases where the
standard codec mapping does not apply.

Specification

Currently the set of codec error handling algorithms is fixed to either
"strict", "replace" or "ignore" and the semantics of these algorithms is
implemented separately for each codec.

The proposed patch will make the set of error handling algorithms
extensible through a codec error handler registry which maps handler
names to handler functions. This registry consists of the following two
C functions:

    int PyCodec_RegisterError(const char *name, PyObject *error)

    PyObject *PyCodec_LookupError(const char *name)

and their Python counterparts:

    codecs.register_error(name, error)

    codecs.lookup_error(name)

PyCodec_LookupError raises a LookupError if no callback function has
been registered under this name.

Similar to the encoding name registry there is no way of unregistering
callback functions or iterating through the available functions.

The callback functions will be used in the following way by the codecs:
when the codec encounters an encoding/decoding error, the callback
function is looked up by name, the information about the error is stored
in an exception object and the callback is called with this object. The
callback returns information about how to proceed (or raises an
exception).

For encoding, the exception object will look like this:

    class UnicodeEncodeError(UnicodeError):
        def __init__(self, encoding, object, start, end, reason):
            UnicodeError.__init__(self,
                "encoding '%s' can't encode characters " +
                "in positions %d-%d: %s" % (encoding,
                    start, end-1, reason))
            self.encoding = encoding
            self.object = object
            self.start = start
            self.end = end
            self.reason = reason

This type will be implemented in C with the appropriate setter and
getter methods for the attributes, which have the following meaning:

-   encoding: The name of the encoding;
-   object: The original unicode object for which encode() has been
    called;
-   start: The position of the first unencodable character;
-   end: (The position of the last unencodable character)+1 (or the
    length of object, if all characters from start to the end of object
    are unencodable);
-   reason: The reason why object[start:end] couldn't be encoded.

If object has consecutive unencodable characters, the encoder should
collect those characters for one call to the callback if those
characters can't be encoded for the same reason. The encoder is not
required to implement this behaviour but may call the callback for every
single character, but it is strongly suggested that the collecting
method is implemented.

The callback must not modify the exception object. If the callback does
not raise an exception (either the one passed in, or a different one),
it must return a tuple:

    (replacement, newpos)

replacement is a unicode object that the encoder will encode and emit
instead of the unencodable object[start:end] part, newpos specifies a
new position within object, where (after encoding the replacement) the
encoder will continue encoding.

Negative values for newpos are treated as being relative to end of
object. If newpos is out of bounds the encoder will raise an IndexError.

If the replacement string itself contains an unencodable character the
encoder raises the exception object (but may set a different reason
string before raising).

Should further encoding errors occur, the encoder is allowed to reuse
the exception object for the next call to the callback. Furthermore, the
encoder is allowed to cache the result of codecs.lookup_error.

If the callback does not know how to handle the exception, it must raise
a TypeError.

Decoding works similar to encoding with the following differences:

-   The exception class is named UnicodeDecodeError and the attribute
    object is the original 8bit string that the decoder is currently
    decoding.
-   The decoder will call the callback with those bytes that constitute
    one undecodable sequence, even if there is more than one undecodable
    sequence that is undecodable for the same reason directly after the
    first one. E.g. for the "unicode-escape" encoding, when decoding the
    illegal string \\u00\\u01x, the callback will be called twice (once
    for \\u00 and once for \\u01). This is done to be able to generate
    the correct number of replacement characters.
-   The replacement returned from the callback is a unicode object that
    will be emitted by the decoder as-is without further processing
    instead of the undecodable object[start:end] part.

There is a third API that uses the old strict/ignore/replace error
handling scheme:

    PyUnicode_TranslateCharmap/unicode.translate

The proposed patch will enhance PyUnicode_TranslateCharmap, so that it
also supports the callback registry. This has the additional side effect
that PyUnicode_TranslateCharmap will support multi-character replacement
strings (see SF feature request #403100[1]).

For PyUnicode_TranslateCharmap the exception class will be named
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect all
consecutive untranslatable characters (i.e. those that map to None) and
call the callback with them. The replacement returned from the callback
is a unicode object that will be put in the translated result as-is,
without further processing.

All encoders and decoders are allowed to implement the callback
functionality themselves, if they recognize the callback name (i.e. if
it is a system callback like "strict", "replace" and "ignore"). The
proposed patch will add two additional system callback names:
"backslashreplace" and "xmlcharrefreplace", which can be used for
encoding and translating and which will also be implemented in-place for
all encoders and PyUnicode_TranslateCharmap.

The Python equivalent of these five callbacks will look like this:

    def strict(exc):
        raise exc

    def ignore(exc):
        if isinstance(exc, UnicodeError):
            return (u"", exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__name__)

    def replace(exc):
         if isinstance(exc, UnicodeEncodeError):
             return ((exc.end-exc.start)*u"?", exc.end)
         elif isinstance(exc, UnicodeDecodeError):
             return (u"\\ufffd", exc.end)
         elif isinstance(exc, UnicodeTranslateError):
             return ((exc.end-exc.start)*u"\\ufffd", exc.end)
         else:
             raise TypeError("can't handle %s" % exc.__name__)

    def backslashreplace(exc):
         if isinstance(exc,
             (UnicodeEncodeError, UnicodeTranslateError)):
             s = u""
             for c in exc.object[exc.start:exc.end]:
                if ord(c)<=0xff:
                    s += u"\\x%02x" % ord(c)
                elif ord(c)<=0xffff:
                    s += u"\\u%04x" % ord(c)
                else:
                    s += u"\\U%08x" % ord(c)
             return (s, exc.end)
         else:
             raise TypeError("can't handle %s" % exc.__name__)

    def xmlcharrefreplace(exc):
         if isinstance(exc,
             (UnicodeEncodeError, UnicodeTranslateError)):
             s = u""
             for c in exc.object[exc.start:exc.end]:
                s += u"&#%d;" % ord(c)
             return (s, exc.end)
         else:
             raise TypeError("can't handle %s" % exc.__name__)

These five callback handlers will also be accessible to Python as
codecs.strict_error, codecs.ignore_error, codecs.replace_error,
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.

Rationale

Most legacy encoding do not support the full range of Unicode
characters. For these cases many high level protocols support a way of
escaping a Unicode character (e.g. Python itself supports the \x, \u and
\U convention, XML supports character references via &#xxx; etc.).

When implementing such an encoding algorithm, a problem with the current
implementation of the encode method of Unicode objects becomes apparent:
For determining which characters are unencodable by a certain encoding,
every single character has to be tried, because encode does not provide
any information about the location of the error(s), so

    # (1)
    us = u"xxx"
    s = us.encode(encoding)

has to be replaced by

    # (2)
    us = u"xxx"
    v = []
    for c in us:
        try:
            v.append(c.encode(encoding))
        except UnicodeError:
            v.append("&#%d;" % ord(c))
    s = "".join(v)

This slows down encoding dramatically as now the loop through the string
is done in Python code and no longer in C code.

Furthermore, this solution poses problems with stateful encodings. For
example, UTF-16 uses a Byte Order Mark at the start of the encoded byte
string to specify the byte order. Using (2) with UTF-16, results in an 8
bit string with a BOM between every character.

To work around this problem, a stream writer - which keeps state between
calls to the encoding function - has to be used:

    # (3)
    us = u"xxx"
    import codecs, cStringIO as StringIO
    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"&#%d;" % ord(c))
    s = v.getvalue()

To compare the speed of (1) and (3) the following test script has been
used:

    # (4)
    import time
    us = u"äa"*1000000
    encoding = "ascii"
    import codecs, cStringIO as StringIO

    t1 = time.time()

    s1 = us.encode(encoding, "replace")

    t2 = time.time()

    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"?")
    s2 = v.getvalue()

    t3 = time.time()

    assert(s1==s2)
    print "1:", t2-t1
    print "2:", t3-t2
    print "factor:", (t3-t2)/(t2-t1)

On Linux this gives the following output (with Python 2.3a0):

    1: 0.274321913719
    2: 51.1284689903
    factor: 186.381278466

i.e. (3) is 180 times slower than (1).

Callbacks must be stateless, because as soon as a callback is registered
it is available globally and can be called by multiple encode() calls.
To be able to use stateful callbacks, the errors parameter for
encode/decode/translate would have to be changed from char * to
PyObject *, so that the callback could be used directly, without the
need to register the callback globally. As this requires changes to lots
of C prototypes, this approach was rejected.

Currently all encoding/decoding functions have arguments

    const Py_UNICODE *p, int size

or

    const char *p, int size

to specify the unicode characters/8bit characters to be encoded/decoded.
So in case of an error the codec has to create a new unicode or str
object from these parameters and store it in the exception object. The
callers of these encoding/decoding functions extract these parameters
from str/unicode objects themselves most of the time, so it could speed
up error handling if these object were passed directly. As this again
requires changes to many C functions, this approach has been rejected.

For stream readers/writers the errors attribute must be changeable to be
able to switch between different error handling methods during the
lifetime of the stream reader/writer. This is currently the case for
codecs.StreamReader and codecs.StreamWriter and all their subclasses.
All core codecs and probably most of the third party codecs (e.g.
JapaneseCodecs) derive their stream readers/writers from these classes
so this already works, but the attribute errors should be documented as
a requirement.

Implementation Notes

A sample implementation is available as SourceForge patch #432401 [2]
including a script for testing the speed of various
string/encoding/error combinations and a test script.

Currently the new exception classes are old style Python classes. This
means that accessing attributes results in a dict lookup. The C API is
implemented in a way that makes it possible to switch to new style
classes behind the scene, if Exception (and UnicodeError) will be
changed to new style classes implemented in C for improved performance.

The class codecs.StreamReaderWriter uses the errors parameter for both
reading and writing. To be more flexible this should probably be changed
to two separate parameters for reading and writing.

The errors parameter of PyUnicode_TranslateCharmap is not availably to
Python, which makes testing of the new functionality of
PyUnicode_TranslateCharmap impossible with Python scripts. The patch
should add an optional argument errors to unicode.translate to expose
the functionality and make testing possible.

Codecs that do something different than encoding/decoding from/to
unicode and want to use the new machinery can define their own exception
classes and the strict handlers will automatically work with it. The
other predefined error handlers are unicode specific and expect to get a
Unicode(Encode|Decode|Translate)Error exception object so they won't
work.

Backwards Compatibility

The semantics of unicode.encode with errors="replace" has changed: The
old version always stored a ? character in the output string even if no
character was mapped to ? in the mapping. With the proposed patch, the
replacement string from the callback will again be looked up in the
mapping dictionary. But as all supported encodings are ASCII based, and
thus map ? to ?, this should not be a problem in practice.

Illegal values for the errors argument raised ValueError before, now
they will raise LookupError.

References

Copyright

This document has been placed in the public domain.

[1] SF feature request #403100 "Multicharacter replacements in
PyUnicode_TranslateCharmap" https://bugs.python.org/issue403100

[2] SF patch #432401 "unicode encoding error callbacks"
https://bugs.python.org/issue432401