PEP: 278
Title: Universal Newline Support
Author: Jack Jansen <jack@cwi.nl>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 14-Jan-2002
Python-Version: 2.3
Post-History:

Abstract

This PEP discusses a way in which Python can support I/O on files which
have a newline format that is not the native format on the platform, so
that Python on each platform can read and import files with CR
(Macintosh), LF (Unix) or CR LF (Windows) line endings.

It is more and more common to come across files that have an end of line
that does not match the standard on the current platform: files
downloaded over the net, remotely mounted filesystems on a different
platform, Mac OS X with its double standard of Mac and Unix line
endings, etc.

Many tools such as editors and compilers already handle this gracefully;
it would be good if Python did so too.

Specification

Universal newline support is enabled by default, but can be disabled
during the configure step of the Python build.

In a Python with universal newline support the feature is automatically
enabled for all import statements and execfile() calls. There is no
special support for eval() or exec.

In a Python with universal newline support, the mode parameter of open()
can also be "U", meaning "open for input as a text file with universal
newline interpretation". Mode "rU" is also allowed, for symmetry with
"rb". Mode "U" cannot be combined with other mode flags such as "+". Any
line ending in the input file will be seen as a '\n' in Python, so
little other code has to change to handle universal newlines.

Conversion of newlines happens in all calls that read data: read(),
readline(), readlines(), etc.
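
As a rough illustration of the translation described above, the following sketch maps all three conventions to '\n' on an in-memory string. It is plain Python and not the C implementation this PEP proposes; the helper name translate_newlines is hypothetical.

```python
def translate_newlines(data):
    """Map CR LF, CR and LF line endings to a single '\\n'.

    Hypothetical helper for illustration only; the real translation
    happens in C inside the file object.
    """
    # Replace CR LF first so it is not turned into two newlines.
    return data.replace("\r\n", "\n").replace("\r", "\n")


mixed = "mac\rwindows\r\nunix\n"
lines = translate_newlines(mixed).splitlines(True)
```

Handling CR LF before a lone CR matters: doing it the other way round would yield an empty extra line for every Windows line ending.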

There is no special support for output to a file with a different newline
convention, and so mode "wU" is also illegal.

A file object that has been opened in universal newline mode gets a new
attribute "newlines" which reflects the newline convention used in the
file. The value for this attribute is one of None (no newline read yet),
"\r", "\n", "\r\n" or a tuple containing all the newline types seen.
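
The sketch below shows how the attribute might be observed while reading. It assumes Python 3, where mode "U" is no longer accepted and universal newline translation is requested with newline=None (the default for text mode); the temporary file and its contents are made up for the example.

```python
import os
import tempfile

# Write a file that mixes all three conventions (binary mode, no translation).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"mac\rwindows\r\nunix\n")

# Assumption: Python 3, where newline=None enables universal newlines on read.
with open(path, "r", newline=None, encoding="ascii") as f:
    before = f.newlines   # nothing has been read yet
    content = f.read()    # every line ending arrives as "\n"
    seen = f.newlines     # all conventions encountered so far

os.remove(path)
```

Because the file mixes conventions, seen is a tuple here; for a file with a single convention it would be the corresponding string.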

Rationale

Universal newline support is implemented in C, not in Python. This is
done because we want files with a foreign newline convention to be
import-able, so a Python Lib directory can be shared over a remote file
system connection, or between MacPython and Unix-Python on Mac OS X. For
this to be feasible the universal newline convention needs to have a
reasonably small impact on performance, which means a Python
implementation is not an option as it would bog down all imports. And
because of files with multiple newline conventions, which Visual C++ and
other Windows tools will happily produce, doing a quick check for the
newlines used in a file (handing off the import to C code if a
platform-local newline is seen) will not work. Finally, a C
implementation also allows tracebacks and such (which open the Python
source module) to be handled easily.

There is no output implementation of universal newlines; Python programs
are expected to handle this by themselves or otherwise write files with
the platform-local convention. The reason for this is that input is the
difficult case; outputting different newlines to a file is already easy
enough in Python.
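
To make the "easy enough" claim concrete, here is a minimal sketch of producing a chosen line ending by hand. The helper with_line_ending is hypothetical, and it assumes the text uses '\n' internally, as text read in universal newline mode would.

```python
def with_line_ending(text, terminator):
    """Return text with every '\\n' replaced by the given terminator.

    Hypothetical helper; assumes '\\n' is the only line ending in text.
    """
    return text.replace("\n", terminator)


text = "first\nsecond\n"
dos_bytes = with_line_ending(text, "\r\n").encode("ascii")
mac_bytes = with_line_ending(text, "\r").encode("ascii")
# The bytes would then be written to a file opened in binary mode ("wb"),
# so the platform does not apply its own newline translation on top.
```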

Also, an output implementation would be much more difficult than an
input implementation, surprisingly: a lot of output is done through
PyXXX_Print() methods, and at this point the file object is not
available anymore, only a FILE *. So, an output implementation would
need to somehow go from the FILE* to the file object, because that is
where the current newline delimiter is stored.

The input implementation has no such problem: there are no cases in the
Python source tree where files are partially read from C, partially from
Python, and such cases are expected to be rare in extension modules. If
such cases exist the only problem is that the newlines attribute of the
file object is not updated during the fread() or fgets() calls that are
done directly from C.

A partial output implementation, in which strings passed to fp.write()
would be converted to use fp.newlines as their line terminator but all
other output would not, is far too surprising in my view.

Because there is no output support for universal newlines there is also
no support for a mode "rU+": the surprise factor of the previous
paragraph would hold to an even stronger degree.

There is no support for universal newlines in strings passed to eval()
or exec. It is envisioned that such strings always have the standard \n
line feed; if the strings come from a file, that file can be read with
universal newlines.

I think there are no special issues with Unicode. UTF-16 shouldn't pose
any new problems, as such files need to be opened in binary mode anyway.
Interaction with UTF-8 is fine too: values 0x0a and 0x0d cannot occur as
part of a multibyte sequence.
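
The UTF-8 claim can be checked by brute force. The sketch below assumes Python 3 string and bytes semantics and uses a hypothetical function name; it encodes every code point outside ASCII and looks for CR or LF bytes, then shows one UTF-16 counterexample.

```python
CR, LF = 0x0D, 0x0A


def multibyte_utf8_is_newline_free():
    """Check that no multibyte UTF-8 sequence contains a CR or LF byte."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        encoded = chr(cp).encode("utf-8")
        if CR in encoded or LF in encoded:
            return False
    return True


utf8_safe = multibyte_utf8_is_newline_free()

# UTF-16 is different: U+010A encodes to the bytes 0x0A 0x01 in
# little-endian order, so a byte-level newline scan would misfire.
utf16_bytes = "\u010a".encode("utf-16-le")
```

The loop succeeds because every byte of a multibyte UTF-8 sequence has its high bit set, whereas UTF-16 code units can contain 0x0a or 0x0d as one of their two bytes.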

Universal newline files should work fine with iterators and xreadlines()
as these eventually call the normal file readline/readlines methods.

While universal newlines are automatically enabled for import, they are
not for opening, where you have to specifically say open(..., "U"). This
is open to debate, but here are a few reasons for this design:

-   Compatibility. Programs which already do their own interpretation of
    \r\n in text files would break. Examples of such programs would be
    editors which warn you when you open a file with a different newline
    convention. If universal newlines were made the default, such an
    editor would silently convert your line endings to the local
    convention on save. Programs which open binary files as text files
    on Unix would also break (but it could be argued they deserve it
    :-).
-   Interface clarity. Universal newlines are only supported for input
    files, not for input/output files, as the semantics would become
    muddy. Would you write Mac newlines if all reads so far had
    encountered Mac newlines? But what if you then later read a Unix
    newline?

The newlines attribute is included so that programs that really care
about the newline convention, such as text editors, can examine what was
in a file. They can then save (a copy of) the file with the same newline
convention (or, in case of a file with mixed newlines, ask the user what
to do, or output in platform convention).
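
One way an editor-like program might use the attribute is sketched below. It assumes Python 3, where newline=None enables universal newline reading and newline=<terminator> controls what is written; choose_terminator and its fallback policy are hypothetical, and a real editor might ask the user instead.

```python
import os
import tempfile


def choose_terminator(newlines, fallback=os.linesep):
    """Pick the terminator to save with, given a file's newlines value.

    Hypothetical policy: reuse a single observed convention, otherwise
    (nothing seen, or mixed conventions) fall back to the platform one.
    """
    if isinstance(newlines, str):
        return newlines
    return fallback


# Create a sample file with Windows line endings.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\r\ntwo\r\n")

# Assumption: Python 3, where newline=None gives universal newline reading.
with open(path, "r", newline=None, encoding="ascii") as f:
    text = f.read()
    terminator = choose_terminator(f.newlines)

edited = text + "three\n"

# newline=terminator makes the text layer write that terminator for each "\n".
with open(path, "w", newline=terminator, encoding="ascii") as f:
    f.write(edited)

with open(path, "rb") as f:
    saved = f.read()
os.remove(path)
```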

Feedback is explicitly solicited on one item in the reference
implementation: whether or not the universal newlines routines should
grab the global interpreter lock. Currently they do not, but this could
be considered living dangerously, as they may modify fields in a
FileObject. But as these routines are replacements for fgets() and
fread() as well it may be difficult to decide whether or not the lock is
held when the routine is called. Moreover, the only danger is that if
two threads read the same FileObject at the same time an extraneous
newline may be seen or the newlines attribute may inadvertently be set
to mixed. I would argue that if you read the same FileObject in two
threads simultaneously you are asking for trouble anyway.

Note that no globally accessible pointers are manipulated in the fgets()
or fread() replacement routines, just some integer-valued flags, so the
chances of core dumps are zero (he said:-).

Universal newline support can be disabled during configure because it
does have a small performance penalty, and moreover the implementation
has not been tested on all conceivable platforms yet. It might also be
silly on some platforms (WinCE or Palm devices, for instance). If
universal newline support is not enabled then file objects do not have
the newlines attribute, so testing whether the current Python has it can
be done with a simple:

    if hasattr(open, 'newlines'):
        print 'We have universal newline support'

Note that this test uses the open() function rather than the file type
so that it won't fail for versions of Python where the file type was not
available (the file type was added to the built-in namespace in the same
release as the universal newline feature was added).

Additionally, note that this test fails again on Python versions >= 2.5,
when open() was made a function again and is not synonymous with the
file type anymore.
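
A check that does not rely on open() being the file type could inspect an open file object instead. This is only a sketch under assumptions: the helper name is hypothetical, and it relies on os.devnull being available, which is not the case on every Python version this PEP covers.

```python
import os


def has_universal_newlines():
    """Report whether file objects expose a 'newlines' attribute.

    Hypothetical helper: it inspects an open file object, so it does
    not depend on open() being the same thing as the file type.
    """
    f = open(os.devnull, "r")
    try:
        return hasattr(f, "newlines")
    finally:
        f.close()


supported = has_universal_newlines()
```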

Reference Implementation

A reference implementation is available in SourceForge patch #476814:
https://bugs.python.org/issue476814

References

None.

Copyright

This document has been placed in the public domain.