PEP: 305 Title: CSV File API Version: $Revision$ Last-Modified: $Date$
Author: Kevin Altis <altis@semi-retired.com>, Dave Cole
<djc@object-craft.com.au>, Andrew McNamara
<andrewm@object-craft.com.au>, Skip Montanaro <skip@pobox.com>, Cliff
Wells <LogiplexSoftware@earthlink.net> Discussions-To: csv@python.org
Status: Final Type: Standards Track Content-Type: text/x-rst Created:
26-Jan-2003 Python-Version: 2.3 Post-History: 31-Jan-2003, 13-Feb-2003

Abstract

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases. Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like line.split(",") is eventually bound to fail.
This PEP defines an API for reading and writing CSV files. It is
accompanied by a corresponding module which implements the API.

To Do (Notes for the Interested and Ambitious)

-   Better motivation for the choice of passing a file object to the
    constructors. See
    https://mail.python.org/pipermail/csv/2003-January/000179.html
-   Unicode. ugh.

Application Domain

This PEP is about doing one thing well: parsing tabular data which may
use a variety of field separators, quoting characters, quote escape
mechanisms and line endings. The authors intend the proposed module to
solve this one parsing problem efficiently. The authors do not intend to
address any of these related topics:

-   data interpretation (is a field containing the string "10" supposed
    to be a string, a float or an int? is it a number in base 10, base
    16 or base 2? is a number in quotes a number or a string?)
-   locale-specific data representation (should the number 1.23 be
    written as "1.23" or "1,23" or "1 23"?) -- this may eventually be
    addressed.
-   fixed width tabular data - can already be parsed reliably.

Rationale

Often, CSV files are formatted simply enough that you can get by reading
them line-by-line and splitting on the commas which delimit the fields.
This is especially true if all the data being read is numeric. This
approach may work for a while, then come back to bite you in the butt
when somebody puts something unexpected in the data like a comma. As you
dig into the problem you may eventually come to the conclusion that you
can solve the problem using regular expressions. This will work for a
while, then break mysteriously one day. The problem grows, so you dig
deeper and eventually realize that you need a purpose-built parser for
the format.

CSV formats are not well-defined and different implementations have a
number of subtle corner cases. It has been suggested that the "V" in the
acronym stands for "Vague" instead of "Values". Different delimiters and
quoting characters are just the start. Some programs generate whitespace
after each delimiter which is not part of the following field. Others
quote embedded quoting characters by doubling them, others by prefixing
them with an escape character. The list of weird ways to do things can
seem endless.

All this variability means it is difficult for programmers to reliably
parse CSV files from many sources or generate CSV files designed to be
fed to specific external programs without a thorough understanding of
those sources and programs. This PEP and the software which accompany it
attempt to make the process less fragile.

Existing Modules

This problem has been tackled before. At least three modules currently
available in the Python community enable programmers to read and write
CSV files:

-   Object Craft's CSV module[1]
-   Cliff Wells' Python-DSV module[2]
-   Laurence Tratt's ASV module[3]

Each has a different API, making it somewhat difficult for programmers
to switch between them. More of a problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting the
differences between the different module APIs, the programmer has to
also deal with semantic differences between the packages.

Module Interface

This PEP supports three basic APIs, one to read and parse CSV files, one
to write them, and one to identify different CSV dialects to the readers
and writers.

Reading CSV Files

CSV readers are created with the reader factory function:

    obj = reader(iterable [, dialect='excel']
                 [optional keyword args])

A reader object is an iterator which takes an iterable object returning
lines as the sole required parameter. If it supports a binary mode (file
objects do), the iterable argument to the reader function must have been
opened in binary mode. This gives the reader object full control over
the interpretation of the file's contents. The optional dialect
parameter is discussed below. The reader function also accepts several
optional keyword arguments which define specific format settings for the
parser (see the section "Formatting Parameters"). Readers are typically
used as follows:

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

Each row returned by a reader object is a list of strings or Unicode
objects.

When both a dialect parameter and individual formatting parameters are
passed to the constructor, first the dialect is queried for formatting
parameters, then individual formatting parameters are examined.

Writing CSV Files

Creating writers is similar:

    obj = writer(fileobj [, dialect='excel'],
                 [optional keyword args])

A writer object is a wrapper around a file-like object opened for
writing in binary mode (if such a distinction is made). It accepts the
same optional keyword parameters as the reader constructor.

Writers are typically used as follows:

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.writerow(row)

To generate a set of field names as the first row of the CSV file, the
programmer must explicitly write it, e.g.:

    csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
    csvwriter.write(names)
    for row in someiterable:
        csvwriter.write(row)

or arrange for it to be the first row in the iterable being written.

Managing Different Dialects

Because CSV is a somewhat ill-defined format, there are plenty of ways
one CSV file can differ from another, yet contain exactly the same data.
Many tools which can import or export tabular data allow the user to
indicate the field delimiter, quote character, line terminator, and
other characteristics of the file. These can be fairly easily
determined, but are still mildly annoying to figure out, and make for
fairly long function calls when specified individually.

To try and minimize the difficulty of figuring out and specifying a
bunch of formatting parameters, reader and writer objects support a
dialect argument which is just a convenient handle on a group of these
lower level parameters. When a dialect is given as a string it
identifies one of the dialects known to the module via its registration
functions, otherwise it must be an instance of the Dialect class as
described below.

Dialects will generally be named after applications or organizations
which define specific sets of format constraints. Two dialects are
defined in the module as of this writing, "excel", which describes the
default format constraints for CSV file export by Excel 97 and Excel
2000, and "excel-tab", which is the same as "excel" but specifies an
ASCII TAB character as the field delimiter.

Dialects are implemented as attribute only classes to enable users to
construct variant dialects by subclassing. The "excel" dialect is a
subclass of Dialect and is defined as follows:

    class Dialect:
        # placeholders
        delimiter = None
        quotechar = None
        escapechar = None
        doublequote = None
        skipinitialspace = None
        lineterminator = None
        quoting = None

    class excel(Dialect):
        delimiter = ','
        quotechar = '"'
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

The "excel-tab" dialect is defined as:

    class exceltsv(excel):
        delimiter = '\t'

(For a description of the individual formatting parameters see the
section "Formatting Parameters".)

To enable string references to specific dialects, the module defines
several functions:

    dialect = get_dialect(name)
    names = list_dialects()
    register_dialect(name, dialect)
    unregister_dialect(name)

get_dialect() returns the dialect instance associated with the given
name. list_dialects() returns a list of all registered dialect names.
register_dialects() associates a string name with a dialect class.
unregister_dialect() deletes a name/dialect association.

Formatting Parameters

In addition to the dialect argument, both the reader and writer
constructors take several specific formatting parameters, specified as
keyword parameters. The formatting parameters understood are:

-   quotechar specifies a one-character string to use as the quoting
    character. It defaults to '"'. Setting this to None has the same
    effect as setting quoting to csv.QUOTE_NONE.
-   delimiter specifies a one-character string to use as the field
    separator. It defaults to ','.
-   escapechar specifies a one-character string used to escape the
    delimiter when quotechar is set to None.
-   skipinitialspace specifies how to interpret whitespace which
    immediately follows a delimiter. It defaults to False, which means
    that whitespace immediately following a delimiter is part of the
    following field.
-   lineterminator specifies the character sequence which should
    terminate rows.
-   quoting controls when quotes should be generated by the writer. It
    can take on any of the following module constants:
    -   csv.QUOTE_MINIMAL means only when required, for example, when a
        field contains either the quotechar or the delimiter
    -   csv.QUOTE_ALL means that quotes are always placed around fields.
    -   csv.QUOTE_NONNUMERIC means that quotes are always placed around
        nonnumeric fields.
    -   csv.QUOTE_NONE means that quotes are never placed around fields.
-   doublequote controls the handling of quotes inside fields. When True
    two consecutive quotes are interpreted as one during read, and when
    writing, each quote is written as two quotes.

When processing a dialect setting and one or more of the other optional
parameters, the dialect parameter is processed before the individual
formatting parameters. This makes it easy to choose a dialect, then
override one or more of the settings without defining a new dialect
class. For example, if a CSV file was generated by Excel 2000 using
single quotes as the quote character and a colon as the delimiter, you
could create a reader like:

    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter=':')

Other details of how Excel generates CSV files would be handled
automatically because of the reference to the "excel" dialect.

Reader Objects

Reader objects are iterables whose next() method returns a sequence of
strings, one string per field in the row.

Writer Objects

Writer objects have two methods, writerow() and writerows(). The former
accepts an iterable (typically a list) of fields which are to be written
to the output. The latter accepts a list of iterables and calls
writerow() for each.

Implementation

There is a sample implementation available.[4] The goal is for it to
efficiently implement the API described in the PEP. It is heavily based
on the Object Craft csv module.[5]

Testing

The sample implementation[6] includes a set of test cases.

Issues

1.  Should a parameter control how consecutive delimiters are
    interpreted? Our thought is "no". Consecutive delimiters should
    always denote an empty field.

2.  What about Unicode? Is it sufficient to pass a file object gotten
    from codecs.open()? For example:

        csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))

        csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

    In the first example, text would be assumed to be encoded as cp1252.
    Should the system be aggressive in converting to Unicode or should
    Unicode strings only be returned if necessary?

    In the second example, the file will take care of automatically
    encoding Unicode strings as utf-8 before writing to disk.

    Note: As of this writing, the csv module doesn't handle Unicode
    data.

3.  What about alternate escape conventions? If the dialect in use
    includes an escapechar parameter which is not None and the quoting
    parameter is set to QUOTE_NONE, delimiters appearing within fields
    will be prefixed by the escape character when writing and are
    expected to be prefixed by the escape character when reading.

4.  Should there be a "fully quoted" mode for writing? What about "fully
    quoted except for numeric values"? Both are implemented (QUOTE_ALL
    and QUOTE_NONNUMERIC, respectively).

5.  What about end-of-line? If I generate a CSV file on a Unix system,
    will Excel properly recognize the LF-only line terminators? Files
    must be opened for reading or writing as appropriate using binary
    mode. Specify the lineterminator sequence as '\r\n'. The resulting
    file will be written correctly.

6.  What about an option to generate dicts from the reader and accept
    dicts by the writer? See the DictReader and DictWriter classes in
    csv.py.

7.  Are quote character and delimiters limited to single characters? For
    the time being, yes.

8.  How should rows of different lengths be handled? Interpretation of
    the data is the application's job. There is no such thing as a
    "short row" or a "long row" at this level.

References

There are many references to other CSV-related projects on the Web. A
few are included here.

Copyright

This document has been placed in the public domain.



  Local Variables: mode: indented-text indent-tabs-mode: nil
  sentence-end-double-space: t fill-column: 70 End:

[1] csv module, Object Craft
(http://www.object-craft.com.au/projects/csv)

[2] Python-DSV module, Wells
(http://sourceforge.net/projects/python-dsv/)

[3] ASV module, Tratt (http://tratt.net/laurie/python/asv/)

[4] csv module, Python Sandbox
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)

[5] csv module, Object Craft
(http://www.object-craft.com.au/projects/csv)

[6] csv module, Python Sandbox
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)