PEP: 450 Title: Adding A Statistics Module To The Standard Library
Version: $Revision$ Last-Modified: $Date$ Author: Steven D'Aprano
<steve@pearwood.info> Status: Final Type: Standards Track Content-Type:
text/x-rst Created: 01-Aug-2013 Python-Version: 3.4 Post-History:
13-Sep-2013

Abstract

This PEP proposes the addition of a module for common statistics
functions such as mean, median, variance and standard deviation to the
Python standard library. See also http://bugs.python.org/issue18606

Rationale

The proposed statistics module is motivated by the "batteries included"
philosophy towards the Python standard library. Raymond Hettinger and
other senior developers have requested a quality statistics library that
falls somewhere in between high-end statistics libraries and ad hoc
code.[1] Statistical functions such as mean, standard deviation and
others are obvious and useful batteries, familiar to any Secondary
School student. Even cheap scientific calculators typically include
multiple statistical functions such as:

-   mean
-   population and sample variance
-   population and sample standard deviation
-   linear regression
-   correlation coefficient

Graphing calculators aimed at Secondary School students typically
include all of the above, plus some or all of:

-   median
-   mode
-   functions for calculating the probability of random variables from
    the normal, t, chi-squared, and F distributions
-   inference on the mean

and others[2]. Likewise spreadsheet applications such as Microsoft
Excel, LibreOffice and Gnumeric include rich collections of statistical
functions[3].

In contrast, Python currently has no standard way to calculate even the
simplest and most obvious statistical functions such as mean. For those
who need statistical functions in Python, there are two obvious
solutions:

-   install numpy and/or scipy[4];
-   or use a Do It Yourself solution.

Numpy is perhaps the most full-featured solution, but it has a few
disadvantages:

-   It may be overkill for many purposes. The documentation for numpy
    even warns

      "It can be hard to know what functions are available in numpy.
      This is not a complete list, but it does cover most of them."[5]

    and then goes on to list over 270 functions, only a small number of
    which are related to statistics.

-   Numpy is aimed at those doing heavy numerical work, and may be
    intimidating to those who don't have a background in computational
    mathematics and computer science. For example, numpy.mean takes four
    arguments:

        mean(a, axis=None, dtype=None, out=None)

    although fortunately for the beginner or casual numpy user, three
    are optional and numpy.mean does the right thing in simple cases:

        >>>  numpy.mean([1, 2, 3, 4])
        2.5

-   For many people, installing numpy may be difficult or impossible.
    For example, people in corporate environments may have to go through
    a difficult, time-consuming process before being permitted to
    install third-party software. For the casual Python user, having to
    learn about installing third-party packages in order to average a
    list of numbers is unfortunate.

This leads to option number 2, DIY statistics functions. At first
glance, this appears to be an attractive option, due to the apparent
simplicity of common statistical functions. For example:

    def mean(data):
        return sum(data)/len(data)

    def variance(data):
        # Use the Computational Formula for Variance.
        n = len(data)
        ss = sum(x**2 for x in data) - (sum(data)**2)/n
        return ss/(n-1)

    def standard_deviation(data):
        return math.sqrt(variance(data))

The above appears to be correct with a casual test:

    >>> data = [1, 2, 4, 5, 8]
    >>> variance(data)
    7.5

But adding a constant to every data point should not change the
variance:

    >>> data = [x+1e12 for x in data]
    >>> variance(data)
    0.0

And variance should never be negative:

    >>> variance(data*100)
    -1239429440.1282566

By contrast, the proposed reference implementation gets the exactly
correct answer 7.5 for the first two examples, and a reasonably close
answer for the third: 6.012. numpy does no better[6].

Even simple statistical calculations contain traps for the unwary,
starting with the Computational Formula itself. Despite the name, it is
numerically unstable and can be extremely inaccurate, as can be seen
above. It is completely unsuitable for computation by computer[7]. This
problem plagues users of many programming language, not just Python[8],
as coders reinvent the same numerically inaccurate code over and over
again[9], or advise others to do so[10].

It isn't just the variance and standard deviation. Even the mean is not
quite as straightforward as it might appear. The above implementation
seems too simple to have problems, but it does:

-   The built-in sum can lose accuracy when dealing with floats of
    wildly differing magnitude. Consequently, the above naive mean fails
    this "torture test":

        assert mean([1e30, 1, 3, -1e30]) == 1

    returning 0 instead of 1, a purely computational error of 100%.

-   Using math.fsum inside mean will make it more accurate with float
    data, but it also has the side-effect of converting any arguments to
    float even when unnecessary. E.g. we should expect the mean of a
    list of Fractions to be a Fraction, not a float.

While the above mean implementation does not fail quite as
catastrophically as the naive variance does, a standard library function
can do much better than the DIY versions.

The example above involves an especially bad set of data, but even for
more realistic data sets accuracy is important. The first step in
interpreting variation in data (including dealing with ill-conditioned
data) is often to standardize it to a series with variance 1 (and often
mean 0). This standardization requires accurate computation of the mean
and variance of the raw series. Naive computation of mean and variance
can lose precision very quickly. Because precision bounds accuracy, it
is important to use the most precise algorithms for computing mean and
variance that are practical, or the results of standardization are
themselves useless.

Comparison To Other Languages/Packages

The proposed statistics library is not intended to be a competitor to
such third-party libraries as numpy/scipy, or of proprietary
full-featured statistics packages aimed at professional statisticians
such as Minitab, SAS and Matlab. It is aimed at the level of graphing
and scientific calculators.

Most programming languages have little or no built-in support for
statistics functions. Some exceptions:

R

R (and its proprietary cousin, S) is a programming language designed for
statistics work. It is extremely popular with statisticians and is
extremely feature-rich[11].

C#

The C# LINQ package includes extension methods to calculate the average
of enumerables[12].

Ruby

Ruby does not ship with a standard statistics module, despite some
apparent demand[13]. Statsample appears to be a feature-rich third-party
library, aiming to compete with R[14].

PHP

PHP has an extremely feature-rich (although mostly undocumented) set of
advanced statistical functions[15].

Delphi

Delphi includes standard statistical functions including Mean, Sum,
Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].

GNU Scientific Library

The GNU Scientific Library includes standard statistical functions,
percentiles, median and others[17]. One innovation I have borrowed from
the GSL is to allow the caller to optionally specify the pre-calculated
mean of the sample (or an a priori known population mean) when
calculating the variance and standard deviation[18].

Design Decisions Of The Module

My intention is to start small and grow the library as needed, rather
than try to include everything from the start. Consequently, the current
reference implementation includes only a small number of functions:
mean, variance, standard deviation, median, mode. (See the reference
implementation for a full list.)

I have aimed for the following design features:

-   Correctness over speed. It is easier to speed up a correct but slow
    function than to correct a fast but buggy one.
-   Concentrate on data in sequences, allowing two-passes over the data,
    rather than potentially compromise on accuracy for the sake of a
    one-pass algorithm. Functions expect data will be passed as a list
    or other sequence; if given an iterator, they may internally convert
    to a list.
-   Functions should, as much as possible, honour any type of numeric
    data. E.g. the mean of a list of Decimals should be a Decimal, not a
    float. When this is not possible, treat float as the "lowest common
    data type".
-   Although functions support data sets of floats, Decimals or
    Fractions, there is no guarantee that mixed data sets will be
    supported. (But on the other hand, they aren't explicitly rejected
    either.)
-   Plenty of documentation, aimed at readers who understand the basic
    concepts but may not know (for example) which variance they should
    use (population or sample?). Mathematicians and statisticians have a
    terrible habit of being inconsistent with both notation and
    terminology[19], and having spent many hours making sense of the
    contradictory/confusing definitions in use, it is only fair that I
    do my best to clarify rather than obfuscate the topic.
-   But avoid going into tedious[20] mathematical detail.

API

The initial version of the library will provide univariate (single
variable) statistics functions. The general API will be based on a
functional model function(data, ...) -> result, where data is a
mandatory iterable of (usually) numeric data.

The author expects that lists will be the most common data type used,
but any iterable type should be acceptable. Where necessary, functions
may convert to lists internally. Where possible, functions are expected
to conserve the type of the data values, for example, the mean of a list
of Decimals should be a Decimal rather than float.

Calculating mean, median and mode

The mean, median* and mode functions take a single mandatory argument
and return the appropriate statistic, e.g.:

    >>> mean([1, 2, 3])
    2.0

Functions provided are:

-   

    mean(data)

        arithmetic mean of data.

-   

    median(data)

        median (middle value) of data, taking the average of the two
        middle values when there are an even number of values.

-   

    median_high(data)

        high median of data, taking the larger of the two middle values
        when the number of items is even.

-   

    median_low(data)

        low median of data, taking the smaller of the two middle values
        when the number of items is even.

-   

    median_grouped(data, interval=1)

        50th percentile of grouped data, using interpolation.

-   

    mode(data)

        most common data point.

mode is the sole exception to the rule that the data argument must be
numeric. It will also accept an iterable of nominal data, such as
strings.

Calculating variance and standard deviation

In order to be similar to scientific calculators, the statistics module
will include separate functions for population and sample variance and
standard deviation. All four functions have similar signatures, with a
single mandatory argument, an iterable of numeric data, e.g.:

    >>> variance([1, 2, 2, 2, 3])
    0.5

All four functions also accept a second, optional, argument, the mean of
the data. This is modelled on a similar API provided by the GNU
Scientific Library[21]. There are three use-cases for using this
argument, in no particular order:

1)  The value of the mean is known a priori.
2)  You have already calculated the mean, and wish to avoid calculating
    it again.
3)  You wish to (ab)use the variance functions to calculate the second
    moment about some given point other than the mean.

In each case, it is the caller's responsibility to ensure that given
argument is meaningful.

Functions provided are:

-   

    variance(data, xbar=None)

        sample variance of data, optionally using xbar as the sample
        mean.

-   

    stdev(data, xbar=None)

        sample standard deviation of data, optionally using xbar as the
        sample mean.

-   

    pvariance(data, mu=None)

        population variance of data, optionally using mu as the
        population mean.

-   

    pstdev(data, mu=None)

        population standard deviation of data, optionally using mu as
        the population mean.

Other functions

There is one other public function:

-   

    sum(data, start=0)

        high-precision sum of numeric data.

Specification

As the proposed reference implementation is in pure Python, other Python
implementations can easily make use of the module unchanged, or adapt it
as they see fit.

What Should Be The Name Of The Module?

This will be a top-level module statistics.

There was some interest in turning math into a package, and making this
a sub-module of math, but the general consensus eventually agreed on a
top-level module. Other potential but rejected names included stats (too
much risk of confusion with existing stat module), and statslib
(described as "too C-like").

Discussion And Resolved Issues

This proposal has been previously discussed here[22].

A number of design issues were resolved during the discussion on
Python-Ideas and the initial code review. There was a lot of concern
about the addition of yet another sum function to the standard library,
see the FAQs below for more details. In addition, the initial
implementation of sum suffered from some rounding issues and other
design problems when dealing with Decimals. Oscar Benjamin's assistance
in resolving this was invaluable.

Another issue was the handling of data in the form of iterators. The
first implementation of variance silently swapped between a one- and
two-pass algorithm, depending on whether the data was in the form of an
iterator or sequence. This proved to be a design mistake, as the
calculated variance could differ slightly depending on the algorithm
used, and variance etc. were changed to internally generate a list and
always use the more accurate two-pass implementation.

One controversial design involved the functions to calculate median,
which were implemented as attributes on the median callable, e.g.
median, median.low, median.high etc. Although there is at least one
existing use of this style in the standard library, in unittest.mock,
the code reviewers felt that this was too unusual for the standard
library. Consequently, the design has been changed to a more traditional
design of separate functions with a pseudo-namespace naming convention,
median_low, median_high, etc.

Another issue that was of concern to code reviewers was the existence of
a function calculating the sample mode of continuous data, with some
people questioning the choice of algorithm, and whether it was a
sufficiently common need to be included. So it was dropped from the API,
and mode now implements only the basic schoolbook algorithm based on
counting unique values.

Another significant point of discussion was calculating statistics of
timedelta objects. Although the statistics module will not directly
support timedelta objects, it is possible to support this use-case by
converting them to numbers first using the timedelta.total_seconds
method.

Frequently Asked Questions

Shouldn't this module spend time on PyPI before being considered for the standard library?

Older versions of this module have been available on PyPI[23] since
2010. Being much simpler than numpy, it does not require many years of
external development.

Does the standard library really need yet another version of sum?

This proved to be the most controversial part of the reference
implementation. In one sense, clearly three sums is two too many. But in
another sense, yes. The reasons why the two existing versions are
unsuitable are described here[24] but the short summary is:

-   the built-in sum can lose precision with floats;
-   the built-in sum accepts any non-numeric data type that supports
    the + operator, apart from strings and bytes;
-   math.fsum is high-precision, but coerces all arguments to float.

There was some interest in "fixing" one or the other of the existing
sums. If this occurs before 3.4 feature-freeze, the decision to keep
statistics.sum can be re-considered.

Will this module be backported to older versions of Python?

The module currently targets 3.3, and I will make it available on PyPI
for 3.3 for the foreseeable future. Backporting to older versions of the
3.x series is likely (but not yet decided). Backporting to 2.7 is less
likely but not ruled out.

Is this supposed to replace numpy?

No. While it is likely to grow over the years (see open issues below) it
is not aimed to replace, or even compete directly with, numpy. Numpy is
a full-featured numeric library aimed at professionals, the nuclear
reactor of numeric libraries in the Python ecosystem. This is just a
battery, as in "batteries included", and is aimed at an intermediate
level somewhere between "use numpy" and "roll your own version".

Future Work

-   At this stage, I am unsure of the best API for multivariate
    statistical functions such as linear regression, correlation
    coefficient, and covariance. Possible APIs include:

    -   Separate arguments for x and y data:

            function([x0, x1, ...], [y0, y1, ...])

    -   A single argument for (x, y) data:

            function([(x0, y0), (x1, y1), ...])

        This API is preferred by GvR[25].

    -   Selecting arbitrary columns from a 2D array:

            function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)

    -   Some combination of the above.

    In the absence of a consensus of preferred API for multivariate
    stats, I will defer including such multivariate functions until
    Python 3.5.

-   Likewise, functions for calculating probability of random variables
    and inference testing (e.g. Student's t-test) will be deferred until
    3.5.

-   There is considerable interest in including one-pass functions that
    can calculate multiple statistics from data in iterator form,
    without having to convert to a list. The experimental stats package
    on PyPI includes co-routine versions of statistics functions.
    Including these will be deferred to 3.5.

References

Copyright

This document has been placed in the public domain.

[1] https://mail.python.org/pipermail/python-dev/2010-October/104721.html

[2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf

[3] Gnumeric:: https://projects.gnome.org/gnumeric/functions.shtml

LibreOffice:
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five

[4] Scipy: http://scipy-central.org/ Numpy: http://www.numpy.org/

[5] http://wiki.scipy.org/Numpy_Functions_by_Category

[6] Tested with numpy 1.6.1 and Python 2.7.

[7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/

[8] http://rosettacode.org/wiki/Standard_deviation

[9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py

[10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration

[11] http://www.r-project.org/

[12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx

[13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation

[14] http://ruby-statsample.rubyforge.org/

[15] http://www.php.net/manual/en/ref.stats.php

[16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.

[17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html

[18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html

[19] http://mathworld.wolfram.com/Skewness.html

[20] At least, tedious to those who don't like this sort of thing.

[21] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html

[22] https://mail.python.org/pipermail/python-ideas/2011-September/011524.html

[23] https://pypi.python.org/pypi/stats/

[24] https://mail.python.org/pipermail/python-ideas/2013-August/022630.html

[25] https://mail.python.org/pipermail/python-dev/2013-September/128429.html