PEP: 481 Title: Migrate CPython to Git, Github, and Phabricator Author:
Donald Stufft <donald@stufft.io> Status: Withdrawn Type: Process
Content-Type: text/x-rst Created: 29-Nov-2014 Post-History: 29-Nov-2014

Abstract

Note

This PEP has been withdrawn, if you're looking for the PEP documenting
the move to Github, please refer to PEP 512.

This PEP proposes migrating the repository hosting of CPython and the
supporting repositories to Git and Github. It also proposes adding
Phabricator as an alternative to Github Pull Requests to handle
reviewing changes. This particular PEP is offered as an alternative to
PEP 474 and PEP 462 which aims to achieve the same overall benefits but
restricts itself to tools that support Mercurial and are completely Open
Source.

Rationale

CPython is an open source project which relies on a number of volunteers
donating their time. As an open source project it relies on attracting
new volunteers as well as retaining existing ones in order to continue
to have a healthy amount of manpower available. In addition to
increasing the amount of manpower that is available to the project, it
also needs to allow for effective use of what manpower is available.

The current toolchain of the CPython project is a custom and unique
combination of tools which mandates a workflow that is similar to one
found in a lot of older projects, but which is becoming less and less
popular as time goes on.

The one-off nature of the CPython toolchain and workflow means that any
new contributor is going to need spend time learning the tools and
workflow before they can start contributing to CPython. Once a new
contributor goes through the process of learning the CPython workflow
they also are unlikely to be able to take that knowledge and apply it to
future projects they wish to contribute to. This acts as a barrier to
contribution which will scare off potential new contributors.

In addition the tooling that CPython uses is under-maintained,
antiquated, and it lacks important features that enable committers to
more effectively use their time when reviewing and approving changes.
The fact that it is under-maintained means that bugs are likely to last
for longer, if they ever get fixed, as well as it's more likely to go
down for extended periods of time. The fact that it is antiquated means
that it doesn't effectively harness the capabilities of the modern web
platform. Finally the fact that it lacks several important features such
as a lack of pre-testing commits and the lack of an automatic merge tool
means that committers have to do needless busy work to commit even the
simplest of changes.

Version Control System

The first decision that needs to be made is the VCS of the primary
server side repository. Currently the CPython repository, as well as a
number of supporting repositories, uses Mercurial. When evaluating the
VCS we must consider the capabilities of the VCS itself as well as the
network effect and mindshare of the community around that VCS.

There are really only two real options for this, Mercurial and Git.
Between the two of them the technical capabilities are largely
equivalent. For this reason this PEP will largely ignore the technical
arguments about the VCS system and will instead focus on the social
aspects.

It is not possible to get exact numbers for the number of projects or
people which are using a particular VCS, however we can infer this by
looking at several sources of information for what VCS projects are
using.

The Open Hub (previously Ohloh) statistics[1] show that 37% of the
repositories indexed by The Open Hub are using Git (second only to SVN
which has 48%) while Mercurial has just 2% (beating only bazaar which
has 1%). This has Git being just over 18 times as popular as Mercurial
on The Open Hub.

Another source of information on the popular of the difference VCSs is
PyPI itself. This source is more targeted at the Python community itself
since it represents projects developed for Python. Unfortunately PyPI
does not have a standard location for representing this information, so
this requires manual processing. If we limit our search to the top 100
projects on PyPI (ordered by download counts) we can see that 62% of
them use Git while 22% of them use Mercurial while 13% use something
else. This has Git being just under 3 times as popular as Mercurial for
the top 100 projects on PyPI.

Obviously from these numbers Git is by far the more popular DVCS for
open source projects and choosing the more popular VCS has a number of
positive benefits.

For new contributors it increases the likelihood that they will have
already learned the basics of Git as part of working with another
project or if they are just now learning Git, that they'll be able to
take that knowledge and apply it to other projects. Additionally a
larger community means more people writing how to guides, answering
questions, and writing articles about Git which makes it easier for a
new user to find answers and information about the tool they are trying
to learn.

Another benefit is that by nature of having a larger community, there
will be more tooling written around it. This increases options for
everything from GUI clients, helper scripts, repository hosting, etc.

Repository Hosting

This PEP proposes allowing GitHub Pull Requests to be submitted, however
GitHub does not have a way to submit Pull Requests against a repository
that is not hosted on GitHub. This PEP also proposes that in addition to
GitHub Pull Requests Phabricator's Differential app can also be used to
submit proposed changes and Phabricator does allow submitting changes
against a repository that is not hosted on Phabricator.

For this reason this PEP proposes using GitHub as the canonical location
of the repository with a read-only mirror located in Phabricator. If at
some point in the future GitHub is no longer desired, then repository
hosting can easily be moved to solely in Phabricator and the ability to
accept GitHub Pull Requests dropped.

In addition to hosting the repositories on Github, a read only copy of
all repositories will also be mirrored onto the PSF Infrastructure.

Code Review

Currently CPython uses a custom fork of Rietveld which has been modified
to not run on Google App Engine which is really only able to be
maintained currently by one person. In addition it is missing out on
features that are present in many modern code review tools.

This PEP proposes allowing both Github Pull Requests and Phabricator
changes to propose changes and review code. It suggests both so that
contributors can select which tool best enables them to submit changes,
and reviewers can focus on reviewing changes in the tooling they like
best.

GitHub Pull Requests

GitHub is a very popular code hosting site and is increasingly becoming
the primary place people look to contribute to a project. Enabling users
to contribute through GitHub is enabling contributors to contribute
using tooling that they are likely already familiar with and if they are
not they are likely to be able to apply to another project.

GitHub Pull Requests have a fairly major advantage over the older
"submit a patch to a bug tracker" model. It allows developers to work
completely within their VCS using standard VCS tooling so it does not
require creating a patch file and figuring out what the right location
is to upload it to. This lowers the barrier for sending a change to be
reviewed.

On the reviewing side, GitHub Pull Requests are far easier to review,
they have nice syntax highlighted diffs which can operate in either
unified or side by side views. They allow expanding the context on a
diff up to and including the entire file. Finally they allow commenting
inline and on the pull request as a whole and they present that in a
nice unified way which will also hide comments which no longer apply.
Github also provides a "rendered diff" view which enables easily viewing
a diff of rendered markup (such as rst) instead of needing to review the
diff of the raw markup.

The Pull Request work flow also makes it trivial to enable the ability
to pre-test a change before actually merging it. Any particular pull
request can have any number of different types of "commit statuses"
applied to it, marking the commit (and thus the pull request) as either
in a pending, successful, errored, or failure state. This makes it easy
to see inline if the pull request is passing all of the tests, if the
contributor has signed a CLA, etc.

Actually merging a Github Pull Request is quite simple, a core reviewer
simply needs to press the "Merge" button once the status of all the
checks on the Pull Request are green for successful.

GitHub also has a good workflow for submitting pull requests to a
project completely through their web interface. This would enable the
Python documentation to have "Edit on GitHub" buttons on every page and
people who discover things like typos, inaccuracies, or just want to
make improvements to the docs they are currently writing can simply hit
that button and get an in browser editor that will let them make changes
and submit a pull request all from the comfort of their browser.

Phabricator

In addition to GitHub Pull Requests this PEP also proposes setting up a
Phabricator instance and pointing it at the GitHub hosted repositories.
This will allow utilizing the Phabricator review applications of
Differential and Audit.

Differential functions similarly to GitHub pull requests except that
they require installing the arc command line tool to upload patches to
Phabricator.

Whether to enable Phabricator for any particular repository can be
chosen on a case-by-case basis, this PEP only proposes that it must be
enabled for the CPython repository, however for smaller repositories
such as the PEP repository it may not be worth the effort.

Criticism

X is not written in Python

One feature that the current tooling (Mercurial, Rietveld) has is that
the primary language for all of the pieces are written in Python. It is
this PEPs belief that we should focus on the best tools for the job and
not the best tools that happen to be written in Python. Volunteer time
is a precious resource to any open source project and we can best
respect and utilize that time by focusing on the benefits and downsides
of the tools themselves rather than what language their authors happened
to write them in.

One concern is the ability to modify tools to work for us, however one
of the Goals here is to not modify software to work for us and instead
adapt ourselves to a more standard workflow. This standardization pays
off in the ability to re-use tools out of the box freeing up developer
time to actually work on Python itself as well as enabling knowledge
sharing between projects.

However, if we do need to modify the tooling, Git itself is largely
written in C the same as CPython itself is. It can also have commands
written for it using any language, including Python. Phabricator is
written in PHP which is a fairly common language in the web world and
fairly easy to pick up. GitHub itself is largely written in Ruby but
given that it's not Open Source there is no ability to modify it so it's
implementation language is completely meaningless.

GitHub is not Free/Open Source

GitHub is a big part of this proposal and someone who tends more to
ideology rather than practicality may be opposed to this PEP on that
grounds alone. It is this PEPs belief that while using entirely
Free/Open Source software is an attractive idea and a noble goal, that
valuing the time of the contributors by giving them good tooling that is
well maintained and that they either already know or if they learn it
they can apply to other projects is a more important concern than
treating whether something is Free/Open Source is a hard requirement.

However, history has shown us that sometimes benevolent proprietary
companies can stop being benevolent. This is hedged against in a few
ways:

-   We are not utilizing the GitHub Issue Tracker, both because it is
    not powerful enough for CPython but also because for the primary
    CPython repository the ability to take our issues and put them
    somewhere else if we ever need to leave GitHub relies on GitHub
    continuing to allow API access.
-   We are utilizing the GitHub Pull Request workflow, however all of
    those changes live inside of Git. So a mirror of the GitHub
    repositories can easily contain all of those Pull Requests. We would
    potentially lose any comments if GitHub suddenly turned "evil", but
    the changes themselves would still exist.
-   We are utilizing the GitHub repository hosting feature, however
    since this is just git moving away from GitHub is as simple as
    pushing the repository to a different location. Data portability for
    the repository itself is extremely high.
-   We are also utilizing Phabricator to provide an alternative for
    people who do not wish to use GitHub. This also acts as a fallback
    option which will already be in place if we ever need to stop using
    GitHub.

Relying on GitHub comes with a number of benefits beyond just the
benefits of the platform itself. Since it is a commercially backed
venture it has a full-time staff responsible for maintaining its
services. This includes making sure they stay up, making sure they stay
patched for various security vulnerabilities, and further improving the
software and infrastructure as time goes on.

Mercurial is better than Git

Whether Mercurial or Git is better on a technical level is a highly
subjective opinion. This PEP does not state whether the mechanics of Git
or Mercurial is better and instead focuses on the network effect that is
available for either option. Since this PEP proposes switching to Git
this leaves the people who prefer Mercurial out, however those users can
easily continue to work with Mercurial by using the hg-git[2] extension
for Mercurial which will let it work with a repository which is Git on
the serverside.

CPython Workflow is too Complicated

One sentiment that came out of previous discussions was that the multi
branch model of CPython was too complicated for Github Pull Requests. It
is the belief of this PEP that statement is not accurate.

Currently any particular change requires manually creating a patch for
2.7 and 3.x which won't change at all in this regards.

If someone submits a fix for the current stable branch (currently 3.4)
the GitHub Pull Request workflow can be used to create, in the browser,
a Pull Request to merge the current stable branch into the master branch
(assuming there is no merge conflicts). If there is a merge conflict
that would need to be handled locally. This provides an improvement over
the current situation where the merge must always happen locally.

Finally if someone submits a fix for the current development branch
currently then this has to be manually applied to the stable branch if
it desired to include it there as well. This must also happen locally as
well in the new workflow, however for minor changes it could easily be
accomplished in the GitHub web editor.

Looking at this, I do not believe that any system can hide the
complexities involved in maintaining several long running branches. The
only thing that the tooling can do is make it as easy as possible to
submit changes.

Example: Scientific Python

One of the key ideas behind the move to both git and Github is that a
feature of a DVCS, the repository hosting, and the workflow used is the
social network and size of the community using said tools. We can see
this is true by looking at an example from a sub-community of the Python
community: The Scientific Python community. They have already migrated
most of the key pieces of the SciPy stack onto Github using the Pull
Request-based workflow. This process started with IPython, and as more
projects moved over it became a natural default for new projects in the
community.

They claim to have seen a great benefit from this move, in that it
enables casual contributors to easily move between different projects
within their sub-community without having to learn a special, bespoke
workflow and a different toolchain for each project. They've found that
when people can use their limited time on actually contributing instead
of learning the different tools and workflows, not only do they
contribute more to one project, but that they also expand out and
contribute to other projects. This move has also been attributed to the
increased tendency for members of that community to go so far as
publishing their research and educational materials on Github as well.

This example showcases the real power behind moving to a highly popular
toolchain and workflow, as each variance introduces yet another hurdle
for new and casual contributors to get past and it makes the time spent
learning that workflow less reusable with other projects.

References

Copyright

This document has been placed in the public domain.

[1] Open Hub Statistics

[2] Hg-Git mercurial plugin