Pandas Development FAQ
Based on https://github.com/pydata/pandas/pull/4404#issuecomment-22864665 this wiki page gathers oft-asked questions/comments from contributors to make the contribution process a bit less painful.
The aim is to make it easier for:
- Core developers to give advice & accept new code contributions.
- New contributors to find an easier way in for quick and efficient bug fixes or feature additions.
While some questions/comments/advice may be applicable to general programming, these are things that directly relate to pandas development.
- PR == pull request
- core developer: A person who contributes at very high frequency and who is familiar with the code base and development process of pandas.
- contributor: The occasional contributor, perhaps from a specific domain, who contributes bug fixes, features, or documentation at low frequency, may not be an everyday programmer (e.g., a programming scientist or engineer using pandas for data processing), and looks at things from an end-user perspective.
Q: What are some recommendations for writing unit tests?
A: Your test should be self-contained. That is, it should preferably test a single thing, e.g., a method that you've added to the DataFrame class. Your test function/method should start with test_ and the rest of the name should be related to whatever functionality you're testing, like test_replace_with_dict_regex.
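As a minimal sketch of the naming and scope advice above, the test below checks one behavior only. The helper replace_with_dict_regex is a hypothetical pure-Python stand-in (not pandas API) that works on a plain dict of columns, so the example runs without a pandas build. In a real PR the test would call the DataFrame method instead.

```python
import re


def replace_with_dict_regex(columns, patterns, replacement):
    """Hypothetical stand-in for a DataFrame method (not part of pandas).

    `columns` maps a column name to a list of strings; `patterns` maps a
    column name to a regex. Values that fully match their column's regex
    are swapped for `replacement`; everything else is copied unchanged.
    """
    result = {}
    for name, values in columns.items():
        pattern = patterns.get(name)
        if pattern is None:
            result[name] = list(values)
        else:
            result[name] = [
                replacement if re.fullmatch(pattern, v) else v for v in values
            ]
    return result


def test_replace_with_dict_regex():
    # Build the input inside the test so it depends on nothing external.
    columns = {"a": ["foo", "bar"], "b": ["foo", "baz"]}
    result = replace_with_dict_regex(columns, {"a": r"f.*"}, "X")
    expected = {"a": ["X", "bar"], "b": ["foo", "baz"]}
    assert result == expected
```

The test name says what is being exercised, and the expected value is spelled out next to the call, so a failure points at exactly one behavior.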
Q: Help! I can't get the tests to run!
A: You probably either have multiple Python versions installed and there's an ABI (application binary interface) issue, or you forgot to build the extension modules in place. The latter can be done by running
python setup.py build_ext --inplace
from the pandas directory.
Q: Where do I need to change the settings in my GitHub configuration and/or Travis configuration for Travis to start builds from my fork?
A: To be filled out.
Q: Why do I need a Travis file in my repo if it's already in the head repository?
A: Because we're not using Subversion. Okay, seriously, it's because as far as git is concerned your repository is the only one that exists. There's really no such thing as a "head" repository in the eyes of git; that is a concept that we impose on it to make collaboration easier and more effective. This is one of the nice aspects of distributed version control.
Q: Does Travis build documentation?
A: Currently, no. There are some issues surrounding Sphinx error reporting. We are investigating ways to solve this problem.
- What is a typical workflow on my local fork?
- Should I work in a virtualenv?
- Should I work in a virtualenv and then copy my changes over into a clean local fork of my own repo?
Q: Who will be responsible for evaluating my PR?
A: Technically, anyone with push rights to the pydata/pandas repository can evaluate it. In practice, there are a handful of people who are constantly watching the pandas repo for new PRs, so most likely it'll be one of them that evaluates it. I'm not going to list names, but it's not that hard to figure out...
Q: What are the criteria for acceptance of a PR?
A: First and foremost, your fix must not break any existing functionality. One indicator of this is that your Travis build passes. Second, just give it some time. Everyone is busy and @wesm has not (yet?) amassed a pandas development army.
Q: Do I need to open an issue first?
A: Not necessarily. If you want to submit a documentation change, e.g., a typo fix, then opening an issue is not necessary.
Q: What level of commenting is accepted?
A: The common sense level. Don't overdo it on the comments, and make sure if you do comment that your comments explain what your code is doing, not how it is doing it (that's what code is for).
Obligatory example:
BAD:
# increment i
i += 1
GOOD:
# add a person to the person count
i += 1
Q: How can I debug without adding loads of print statements/calls everywhere?
A: You can use the Python standard library's pdb and set a breakpoint. Put import pdb; pdb.set_trace() at the line where you want to stop. ipdb is pdb with tab-completion and a few other bells and whistles, making debugging less painful. There's also ipdbplugin, which allows you to drop into ipdb from nose when a test fails via
nosetests --ipdb # or --ipdb-failures
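For illustration, here is a small sketch of where a pdb breakpoint might go. The function running_total and its debug flag are hypothetical and exist only so that the example runs non-interactively by default; in practice you would usually just drop the pdb.set_trace() line in temporarily and remove it before committing.

```python
import pdb


def running_total(values, debug=False):
    """Sum `values`, optionally pausing in the debugger on each iteration."""
    total = 0
    for value in values:
        if debug:
            # Execution stops here; inspect `value` and `total` at the
            # (Pdb) prompt, then type `c` to continue or `q` to quit.
            pdb.set_trace()
        total += value
    return total


# Pass debug=True to stop inside the loop on every iteration.
print(running_total([1, 2, 3]))
```

At the (Pdb) prompt, p total prints a variable and n steps to the next line.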
Q: Would a logging hook be a solution?
A: That's probably a bit overkill. See the suggestions above.
- It would be nice to add more source comments to quickly understand the context when chiming in to fix an issue
Q: Why don't test functions have a docstring?
A: If your tests are self-contained and aren't sprawling ecosystems of spaghetti then having a docstring is redundant. Also, the test name is usually (and should be!) very descriptive. Remember there's no character limit for variable names. We're not using FORTRAN.
Q: DataFrames and other pandas objects often have many properties/methods. What is the level of detail that I should consider when I'm writing my test(s)?
A: See the previous question/answer. Strive to test one and only one thing. You could even separate out your tests by their formal parameters if you want things to be really self-contained.
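One way to read "separate out your tests by their formal parameters" is one test per parameter, as in the sketch below. The function clip_values is a hypothetical stand-in written in plain Python, not a pandas function.

```python
def clip_values(values, lower=None, upper=None):
    """Hypothetical function under test: bound each value to [lower, upper]."""
    clipped = []
    for value in values:
        if lower is not None and value < lower:
            value = lower
        if upper is not None and value > upper:
            value = upper
        clipped.append(value)
    return clipped


def test_clip_values_lower():
    # Exercises only the `lower` parameter.
    assert clip_values([1, 5, 10], lower=3) == [3, 5, 10]


def test_clip_values_upper():
    # Exercises only the `upper` parameter.
    assert clip_values([1, 5, 10], upper=7) == [1, 5, 7]
```

If the upper-bound logic breaks, only test_clip_values_upper fails, which narrows down the cause immediately.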
Q: Should I consider possible corner cases of my implementation?
A: The answer is a resounding YES! In some cases you may come across something that is very pathological. In those cases you should ask a core developer.
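As a rough sketch of what covering corner cases can look like, the tests below probe empty and all-missing input alongside the ordinary case. The function column_mean and its choice to return NaN for empty input are assumptions made for this example, not pandas behavior.

```python
import math


def column_mean(values):
    """Hypothetical function under test: mean of `values`, skipping NaN.

    Returns NaN when there is nothing left to average.
    """
    present = [v for v in values if not math.isnan(v)]
    if not present:
        return math.nan
    return sum(present) / len(present)


def test_column_mean_empty():
    assert math.isnan(column_mean([]))


def test_column_mean_all_missing():
    assert math.isnan(column_mean([math.nan, math.nan]))


def test_column_mean_skips_missing():
    assert column_mean([1.0, math.nan, 3.0]) == 2.0
```

Each corner case gets its own descriptively named test, so the suite documents the intended behavior for unusual input.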
- Some modules (e.g. io/parsers.py) seem to have grown very complex. It is very time-consuming to find out what is done where just to fix a small bug.
- Splitting them into several modules would be good.
- More in-code comments explaining why something is done, under which conditions, and with what expected result.
- Even internal functions should have a simple one-line docstring.