Radim Řehůřek edited this page Mar 29, 2016 · 38 revisions

The code for gensim is hosted on GitHub. Contributions in the form of pull requests are welcome, whether for code or documentation. You may also report an issue or bug there.

If you don't feel confident in your git and/or Python skills, you can get up to speed with these tutorials. If your contribution is more of an idea than code, use the gensim mailing list.

Documentation

Python docstrings are for an overview of the functionality, to anchor a class or method conceptually and check their parameters, not to describe how things work internally in detail. For all other cases, the code ought to be its own documentation. Any non-obvious tricks and coding patterns that may confuse an otherwise literate Python programmer need a source code comment.

Gensim is in permanent need of better tutorials and usage examples, as well as clearer docstrings. Contributions are most welcome.

API documentation that appears on the web is automatically generated from docstrings, via Sphinx:

cd docs/src
make clean html
make upload # upload to web: need write access to radimrehurek.com

Git flow

The branching model follows http://nvie.com/posts/a-successful-git-branching-model/:

  • master branch is stable, HEAD is always the latest release
  • develop branch contains the latest code for the next release.
  • various feature branches, to be merged into develop upon completion

For a new feature, branch off develop:

$ git checkout -b myfeature develop

To merge a feature back into develop:

$ git checkout develop
$ git merge --no-ff myfeature
$ git branch -d myfeature
$ git push --tags origin develop

Making a new release

Check that the OSX and Windows builds that track develop are passing: MacPython PR and Appveyor.

To start a new release, first branch off develop:

$ export RELEASE=0.7.8
$ git checkout -b release-${RELEASE} develop

Bump up version in setup.py and in docs/src/conf.py.
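
The version bump can be scripted. The sketch below is one possible way and is not part of gensim's tooling: `bump_version` is a hypothetical helper that replaces every literal occurrence of the old version string in a file, on the assumption that the version appears verbatim in both setup.py and docs/src/conf.py. It writes through a temporary file rather than using `sed -i`, whose syntax differs between BSD and GNU sed.

```shell
# bump_version OLD NEW FILE
# Hypothetical helper: replace every literal occurrence of OLD with NEW in FILE.
bump_version() {
    # Escape the dots so that "0.7.7" does not also match e.g. "0x7x7".
    old_pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Write to a temporary file, then copy the result back over FILE;
    # using cat instead of mv keeps FILE's permissions as they were.
    sed "s/${old_pattern}/$2/g" "$3" > "$3.tmp" &&
    cat "$3.tmp" > "$3" &&
    rm "$3.tmp"
}

# Example (0.7.7 is an assumed previous version, used for illustration only):
#   bump_version 0.7.7 "${RELEASE}" setup.py
#   bump_version 0.7.7 "${RELEASE}" docs/src/conf.py
#   git diff   # review the change before committing
```

Because the replacement is purely textual, check the resulting diff by hand: any unrelated string that happens to equal the old version would be rewritten too.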

To finalize the release, re-generate Cython files (if changed):

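    # regenerate word2vec C file, then strip trailing whitespace from it
    # (sed -i '' is the BSD/MacOS form; with GNU sed, use plain sed -i)
    cython gensim/models/word2vec_inner.pyx
    sed -i '' -e's/[[:space:]]*$//' gensim/models/word2vec_inner.c
    # regenerate doc2vec C file, same procedure
    cython gensim/models/doc2vec_inner.pyx
    sed -i '' -e's/[[:space:]]*$//' gensim/models/doc2vec_inner.c

    # commit the regenerated files on the release branch
    git commit -a -m "regenerated C files with Cython"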

Then merge the release branch into master and develop, tagging master:

git checkout master
git merge --no-ff release-${RELEASE}
git tag -a ${RELEASE} -m "${RELEASE}"
git push --tags origin master
git checkout develop
git merge --no-ff release-${RELEASE}
git branch -d release-${RELEASE}
git push origin develop

Add a text description of the release at https://github.com/piskvorky/gensim/tags

Update the MacPython repo to build the OSX wheel and upload it to Rackspace:

git clone https://github.com/MacPython/gensim-wheels.git /tmp/gensim_wheels
cd /tmp/gensim_wheels
# delete gensim
rm -rf gensim
git submodule init
git submodule update
cd gensim
git checkout master && git pull
cd ..
git add gensim
git commit -m "updating gensim to latest"
git push

Wait for the build and upload to finish at https://travis-ci.org/MacPython/gensim-wheels

Check that the Appveyor build has finished and uploaded the new wheel to storage.

Create a clean folder for the upload and run the tests:

git clone https://github.com/piskvorky/gensim.git --branch master /tmp/gensim
cd /tmp/gensim
DTM_PATH=~/dtm/dtm_binary MALLET_HOME=~/Downloads/mallet-2.0.7 python ./setup.py test # run tests

Download the Windows and OSX wheels (note: designed for upload with twine):

rm -rf dist && python setup.py sdist fetch_artifacts

Upload to PyPI:

twine upload dist/*

If you don't have twine, use this instead:

python ./setup.py register sdist upload

Please don't create a local UNIX wheel, as PyPI doesn't accept UNIX wheels. Please don't create a universal wheel either, as pip prefers it over C extensions.

Finally, update the documentation at http://radimrehurek.com/gensim:

$ cd docs/src && make clean html upload

Code style

No trailing whitespace in source code. Whitespace on otherwise empty Python lines (the lines separating blocks of code, methods etc.) is fine and even desirable, but git does not support this exception natively, so it is difficult to accommodate.

To make sure you didn't introduce any trailing whitespace in your commit, enable the pre-commit hook (i.e. move the file .git/hooks/pre-commit.sample to .git/hooks/pre-commit). This file should contain the line:

exec git diff-index --check --cached $against --

Change it to:

exec git diff-index --check --cached $against gensim
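
The two steps above can be scripted. This is a minimal sketch, assuming the sample hook contains exactly the line shown above; `enable_whitespace_hook` is a hypothetical helper name. It copies the sample rather than moving it, and takes the hooks directory as an argument so it can be tried on a scratch copy first.

```shell
# enable_whitespace_hook [HOOKS_DIR]
# Hypothetical helper: install pre-commit.sample as pre-commit, restricting
# the whitespace check to the gensim directory. HOOKS_DIR defaults to .git/hooks.
enable_whitespace_hook() {
    hooks_dir=${1:-.git/hooks}
    # Rewrite the trailing "--" on the diff-index line to "gensim";
    # every other line of the sample is copied through unchanged.
    sed 's/^\(exec git diff-index --check --cached \$against\) --$/\1 gensim/' \
        "$hooks_dir/pre-commit.sample" > "$hooks_dir/pre-commit" &&
    chmod +x "$hooks_dir/pre-commit"   # git only runs executable hooks
}

# Usage, from the root of your gensim clone:
#   enable_whitespace_hook .git/hooks
```

If the sample shipped with your git version words that line differently, the sed expression will not match and the hook is installed unmodified, so check the last line of .git/hooks/pre-commit afterwards.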

Now every commit that touches the gensim directory is first checked by this hook. If there is trailing whitespace, the hook will refuse the commit and show you the offending line(s), which you must fix first. Fix them either manually or en masse with:

    $ # for MacOS and other BSDs
    $ find gensim -name '*.py' | xargs sed -i '' 's/[[:space:]]*$//'
    $ # for GNU sed (i.e. GNU/Linux distros)
    $ find gensim -name '*.py' | xargs sed -i 's/\s*$//'
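
To see which lines are affected before rewriting anything, or to confirm afterwards that none remain, you can list them. This is a minimal sketch; `list_trailing_ws` is a hypothetical helper name. It relies only on find and grep, so it should behave the same with BSD and GNU tools.

```shell
# list_trailing_ws [DIR]
# Hypothetical helper: print file:line:content for every line that ends in
# whitespace, in all *.py files under DIR (default: gensim).
list_trailing_ws() {
    # /dev/null is passed as an extra file so that grep always prints the
    # file name, even when find hands it a single file.
    find "${1:-gensim}" -name '*.py' -exec grep -n '[[:space:]]$' /dev/null {} +
}

# Usage: empty output means there is nothing left to fix.
#   list_trailing_ws gensim
```

Lines with Windows (CRLF) line endings are reported as well, since the carriage return counts as whitespace.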

Legal

By submitting your contribution to be included in the gensim project, you agree to assign joint ownership of your changes (your code patch, documentation fix, whatever) to me, Radim Řehůřek.

This means I will have the full rights to incorporate, distribute and/or further modify your changes, without any fees or restrictions from you.

This is needed in open-source projects because you are automatically the copyright owner of your contribution by law, and I couldn't do anything with it without your permission.

An example file header:

# Copyright (C) 2016 Radim Rehurek <radim@rare-technologies.com>
# Copyright (C) 2016 Your Name <me@gmail.com>