Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to Poetry for dependency management #605

Merged
merged 36 commits into from
Sep 13, 2022
Merged

Conversation

juhoinkinen
Copy link
Member

@juhoinkinen juhoinkinen commented Aug 11, 2022

Closes #601.

Besides creating the pyproject.toml file with dependencies, these are done:

  • Move pytest configuration from pytest.ini file to pyproject.toml section/table; delete pytest.ini
  • Delete MANIFEST.in file (the same contents could have been explicitly marked to be included in pyproject.toml, but the contents seemed to be included in the distribution given by poetry build automatically, TODO: needs to be double-checked Checked: all necessary static content for Web UI are automatically included.)
  • Delete now obsolete setup.py
  • Make bumpversion update version string in pyproject.toml
  • Update README.md for usage of Poetry
  • Add poetry.lock to .gitignore
  • Use Poetry in GitHub Actions CI/CD pipeline

Points to note:

For bumping version for release there exists the poetry version command that can bump patch, minor and major parts, but that applies only to the version string in pyproject.toml. There are also poetry plugins with different customizations (also for modifying version strings in any file, e.g. monim67/poetry-bumpversion), but it seems that the current approach with bumpversion is well applicable. Importantly the now-used bumpversion handles the git tag creation and commit, which I did not find any Poetry tool/plugin to be doing, and also allows bumping to *-dev versions. For now the bumpversion config should still be kept in setup.cfg; there is an open issue for pyproject.toml support.

I still try to refine the GH Actions pipeline to make a system-wide installation (I thought this should be working with setting POETRY_VIRTUALENVS_CREATE: false, but it is not) and by adding caching. To verify the release process works, I think a test release to test-PyPI could be done.

@codecov
Copy link

codecov bot commented Aug 11, 2022

Codecov Report

Merging #605 (4a6e2b5) into master (da680d3) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #605   +/-   ##
=======================================
  Coverage   99.61%   99.61%           
=======================================
  Files          87       87           
  Lines        6038     6038           
=======================================
  Hits         6015     6015           
  Misses         23       23           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@juhoinkinen
Copy link
Member Author

Force-pushed after some commit clean-up.

The poetry install install commands has large time overhead (~15 s), so it seems best to call it only once to install all (regular and optional) dependencies.

Still needs to check the release job with test-pypi.

@juhoinkinen
Copy link
Member Author

I checked with the branch test-pypi-release that the current setup works for publishing releases (using necessary alterations for using test-PyPI project as destination).

@juhoinkinen
Copy link
Member Author

juhoinkinen commented Aug 16, 2022

There was weird behaviour in resolving flake8 version: when running poetry install version 2.5.5 was installed, and then running poetry update the version was upgraded to 4.0.1 (with the accompanying dependency updates). This happened also when the project included only autopep8 and flake8. Clearing Poetry cache had no effect.

Pinning autopep8 did not change anything, only pinning flake8 version to 4.* worked: now running poetry update just after install should not find anything to update.

To do when merged: The Wiki page for installing dependencies for optional features should be updated. No need: the current pip commands can still be used.

@juhoinkinen
Copy link
Member Author

The installation of Poetry itself is now instructed in README to be done by curling and running the installation script from https://install.python-poetry.org/, and in GH Actions the install-poetry action does the same.

Poetry documentation states also pipx and manual (with pip and venv) ways to install Poetry. They could be a bit more secure, but also less convenient ways.

Maybe the Poetry installation in GH Actions could be done without the ready-made action but with just curling the script and running md5sum check for it?

@juhoinkinen juhoinkinen marked this pull request as ready for review August 16, 2022 11:11
@osma
Copy link
Member

osma commented Aug 16, 2022

I tested this briefly on two machines: a desktop with fixed Ethernet and a laptop with Wi-Fi. For both machines, the "Resolving dependencies" step in poetry install took a long time: 108 seconds for the desktop and 134 seconds for the laptop. In my understanding, this isn't something you'd do every day, so maybe it's not too worrisome. But it's quite a bit slower than the pip equivalent.

More worrying is that the install slowdown also affects CI jobs. The latest run with Poetry took 6 minutes 42 seconds, while a recent non-Poetry run took 3 minutes 55 seconds. So the CI run now takes almost three minutes longer! I wonder if there's anything that could still be done to speed up the installation step under CI?

However, slow install seems to be a known issue in Poetry (and there are many other similar but more recent issues). It's also the first question in the FAQ. According to the FAQ, the main reason for the slow operation is the lack of dependency metadata on PyPI, which leads to lots of extra work downloading and inspecting individual package versions from PyPI. This gives me two ideas for possible improvements of this PR:

  1. Specify the versions of packages we depend on more strictly. No * or >x.x allowed! There's another FAQ entry about this. Maybe it will reduce the work Poetry has to perform.
  2. It appears that Poetry will cache the information it has painstakingly collected from PyPI. For example, if I delete poetry.lock and run poetry install again, it only takes 6-7 seconds on both machines. Try to keep that cache under GitHub Actions!

I like the way Poetry shows outdated packages a lot:

$ poetry show --outdated
autopep8             1.6.0  1.7.0  A tool that automatically formats Python code to conform to the PEP 8 style guide
blis                 0.7.8  0.9.1  The Blis BLAS-like linear algebra library, as a self-contained C-extension.
catalogue            2.0.8  2.1.0  Super lightweight function registries for your library
click                8.0.4  8.1.3  Composable command line interface toolkit
coverage             6.2    6.4.3  Code coverage measurement for Python
flake8               4.0.1  5.0.4  the modular source code checker: pep8 pyflakes and co
flatbuffers          1.12   2.0    The FlatBuffers serialization format for Python
gast                 0.4.0  0.5.3  Python AST that abstracts the underlying Python version
google-auth-oauthlib 0.4.6  0.5.2  Google Authentication Library
mccabe               0.6.1  0.7.0  McCabe checker, plugin for flake8
protobuf             3.19.4 4.21.5 Protocol Buffers
pycodestyle          2.8.0  2.9.1  Python style guide checker
pydantic             1.8.2  1.9.2  Data validation and settings management using python 3.6 type hinting
pyflakes             2.4.0  2.5.0  passive checker of Python programs
scikit-learn         1.1.1  1.1.2  A set of python modules for machine learning and data mining
scipy                1.8.1  1.9.0  SciPy: Scientific Library for Python
simplemma            0.7.0  0.8.0  A simple multilingual lemmatizer for Python.
smart-open           5.2.1  6.0.0  Utils for streaming large files (S3, HDFS, GCS, Azure Blob Storage, gzip, bz2...)
spacy                3.3.1  3.4.1  Industrial-strength Natural Language Processing (NLP) in Python
tensorboard          2.9.1  2.10.0 TensorBoard lets you watch Tensors Flow
thinc                8.0.17 8.1.0  A refreshing functional take on deep learning, compatible with your favorite libraries
typer                0.4.2  0.6.1  Typer, build great CLIs. Easy to code. Based on Python type hints.
yake                 0.4.5  0.4.8  Keyword extraction Python package

(the color coding gets lost in the copy&paste)

@osma
Copy link
Member

osma commented Aug 16, 2022

I tested re-running the CI jobs for this PR. First attempt took 5min 30s, second took 5min 35s. So not quite as slow as the first time, but still much slower than without Poetry. The "Install Python dependencies" step seems to take around 2 minutes for all three Python versions.

@osma
Copy link
Member

osma commented Aug 17, 2022

I see that there are quite a few Poetry-related actions in the GitHub Actions Marketplace: https://github.com/marketplace?type=actions&query=poetry+

Maybe one of the others would be faster or otherwise better than the current one?

@juhoinkinen
Copy link
Member Author

  1. Specify the versions of packages we depend on more strictly. No * or >x.x allowed! There's another FAQ entry about this. Maybe it will reduce the work Poetry has to perform.

This did not help (at least much). With strict pinning: Version solving took 93.671 seconds. Previously, without strict pinning, this took ~100 seconds.

  1. It appears that Poetry will cache the information it has painstakingly collected from PyPI. For example, if I delete poetry.lock and run poetry install again, it only takes 6-7 seconds on both machines. Try to keep that cache under GitHub Actions!

This worked! Caching directory ~/.cache/pypoetry/cache/repositories makes the dependencies resolution to go in few seconds, and the CI runs overall in less than 3 minutes.

A bit surprising that this was not cached already, as the virtual environment was, by the setup-python action feature. Well, that cache is stated to apply to virtual environment (only) as stated in the docs, but I missed that.

juhoinkinen and others added 13 commits August 17, 2022 14:27
Revert "Fix pytest-cov version"
This reverts commit c352262.

Revert "Try to speed dependency resolution by pinning versions to minor level of all deps"
This reverts commit 0067a15.

Revert "More verbose poetry install for getting info about slow steps"
This reverts commit 99348be.
Include also .local/bin where pipx places symlink for Poetry
Builder stage was needed for building fastText, which is now installed
with prebuilt wheels
Resolved conflicts due to update of setup.py by adjusting pyproject.toml:
- upgrade to Simplemma 0.8
- upgrade to Click 8.1.*
Allows using cached nltk data layer when source changes
@juhoinkinen
Copy link
Member Author

I changed README.md to instruct to use pipx instead curl|bash for installing Poetry, as pipx might be the way for Annif installations for end-users in the future.

Without layer caching building Docker image with Poetry is slow (6-7 mins in GH Actions), but actually has not been much faster with pip. The layer caching could be implemented in GH Actions (an example), but that can be a different PR, and the Docker build occurs anyway only on pushes to master, not to every push to PR branches.

@juhoinkinen juhoinkinen modified the milestones: Short term, 0.59 Sep 7, 2022
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@juhoinkinen juhoinkinen merged commit 510a7c0 into master Sep 13, 2022
@juhoinkinen juhoinkinen deleted the issue601-poetry branch September 13, 2022 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switch to Poetry for dependency management? 1.2.0 Release
2 participants