Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hash of dependency instead of object in a provider's deferred cache #5903

Closed

Conversation

Jinior
Copy link

@Jinior Jinior commented Jun 23, 2022

Pull Request Check List

Resolves: #5902

  • Added tests for changed code.
  • [n.a.] Updated documentation for changed code.

This is my first pull request so apologies if I did something wrong, I am open to feedback!

As described in the referred issue poetry often repeatedly re-downloads packages when resolving dependencies. This is due a bug which causing poetry thinking cached packages are not cached yet. The providers cache is a dictionary which uses a Dependency object a key (and the package as the value). During a dictionary lookup python checks equivalence on both the hash and eq for keys, but in some cases the Dependency object can change breaking the eq check.

This for example happens with the URL dependency, where the package version in only known after downloading which leads to the following:

  1. we check if the dependency is in the cache using the URLDependency as the key
  2. we download the package
  3. we check the package version from the downloaded file
  4. we update the version in the dependency object
  5. we store the package into the cache using the updated dependency object

This is shown in the code snippet below:

if dependency in self._deferred_cache:
    return [self._deferred_cache[dependency]]

package = self.get_package_from_url(dependency.url)

self.validate_package_for_dependency(dependency=dependency, package=package)

# truncated irrelevant stuff for this example

dependency._constraint = package.version
dependency._pretty_constraint = package.version.text

self._deferred_cache[dependency] = package

When searching for this dependency again we check the cache dictionary with a new URLDependency as a key, which has not been updated with the version number yet. We have missed the cache.

Adding the package to the cache before updating the dependency object is not an elegant solution in my opinion because that would make a package search with the updated dependency object miss the cache.

This pull request solves the problem by using the hashes of the Dependency objects as keys in the cache dictionary. The currently implemented hash methods already uniquely describe the package they download. (For example the URLDependency hash is made up of the url and package name; the VCS cache hash is made up of the package name, repository, branch, tag, and revision).

Another approach would have been to re-implement the eq method but I think this is also the wrong approach, because an URL dependency with an unbounded version is different than a URL dependency with a set version.

@radoering
Copy link
Member

Thanks, for your contribution and the detailed information. I'm not sure whether the __eq__ approach isn't better. Maybe, __eq__ should not consider the constraint for direct origin dependencies since the constraint is always implicitly given? Two direct origin dependencies with the same origin can't have different constraints, can they? The result should be the same as using the hash in _deferred_cache.

Maybe, we should try if replacing

return super().__eq__(other) and self._constraint == other.constraint

in

https://github.com/python-poetry/poetry-core/blob/4f9c35272cf53f56596c8a71175da6e50bb71b13/src/poetry/core/packages/dependency.py#L577

by

return super().__eq__(other) and (
    self._constraint == other.constraint or self.is_direct_origin()
)

does the trick, too. Do you want to try and create an alternative PR if it works so we can decide which path to take?

@radoering
Copy link
Member

@dimbleby I suppose, this PR is not handling the same download as #5871, but one of the others you mentioned, is it?

@dimbleby
Copy link
Contributor

@radoering yes this addresses a different one of the downloads. I think we have (at least!):

  • during locking poetry twice goes through search_for_url(), this MR is aimed at making sure that one download deals with both of those

  • during installation poetry again downloads wheels, Use locally cached wheels during install #5871 is aimed at arranging that if it already has the wheel in its cache then it can reuse it

Ideally poetry should also arrange that the search_for_url() download goes into the cache so that it can be reused during installation.

@Jinior
Copy link
Author

Jinior commented Jun 23, 2022

Thanks, for your contribution and the detailed information. I'm not sure whether the __eq__ approach isn't better. Maybe, __eq__ should not consider the constraint for direct origin dependencies since the constraint is always implicitly given? Two direct origin dependencies with the same origin can't have different constraints, can they? The result should be the same as using the hash in _deferred_cache.

Maybe, we should try if replacing

return super().__eq__(other) and self._constraint == other.constraint

in

https://github.com/python-poetry/poetry-core/blob/4f9c35272cf53f56596c8a71175da6e50bb71b13/src/poetry/core/packages/dependency.py#L577

by

return super().__eq__(other) and (
    self._constraint == other.constraint or self.is_direct_origin()
)

does the trick, too. Do you want to try and create an alternative PR if it works so we can decide which path to take?

@radoering thank you for the suggestion. I shall create a second PR today.
You are correct that two direct origin dependencies should have the the same constraints. But theoretically it could happen when a branch or path gets updated during locking but that is an unlikely situation.

I am unsure how __eq__ will be used in the future. I can imagine that a developer would assume that two dependencies with different constraints will not test equal. Perhaps they would be testing if the direct origin has changed.

What also comes to mind now is that we only update the constraint field in the dependency if it is not found in the cache; when we do have a cache hit we immediately return the package. (at least for the URLDependency). Should we change that?

@Jinior
Copy link
Author

Jinior commented Jun 24, 2022

I had checked the box for having added tests because I thought it was inapplicable in this case, but I am starting to think whether we should test if package are re-downloaded. I am however unsure how to write such a test.

@radoering
Copy link
Member

Additional tests are usually a good idea. 😄

A simple unit test in test_provider.py could look like this:

  • mock the method that downloads the file
  • search for dependency
  • change constraint of dependency
  • search for dependency (with changed constraint) again
  • check that "download method" was only called once

You might also write a test for version_solver that triggers the repeated downloads and check that the "download method" is only called once. Although this makes the reason for the test more explicit, I think a simple unit test with a comment about the use case might be enough.

@radoering
Copy link
Member

Using the hash as key probably makes _deferred_cache unusable for pure Dependency (because constraint is not considered in hash) and VCSDependency (because neither source_reference nor source_resolved_reference is considered in hash). It seems the former is not relevant because _deferred_cache is only used for direct origin dependencies. However, the latter seems to be relevant. I suppose, using the hash as key will break use cases were the same repo is used twice with different references (branch, tag, ...). Thus, python-poetry/poetry-core#405 might be the better alternative.

@radoering
Copy link
Member

Due to the shortcomings mentioned in my previous comment I'm closing this one in favor of python-poetry/poetry-core#405. A test checking that a dependency is only downloaded once is still appreciated. Please create a separate PR if you want to write such a test and note that the test should fail until a new poetry-core version is released and used in master of poetry.

@radoering radoering closed this Jul 3, 2022
Copy link

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Poetry repeadetly re-downloads url dependencies during dependency solving
3 participants