Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEP 639: Implement License-Expression and License-File #828

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

ewdurbin
Copy link
Member

@ewdurbin ewdurbin commented Sep 3, 2024

Work in progress, current open questions:

  • Is the dependency on license-expression OK? Should vendoring or a new parser be considered?
    • went the vendoring route per feedback from @abravalheri
    • migrated to a pure python parser based on pypa/hatchling's implementation
  • How much validation to apply to License-File, as-is as far as metadata goes the only clear spec is "Path delimiters MUST be the forward slash character (/), and parent directory indicators (..) MUST NOT be used.". It is unclear how to check for other delimiters, as paths may have no delimiters whatsoever
    • validations with path lib seem to fit the bill!
  • How to handle deprecations/conflicts. License and License-Expression are mutually exclusive, but it doesn't appear that such a conflict has existed before.
    • Based on the current behavior of the library, this kind of concern appears to be left to the caller. The metadata module appears to be focused on correctly parsing individual fields.

pyproject.toml Outdated Show resolved Hide resolved
@cdce8p
Copy link

cdce8p commented Sep 4, 2024

IIRC one of the concerns raised in the Discourse thread was regarding the total file size if license_expression would be added as a dependency or fully vendored. Just as example, the scancode-licensedb-index.json file is over 800kB.

I believe that's one of the reasons @ofek went a slightly different route with #799.

@ewdurbin
Copy link
Member Author

ewdurbin commented Sep 4, 2024

See a828bc7, which is a variant of #799 to build/use a minimal compatible json blob from the upstream for use with the vendored license-expression. Also excluding the data files from the vendored package from the resulting sdist/wheel to save space.

Overall resulting size difference for sdist and wheel

file 24.1 before after
sdist 148 kB 276 kB 212 kB
whl 54 kB 180 kB 112 kB

uncompressed:

file 24.1 before after
sdist 2539 kB* 1690 kB 829 kB
whl 176 kB 1281 kB 426 kB

* this appears to be due to the inclusion of tests/.pytest_cache/v/cache/lastfailed and tests/.pytest_cache/v/cache/nodeids files in the sdist, without it it is 608 kB.

Copy link
Sponsor Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you please save the data as Python so deserialization is not required?
  2. I think we shouldn't merge this until we satisfy the parsing needs of backends (i.e. user defined metadata) rather than just consumers of packages like PyPI.
  3. Although I'm not a maintainer, I would be quite weary of packaging beginning to vendor dependencies.

@ewdurbin
Copy link
Member Author

ewdurbin commented Sep 4, 2024

  1. Can you please save the data as Python so deserialization is not required?

Looked closer and saw that hatchling's implementation is more complete than I initially understood, so I've migrated this PR to use an adapted version of that, and moved storage of the license data back to using the python based representation

  1. I think we shouldn't merge this until we satisfy the parsing needs of backends (i.e. user defined metadata) rather than just consumers of packages like PyPI.

Which needs are those? Can you delineate them? I'm interested in seeing this merged so PyPI's implementation can progress.

  1. Although I'm not a maintainer, I would be quite weary of packaging beginning to vendor dependencies.

See response to 1. :)

@ofek
Copy link
Sponsor Contributor

ofek commented Sep 4, 2024

  1. I think we shouldn't merge this until we satisfy the parsing needs of backends (i.e. user defined metadata) rather than just consumers of packages like PyPI.

Which needs are those? Can you delineate them? I'm interested in seeing this merged so PyPI's implementation can progress.

Now that you have normalize_license_expression which also raises an exception for invalid stuff, every backend would be satisfied. Thanks!

@ewdurbin ewdurbin mentioned this pull request Sep 4, 2024
Copy link
Sponsor Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fantastic to my non-maintainer eyes!

@befeleme
Copy link

befeleme commented Sep 5, 2024

Just a tiny nitpick: The commit message contains a typo: "PEP 693" instead of 639 - this will make digging in the git history harder in the future.

Copy link
Contributor

@hroncok hroncok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much validation to apply to License-File, as-is as far as metadata goes the only clear spec is "Path delimiters MUST be the forward slash character (/), and parent directory indicators (..) MUST NOT be used.". It is unclear how to check for other delimiters, as paths may have no delimiters whatsoever

I think the spirit of the rule is that there are no backlash-delimeters in the path. However, I am not sure if this is enforceable if a filename can contain a backlash.

Also, as the path MUST be relative, we could assert it does not start with /.


I also noticed a license file is never checked for existence. Is that intentional? The PEP states it MUST be present (and the file content MUST be UTF-8 encoded text).

tests/test_metadata.py Outdated Show resolved Hide resolved
although inconsistent (see Requires-External, Provides-Extra, Provides-Dist, Obsoletes-Dist...) pluralize the result for multiple use of `License-File`

https://packaging.pypa.io/en/stable/metadata.html#packaging.metadata.RawMetadata

> Any core metadata field that can be specified multiple times or can hold multiple values in a single field have a key with a plural name.
Per feedback, integrate a variant of pypa#799 that builds a minimal JSON dataset to feed vendored license-expression

32K	src/packaging/_spdx.json

vs

848K	src/packaging/_vendor/license_expression/data/scancode-licensedb-index.json
- back away from vendoring
- adapt hatchling's license expression parser to accept well-formed LicenseRef-
- back to python storage of license data
@ewdurbin ewdurbin changed the title PEP 693: Implement License-Expression and License-File PEP 639: Implement License-Expression and License-File Sep 5, 2024
@ewdurbin
Copy link
Member Author

ewdurbin commented Sep 5, 2024

Just a tiny nitpick: The commit message contains a typo: "PEP 693" instead of 639 - this will make digging in the git history harder in the future.

🤦🏼 resolved with a rebase/force-push

@brettcannon
Copy link
Member

@ewdurbin I don't know why GitHub is showing my review as pending. Just let me know when you're ready for me to look at this PR again.

@brettcannon brettcannon removed their request for review September 13, 2024 17:57
Copy link
Member

@pradyunsg pradyunsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll follow up separately with a review of the code but this LGTM based on a cursory look of that.

This also needs documentation, which I think we should add in this PR itself, along with using autodoc here.

src/packaging/licenses/spdx.py Outdated Show resolved Hide resolved
src/packaging/licenses/__init__.py Show resolved Hide resolved
tasks/licenses.py Outdated Show resolved Hide resolved
tasks/licenses.py Outdated Show resolved Hide resolved
tasks/licenses.py Outdated Show resolved Hide resolved
@ewdurbin
Copy link
Member Author

Docs added in 701217b, depends on pypa/packaging.python.org#1595

@ewdurbin
Copy link
Member Author

re-requesting review from @pradyunsg and @brettcannon.


python_expression = " ".join(python_tokens)
try:
result = eval(python_expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could globals and locals be passed in here? That feels safer, regardless if it is or not.

@ewdurbin
Copy link
Member Author

Outstanding un-resolved comments I am not opinionated on:

If some consensus around these could be found, are we good to proceed with a merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.