Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faulty license detection #777

Closed
filiptronicek opened this issue Jul 12, 2023 · 18 comments
Closed

Faulty license detection #777

filiptronicek opened this issue Jul 12, 2023 · 18 comments
Labels
bug Something isn't working

Comments

@filiptronicek
Copy link
Member

As described in EclipseFdn/publish-extensions#689, there is something wrong with how we detect licenses.

If you take GitLens and publish it at its latest version (14.0.1), the MIT License gets displayed in the WebUI for it. If you inspect the file itself, you discover that the license is indeed part MIT, but has a header basically stating that it does not apply to all code in the repository. This means that we tell our users the incorrect license.

I am trying to figure out how we do these license checks, but it looks like we're being a bit to lax.

We should rather have false negatives than false positives.

image image
@filiptronicek filiptronicek added the bug Something isn't working label Jul 12, 2023
@kineticsquid
Copy link
Contributor

@filiptronicek We're taking a different approach from VS Marketplace. What was the thinking?

@filiptronicek
Copy link
Member Author

I am unaware of what the initial thinking was, but I think it is a nice way to show license types where we can for users to more easily determine what extensions are FOSS and which are not. I think we should re-visit it and see how much value it adds, and whether it's still worthwhile keeping.

@amvanbaren
Copy link
Contributor

amvanbaren commented Jul 13, 2023

@filiptronicek The server searches for the license by using a list of well known open source licenses (MIT, GPL, BSD, etc.). The server doesn't take into account dual licensed extensions. Even if we were able to, I don't know how to concisely name this license (MIT & PLUS?) It's probably better to just add the license file to the Resources section and not display the license name in the header anymore.

@filiptronicek
Copy link
Member Author

filiptronicek commented Jul 13, 2023

@amvanbaren I agree with not showing the license name. If we want to be a bit clever, though, we could use an approach employed by GitHub, which I like: if a license format exactly matches the license file, use it. If it doesn't, just display License with a link to it.1

image image
image

Footnotes

  1. By the way, GitHub uses https://github.com/licensee/licensee

@amvanbaren
Copy link
Contributor

if a license format exactly matches the license file, use it. If it doesn't, just display License with a link to it.

Sounds good. We can call licensee using JRuby.

The other thing is that the ExtensionProcessor assumes that an extension only has 1 license file. It should look for all license files instead.

@kineticsquid
Copy link
Contributor

@amvanbaren This could represent a pretty substantial visible change to our users/adopters. Is there an easy way to determine how may extensions would have what shows for their license change?

@amvanbaren
Copy link
Contributor

@kineticsquid Is there an easy way to determine how may extensions would have what shows for their license change?

No, there isn't really an easy way. The quickest way is to get all licenses for each extension using licensee and compare that to the extension license in the database.

@amvanbaren
Copy link
Contributor

just noticed that rust-analyzer is dual licensed: https://open-vsx.org/extension/rust-lang/rust-analyzer/0.4.1584

@filiptronicek
Copy link
Member Author

filiptronicek commented Jul 13, 2023

@amvanbaren I think that comes from the extension manifest

@kineticsquid
Copy link
Contributor

@amvanbaren @filiptronicek I've just about finished running licensee on all the extensions. I'm having to do it in batches because it makes enough GH API calls that I run into rate limiting issues. Here's what I'm seeing:

In general, we can divide the extensions into three classes:

  1. An extension has license specified in package.json.
  2. An extension has license specified in files in package.json.
  3. An extension has neither of the above specified.

Running licensee on each extension will result in one of the following for each of the above classes:

  1. No license detected, or unable to run licensee because repo is private and can't be cloned.
  2. License detected and matches what's in package.json.
  3. License detected that does not match what's in package.json.

Given the above 3x3 matrix, what's the thinking on how we might use licensee and how that use could impact what shows up on an extension's page?

@amvanbaren
Copy link
Contributor

@kineticsquid Well, I would compare it to the current situation. I don't think much will change, just that the license detection is more accurate.

Licensee is able to detect multiple licenses. If we stick to the 1 license per extension version then we just take the license with the highest confidence score.

On the other hand it's quite a fundamental change if we want to provide all found licenses. Not just for the webui, but also for the database and API endpoints. API consumers like VSCode or VSCodium only expect 1 license, so we also have to keep that in mind.

@filiptronicek
Copy link
Member Author

API consumers like VSCode or VSCodium only expect 1 license

Worst case we hack around it with licenses.join(" & "), so that we display all info we have in the manner clients expect. I don't think they do any format checking.

@amvanbaren
Copy link
Contributor

@filiptronicek It's a link to the license file, not the license names. I guess we can divide an extension's licenses into a main license and other licenses.

@kineticsquid
Copy link
Contributor

I tried a number of different permutations of license files and license entries in package.json to see how our system behaved. By and large, with one exception, it behaved as I would have expected, with a couple of potential bugs (not high priority IMO) E.g.,

  • specify license in package.json that is not an official SPDX license term and do not include a license file. 404 as the license doesn't exist at the spdx.org URL we generate.
  • attempt to publish an extension and get an error, e.g. no license. Re-package and attempt to publish again and get an error that the version already exists.

The exception: based my observations, with this extension, https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid, I thought that if one specified license in package.json and included a file named license or LICENSE, what would be displayed on the extension's page would be the package.json entry with a link to the license file. I say this because I noticed that in GitLens package.json, the author specifies license and it is not 'MIT'.

Attempting to reproduce this, I took the LICENSE file from GitLens and removed the copyright header. I then included the same package.json entry as GitLens: package.json: "license": "SEE LICENSE IN LICENSE". With this package, 'MIT' gets displayed as the license. That's version 0.0.19: https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid/0.0.19.

In 0.0.21, https://open-vsx.org/extension/kineticsquid/jk-helloworld-minimal-sample-kineticsquid/0.0.21, I changed the package.json entry to: "license": "My special license", but did not change the license file. In this case, 'My Special License' is displayed.

it seems strange that the package.json entry would affect the license detection.

@kineticsquid
Copy link
Contributor

The above said, I think at a minimum, if an author specifies license in package.json, that's the text we should display and link to the provided license file. It's up to the author/publisher to ensure this information is correct. If an entry is misleading, a user reports it and we investigate.

I'm also wondering if attempting license detection is worth it. E.g. if there is a license file and no entry in package.json we display 'Provided License' and link to the file.

@kineticsquid
Copy link
Contributor

Thinking about this some more, I think this is what open-vsx should do:

  • If no license in package.json, error on publishing. This happens today.
  • If no license in package.json and license file present, do not attempt to understand the license file and instead display Provided License and link to the file.
  • If license is specified in package.json, display those words. If a license file is present, link to it.

@waynebeaton @mbarbero FYI

@amvanbaren
Copy link
Contributor

@kineticsquid Do you want to apply the same logic for README and CHANGELOG files?

@kineticsquid
Copy link
Contributor

@amvanbaren No, I think the current behavior is fine. To wit, just a message that indicates either No README available or No changelog available.

@github-project-automation github-project-automation bot moved this from Todo to Done in Eclipse Open VSX Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

No branches or pull requests

3 participants