Help catch typo squatting when adding dependencies #10655

Open
epage opened this issue May 11, 2022 · 6 comments
Labels
C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` Command-add S-triage Status: This issue is waiting on initial triage.

Comments

@epage
Contributor

epage commented May 11, 2022

Problem

A user might run `cargo add fooo` when they mean `cargo add foo` and get the wrong crate.

Proposed Solution

When adding a new registry dependency, warn about existing crates whose names are an edit distance of 1-2 away from the specified crate. We should probably report their descriptions so the user can tell whether the similarly named crate serves a different purpose. If the user didn't pass `--offline`, ideally we'd also report download counts, as a very low download count is a likely smell.
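As a rough illustration of the check, here is a minimal sketch in Rust. It assumes the list of known crate names is already available locally; `edit_distance` and `near_misses` are hypothetical names for this example, not existing cargo functions, and the warning text is made up.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the first i chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = Vec::with_capacity(b.len() + 1);
        cur.push(i + 1);
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Hypothetical helper: known crate names within 1..=max_distance edits of
/// `requested`. An exact match (distance 0) is not a near miss.
fn near_misses<'a>(requested: &str, known: &[&'a str], max_distance: usize) -> Vec<&'a str> {
    known
        .iter()
        .copied()
        .filter(|name| {
            let d = edit_distance(requested, name);
            d >= 1 && d <= max_distance
        })
        .collect()
}

fn main() {
    // Assumed stand-in for the registry's list of crate names.
    let known = ["foo", "serde", "tokio"];
    for name in near_misses("fooo", &known, 2) {
        println!("warning: `fooo` is close to existing crate `{name}`");
    }
}
```

A real implementation would also need the descriptions and download counts mentioned above, which this sketch leaves out.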

Notes

We might also want this for `cargo search` (and `cargo info` if/when that gets added, #948).

See also killercup/cargo-edit#172

@epage epage added C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` Command-add labels May 11, 2022
@ehuss
Contributor

ehuss commented May 11, 2022

Do you have any thoughts on how this would work with an HTTP index? I can't think of a way to support that without some kind of registry API.

@epage
Contributor Author

epage commented May 11, 2022

Ever since I heard about the HTTP index, I've been concerned about this. I do not think that this and #10656 are isolated cases of needing to query what crate names exist; I think this is identifying a gap in the HTTP index. I suspect we'll need to have another file in the HTTP index that is the list of all of the crate names. Hopefully the frequency of new crate names is at a point where we still get caching benefits on that file. A flat list of crate names is 979 KB. We could either organize it into a trie in a single file or split it across multiple, well-defined files if we can't get partial updates of the file over HTTP.

@Eh2406
Contributor

Eh2406 commented May 11, 2022

A tree of files that are available in the index would also be important for index signing. With the hash information, it is far too big for crates.io to have a flat file. It is 100% something I think we should do. My feeling is that we should not hold up the performance benefits of HTTP indexes, for something we can add backwards compatibly.

@kornelski
Contributor

This doesn't have to be done client-side at use time. It can be done by crates.io at publication time.

At minimum, crates.io could detect which crates have "confusable" names and add some sort of warning metadata to their (sparse http) index files. Note that cached/stale registry data isn't an issue here, because a typosquatted crate is by definition published later than the popular crate it pretends to be.

@kornelski
Contributor

kornelski commented May 12, 2022

Levenshtein distance probably works for 90% of cases, but I expect there will be edge cases and complications. Having the ability to update the detection algorithm and add exceptions independently of Rust releases will be valuable, so doing this server-side on crates.io is better.

In terms of detection, you may also want to check variations with swapped words (`web-actix`), synonyms and grammatical forms (`logger` vs `logging`), neutral suffixes like `-rs`, etc.
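One way to catch the swapped-word and neutral-suffix cases is to compare names by a normalized key instead of the raw string. The sketch below is only an illustration: `confusable_key` is a hypothetical name, treating `_` like `-` and lowercasing are assumptions of this example, and synonyms or grammatical forms would need a word list that is not shown here.

```rust
/// Hypothetical normalization: two crate names with the same key are treated
/// as confusable. Splits on `-` and `_`, lowercases, drops the neutral `rs`
/// word, and sorts the remaining words so that word order does not matter.
fn confusable_key(name: &str) -> String {
    let mut words: Vec<String> = name
        .split(|c: char| c == '-' || c == '_')
        .filter(|w| !w.is_empty())
        .map(|w| w.to_ascii_lowercase())
        .filter(|w| w.as_str() != "rs")
        .collect();
    words.sort();
    // Note: a name consisting only of `rs` yields an empty key.
    words.join("-")
}

fn main() {
    for name in ["actix-web", "web-actix", "actix-web-rs"] {
        println!("{name} -> {}", confusable_key(name));
    }
}
```

All three names in `main` map to the same key, so a lookup table from key to crate names would group them together.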

It will be necessary to choose which crate is the good one and which is the bad one, because you don't want to sow uncertainty and advertise typosquatters when users ask for the good crate, i.e. `cargo add serde` shouldn't say "did you mean zerde?"

This choice is tricky. Usually the older crate is the right one, but there are exceptions such as the `git` vs `git2` crates (fortunately in this case it's not malicious).

It can't simply be the more popular crate, because download numbers can be quickly and easily inflated. crates.io could make faking downloads harder, but overall it's a losing battle, especially when we're dealing with determined malicious actors here.

So the overall choice could be a mix of the crate's age, the age of its owners' accounts, manual moderation overrides, or maybe something fancy like owner reputation based on a PageRank-like algorithm or the cargo-crev web of trust.
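To make the shape of that decision concrete, here is a small sketch that covers only two of those signals: crate age and a manual override list. Everything in it is hypothetical (the `CrateInfo` struct, the `preferred` and `should_warn` functions, the timestamps); account age and reputation are deliberately left out.

```rust
use std::collections::HashSet;

/// Hypothetical metadata for one crate; `created_at` is seconds since some epoch.
struct CrateInfo {
    name: String,
    created_at: u64,
}

/// Pick which of two confusable crates is treated as the "good" one:
/// a manual override wins, otherwise the older crate wins.
fn preferred<'a>(a: &'a CrateInfo, b: &'a CrateInfo, trusted: &HashSet<String>) -> &'a CrateInfo {
    match (trusted.contains(a.name.as_str()), trusted.contains(b.name.as_str())) {
        (true, false) => a,
        (false, true) => b,
        _ => {
            if a.created_at <= b.created_at {
                a
            } else {
                b
            }
        }
    }
}

/// Warn only when the requested crate is not the preferred one, so asking for
/// the good crate never advertises its look-alike.
fn should_warn(requested: &CrateInfo, other: &CrateInfo, trusted: &HashSet<String>) -> bool {
    preferred(requested, other, trusted).name != requested.name
}

fn main() {
    // Timestamps are made up for illustration.
    let serde = CrateInfo { name: "serde".to_string(), created_at: 100 };
    let zerde = CrateInfo { name: "zerde".to_string(), created_at: 200 };
    let trusted: HashSet<String> = HashSet::new();
    println!("adding serde warns: {}", should_warn(&serde, &zerde, &trusted));
    println!("adding zerde warns: {}", should_warn(&zerde, &serde, &trusted));
}
```

With an override entry for `git2`, the same function would prefer it over an older `git`, which is the kind of exception a pure age rule gets wrong.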

@epage
Contributor Author

epage commented May 12, 2022

> so having ability to update the detection algorithm and add exceptions independently of Rust releases will be valuable, so doing this server-side on crates.io is better.

I can see doing this, though it does increase the lift necessary to get this going, and I suspect we could initially get away with a simpler, client-side approach and then scale up to the server-side approach. The client-side approach is sufficient for registry squatting (#10656). Independent of squatting, the logic needed for doing this client-side can improve the `cargo search` results and help provide spelling corrections for `cargo add` and `cargo info` when a crate name doesn't exist at all. It could also act as a fallback when a registry doesn't support the full feature, which helps keep the minimum feature set for an independent registry down.
