feat(plugin-essentials): yarn dedupe #1558

paul-soporan · 2020-07-08T12:15:49Z

What's the problem this PR addresses?

To quote from the command details:

Duplicates are defined as descriptors with overlapping ranges being resolved and locked to different locators. They are a natural consequence of Yarn's deterministic installs, but they can sometimes pile up and unnecessarily increase the size of your project.

How did you fix it?

This PR adds a new yarn deduplicate command that does what its description says.

It's inspired by (but the implementation is different):

The current algorithm currently takes 30 seconds to run on our repo (after a lot of optimizations, it was 10 minutes before). I couldn't think of a better one that's also fast and produces results while resolving descriptors (so that the output is populated progressively while the deduplication runs; it would look weird if it took 30 seconds to resolve everything and then we would show all 200 results at once).

~~The current algorithm prefers reusing packages with the highest usage count, rather than with the highest version, which means that packages can be both upgraded and downgraded.~~
After discussing with the rest of the team, we came to the conclusion that the best strategy would be to sort the locators by version, rather than by usage count. This means that packages can only be upgraded, never downgraded.

Note: I haven't added tests yet, but I'm working on it.

Checklist

I have read the Contributing Guide.

I have set the packages that need to be released for my changes to be effective.

I have verified that all automated PR checks pass.

7rulnik · 2020-07-08T12:45:29Z

One question: I thought that it's yarn's v1 bug and it should deduplicate deps by default. Am I missing something?

paul-soporan · 2020-07-08T13:18:37Z

@7rulnik Yarn doesn't deduplicate dependencies by default, otherwise installs wouldn't be deterministic and the lockfile would be useless.
What it actually does is that it tries to not duplicate dependencies in the first place.

Example:

Lockfile:

foo@^2.3.4:
  resolution: foo@2.3.4

Now if I run yarn add foo@*, it will reuse foo@2.3.4, even if the latest foo is actually 2.10.14, thus preventing unnecessary duplication.

The problem (not actually a problem, this is how it's intended to be) is that Yarn doesn't unlock dependencies that are already locked in the lockfile.

This means that with the following lockfile:

foo@^2.3.4:
  resolution: foo@2.3.4

If I run yarn add foo@2.10.14, it will install foo@2.10.14 because the existing resolutions don't satisfy the range 2.10.14. This, however, leads to (sometimes) unwanted duplication, since now I've got the following lockfile:

foo@^2.3.4:
  resolution: foo@2.3.4

foo@2.10.14:
  resolution: foo@2.10.14

This lockfile has 2 separate resolutions for the 2 foo descriptors, even though they have overlapping ranges and the lockfile could be simplified to the following:

"foo@^2.3.4, foo@2.10.14":
  resolution: foo@2.10.14

This is, essentially, what the yarn deduplicate command does.

7rulnik · 2020-07-08T14:39:12Z

Got it, thanks!

I thought that it's a bug because yarn v1 did not have a deduplicate command. I think we should add your explanation to docs.

eps1lon · 2020-07-09T15:45:55Z

Great to see some improving on the initial work! So the only problem with https://github.com/eps1lon/yarn-plugin-deduplicate was that it is slow? I started working on it again after the 2.1 announcement and it should be working with 2.1 now after yarnplugins/yarn-plugin-deduplicate#6 is merged.

paul-soporan · 2020-07-09T18:39:35Z

@eps1lon I'm sorry for how I worded that, it's not that it was slow.

I wanted to open a PR in this repo that deduplicates some dependencies (because they started to add up in size) - #1556 - and I tried using your plugin by copying your implementation inside a new command in our repo (because we have some limitations that prevent us from installing third-party plugins in this repo), just to use it for deduplication purposes. Unfortunately, when I tested it, I first encountered a bunch of TypeError: Cannot read property 'reference' of undefined because of unsafe non-null assertions. When I fixed those, the plugin produced no results. I admit that I haven't tried debugging it further, probably also because I hadn't seen any updates to the plugin since September and because I've seen your message on Discord that I interpreted as you not wishing to continue developing this plugin - looking back, there's a big chance I might have misinterpreted that message and I'm really sorry if that's the case.
After that, I decided to write a quick (and really inefficient - that's what I was talking about when I said it was slow) implementation my own way, without using any of your code (looking at it, the locatorsByIdent part seems to be the only thing that's overlapping). After that implementation proved to be a success, I talked with the rest of the team and they agreed that this command could live in the berry repo if it's well tested. After that, I greatly improved the performance of my original implementation (I still have to tweak the current one, there are quite a few unneeded parts remaining) and opened this PR, and that's how we got here. 😅

I realize that we are now in a really unfortunate situation (that's mostly my fault), now that you've decided to revive yarn-plugin-deduplicate. I'm open to merging my work with yours if you wish, but I think that it would be best if this command lived inside the berry repo as a built-in command - seeing how many people would like to use it.

Again, I'm sorry for how this played out. I'm open to suggestions about what we should with this, as long as they benefit the community. What do you think about this? How should we proceed?

eps1lon · 2020-07-09T19:28:42Z

Unfortunately, when I tested it, I first encountered a bunch of TypeError: Cannot read property 'reference' of undefined because of unsafe non-null assertions.

Should be fixed now if you install from master. It wasn't about non-null assertions. They were only the symptoms of an underlying problem that was fixed with 03a68f7 (#6).

I'm open to merging my work with yours if you wish, but I think that it would be best if this command lived inside the berry repo as a built-in command - seeing how many people would like to use it.

I'd prefer that as well so that we can test integration.

Again, I'm sorry for how this played out. I'm open to suggestions about what we should with this, as long as they benefit the community. What do you think about this? How should we proceed?

All good. Having this live in an official repo was always my preference.

It being slow wouldn't have bothered me. It wasn't intended to be fast. Build something working first and then make it fast!

Bnaya · 2020-08-01T15:33:40Z

Will it support the strategy flag and selective dedup that yarn-deduplicate supports?

paul-soporan · 2020-08-06T20:37:41Z

@Bnaya It already supports selective dedupe (via glob pattern rest args). I was initially thinking of leaving strategies out of the first iteration, but now that there's demand I will implement it in this PR.

Note: I'll probably get back to this PR in a few days / next week.

paul-soporan · 2020-08-19T17:29:43Z

So... the PR is now finished. I've decided to leave out the fewer strategy for now, so that we can merge this PR sooner (hopefully it will be a part of 2.2.0, which is just around the corner).

More details about the yarn dedupe command:

It's part of plugin-essentials, which means that now it's a builtin command (no plugins needed).
I've decided to rename it to yarn dedupe from yarn deduplicate, as it's shorter to type and it doesn't conflict with the yarn-plugin-deduplicate command name. I also think I like this name more.
It only supports the highest strategy for now. Quoting from the docs:

Reuses (where possible) the locators with the highest versions. This means that dependencies can only be upgraded, never downgraded. It's also guaranteed that it never takes more than a single pass to dedupe the entire dependency tree.

It's not tied to semver, it can dedupe any kind of packages (e.g. Python packages which use the PEP440 versioning format).
It has a -c,--check flag that can be used on CI (doesn't persist the changes on disk, only exits with exit code 1 when there are packages that can be deduped).
It supports selective dedupe via glob pattern rest args (yarn dedupe lodash 'react-*' '@babel/*')
It supports a --json flag that makes the output be NDJSON.
It comes with an in-depth explanation in the docs (https://deploy-preview-1558--yarn2.netlify.app/cli/dedupe#in-depth-explanation).
It's fast enough because of concurrency. This is important because we have to send requests to the registry to get the candidates for each package (each package that should be checked - the number is dependent on the selective dedupe), since the command is not tied to semver. It takes about 8 seconds to execute on our repo without selective dedupe.
It comes with a footer that shows how many packages can be deduped (maybe TODO in the future: show how much space can be saved?).
I've added in-depth tests for everything.

Now I'm going to summon @arcanis to review this PR and hopefully yarn dedupe can be shipped in 2.2.0.

arcanis · 2020-08-19T17:43:13Z

Great work! I'll look in details tomorrow, but I have some questions based on your abstract:

It's fast enough because of concurrency. This is important because we have to send requests to the registry to get the candidates for each package (each package that should be checked - the number is dependent on the selective dedupe), since the command is not tied to semver. It takes about 8 seconds to execute on our repo without selective dedupe.

Why do we need to make network requests when doing the highest strategy (which is the only one supported atm)? Will it upgrade the packages on top of deduping them? Shouldn't it just update the resolutions to use whatever is the highest in the lockfile? Since those packages have already been resolved, no further request should be needed.

It's not tied to semver, it can dedupe any kind of packages (e.g. Python packages which use the PEP440 versioning format).

I'm not familiar with PEP440 - how does it support non-semver ranges (in particular with the highest strategy, which implies sorting the versions as much as possible).

It has a -c,--check flag that can be used on CI (doesn't persist the changes on disk, only exits with exit code 1 when there are packages that can be deduped).

I'm not necessarily against using -c,--check, but we'll have to add to the v3 that we need to normalize yarn version check, yarn dedupe -c, and yarn constraints.

paul-soporan · 2020-08-19T18:48:56Z

Why do we need to make network requests when doing the highest strategy (which is the only one supported atm)? Will it upgrade the packages on top of deduping them? Shouldn't it just update the resolutions to use whatever is the highest in the lockfile? Since those packages have already been resolved, no further request should be needed.

Nope, it doesn't update the packages, it just uses whatever is the highest in the lockfile. The problem is that there's no clear definition of "highest". For npm semver ranges, the highest is the highest version that satisfies the range, but what is the highest for npm tags? Let's take the following example:

Lockfile:

foo@npm:next:
  resolution: foo@npm:1.0.0

foo@npm:^1.1.0:
  resolution: foo@npm:1.1.0

Let's now say that the next tag points to foo@npm:1.1.0 at the moment I want to run yarn dedupe. Now how do I know what to do with foo@npm:next? There's nothing in the lockfile telling me that foo@npm:1.1.0 is a candidate of foo@npm:next. The only way to solve this is to call resolver.getCandidates on foo@npm:next.

There's also the problem of non-semver ranges, which we wouldn't be able to deduplicate unless we use the resolver, which already has a definition of "highest" - the first candidate.

I'm not familiar with PEP440 - how does it support non-semver ranges (in particular with the highest strategy, which implies sorting the versions as much as possible).

PEP440 is just a versioning format used in the Python ecosystem. It's similar to semver, but many things don't overlap (examples of valid PEP440 versions that aren't valid semver versions: 0.9, 2012.4, 1!1.0, 1.0rc1). Hardcoding semver here would be wrong (especially because we already have the resolver system which is versioning-scheme agnostic), as Yarn isn't just a Node package manager anymore.

I'm not necessarily against using -c,--check, but we'll have to add to the v3 that we need to normalize yarn version check, yarn dedupe -c, and yarn constraints.

🤔 It makes sense for yarn version check to be a separate command from yarn version, as they have nothing in common apart from being part of the same plugin. Regarding yarn constraints, I've typed yarn constraints check more times than I'd like to admit, so... I'm not sure what's the best thing to do here. yarn constraints and yarn constraints --fix don't really do the same thing - the first one produces warnings while the second one just fixes everything silently. The way this PR is currently implemented, both yarn dedupe and yarn dedupe --check output what can be deduplicated, the difference being that the former runs an install after, while the latter exits the process with a specific code.

paul-soporan · 2020-08-20T23:03:09Z

So, after more discussions, we've decided to drop both the network access requirement and the reliance on semver by refactoring the core resolvers to have a new getSatisfying method (that we might also use for running yarn add without relying on the network in the future) that statically (more or less) computes which locator references satisfy a descriptor range. This means that the NpmSemverResolver can now use the semver package internally, which reduces the execution time by about 95%.
This means that the yarn dedupe command will be fully universal and also very fast.

The remaining problems (?):

some resolvers (e.g. ExecResolver, PatchResolver, FileResolver) currently do their entire dynamic candidate resolving even with getSatisfying. I'm not sure how we can avoid that, as we can't bypass the FS access requirement and only do static resolving. This isn't a problem in practice, as it doesn't make any sense to dedupe a patch: or an exec: dependency, but it's still an interesting problem to solve.
"foo": "npm:bar@*" dependencies (aka remapped npm dependencies) can't be currently deduplicated, as the NpmRemapResolver currently takes ownership of the final resolution, rather than proxying it through a resolutionDependency. This problem will be solved in a future PR and we will also add a test to make sure that deduping works in that case.

Unless there are any design problems remaining, I think this PR can be reviewed and I will do my best to address any remaining small problems so that we can hopefully ship this in 2.2.0.

packages/plugin-essentials/sources/commands/dedupe.ts

packages/plugin-exec/sources/ExecResolver.ts

packages/plugin-essentials/sources/commands/dedupe.ts

Co-authored-by: Maël Nison <nison.mael@gmail.com>

packages/plugin-npm/sources/NpmSemverResolver.ts

Co-authored-by: Kristoffer K. <merceyz@users.noreply.github.com>

packages/plugin-npm/sources/NpmSemverResolver.ts

Co-authored-by: Kristoffer K. <merceyz@users.noreply.github.com>

arcanis · 2020-08-23T16:11:09Z

feat(plugin-essentials): yarn deduplicate

79f42d3

paul-soporan requested a review from arcanis as a code owner July 8, 2020 12:15

refactor: improve implementation

f1ce7d2

paul-soporan marked this pull request as draft July 9, 2020 10:21

7rulnik mentioned this pull request Jul 9, 2020

Document why this is necessary scinos/yarn-deduplicate#56

Closed

scinos mentioned this pull request Jul 9, 2020

Support yarn berry scinos/yarn-deduplicate#64

Closed

SimenB mentioned this pull request Aug 10, 2020

Loosen babel-plugin-jest-hoist babel/istanbul package deps jestjs/jest#10389

Closed

paul-soporan added 2 commits August 17, 2020 03:09

Merge branch 'master' into paul/feat/deduplicate

05781f1

refactor: shorten command name to yarn dedupe

4303e0d

paul-soporan changed the title ~~feat(plugin-essentials): yarn deduplicate~~ feat(plugin-essentials): yarn dedupe Aug 17, 2020

paul-soporan added 8 commits August 17, 2020 15:55

refactor: rewrite algorithm

408acbf

refactor: more refactoring

7063670

refactor: even more refactoring

d5a2291

refactor: more refactoring + improved docs

925112a

docs: better dedupe docs

ebfd0c1

test: add tests

e8df62b

test: add tests for selective dedupe

658055c

Merge remote-tracking branch 'yarnpkg/master' into paul/feat/deduplicate

ea4cc55

paul-soporan marked this pull request as ready for review August 19, 2020 16:58

refactor: fix cyclic dependency

cc91662

paul-soporan marked this pull request as ready for review August 20, 2020 22:47

arcanis reviewed Aug 21, 2020

View reviewed changes

paul-soporan and others added 11 commits August 22, 2020 23:17

Update packages/plugin-essentials/sources/commands/dedupe.ts

51b342e

Co-authored-by: Maël Nison <nison.mael@gmail.com>

Update packages/plugin-essentials/sources/commands/dedupe.ts

4379dbe

Co-authored-by: Maël Nison <nison.mael@gmail.com>

refactor: dedupeSkip -> null

d637d8e

refactor: skip -> assertion

1a705f5

refactor: add note about virtual packages

cab447e

refactor: skip ->assertion failed

d6b9a58

Merge branch 'master' into paul/feat/deduplicate

58ac353

refactor: disable getSatisfying for some resolvers

9eb67d8

refactor: remove resolutionDependencies

aaf702f

chore: lint

2465986

refactor: skip -> assertion failed

1fc26fa

paul-soporan requested a review from arcanis August 23, 2020 10:59

merceyz reviewed Aug 23, 2020

View reviewed changes

packages/plugin-npm/sources/NpmSemverResolver.ts Outdated Show resolved Hide resolved

merceyz reviewed Aug 23, 2020

View reviewed changes

packages/plugin-npm/sources/NpmSemverResolver.ts Outdated Show resolved Hide resolved

Apply suggestions from code review

d05b284

Co-authored-by: Kristoffer K. <merceyz@users.noreply.github.com>

merceyz reviewed Aug 23, 2020

View reviewed changes

packages/plugin-npm/sources/NpmSemverResolver.ts Outdated Show resolved Hide resolved

refactor: use semver.SemVer instances

50f0c51

Co-authored-by: Kristoffer K. <merceyz@users.noreply.github.com>

arcanis merged commit d4ec648 into master Aug 23, 2020

arcanis deleted the paul/feat/deduplicate branch August 23, 2020 16:10

merceyz mentioned this pull request Aug 23, 2020

perf(plugin-npm): use cached range in NpmSemverResolver.getSatisfying #1744

Merged

3 tasks

JLHwung mentioned this pull request Aug 25, 2020

Use Yarn 2 babel/babel#11962

Merged

ylemkimon mentioned this pull request Sep 7, 2020

fix(npm): yarnDedupeHighest for Yarn 2 renovatebot/renovate#7200

Merged

This was referenced May 10, 2021

应用级 Monorepo 优化方案 worldzhao/blog#9

Open

Yarn duplicate及解决方案 worldzhao/blog#10

Open

layershifter mentioned this pull request May 12, 2021

RFC: Use ranges to define versions of dependencies for converged components microsoft/fluentui#17499

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(plugin-essentials): yarn dedupe #1558

feat(plugin-essentials): yarn dedupe #1558

paul-soporan commented Jul 8, 2020 •

edited

Loading

7rulnik commented Jul 8, 2020

paul-soporan commented Jul 8, 2020

7rulnik commented Jul 8, 2020

eps1lon commented Jul 9, 2020

paul-soporan commented Jul 9, 2020

eps1lon commented Jul 9, 2020 •

edited

Loading

Bnaya commented Aug 1, 2020

paul-soporan commented Aug 6, 2020

paul-soporan commented Aug 19, 2020

arcanis commented Aug 19, 2020

paul-soporan commented Aug 19, 2020 •

edited

Loading

paul-soporan commented Aug 20, 2020

arcanis commented Aug 23, 2020 •

edited

Loading

feat(plugin-essentials): yarn dedupe #1558

feat(plugin-essentials): yarn dedupe #1558

Conversation

paul-soporan commented Jul 8, 2020 • edited Loading

7rulnik commented Jul 8, 2020

paul-soporan commented Jul 8, 2020

7rulnik commented Jul 8, 2020

eps1lon commented Jul 9, 2020

paul-soporan commented Jul 9, 2020

eps1lon commented Jul 9, 2020 • edited Loading

Bnaya commented Aug 1, 2020

paul-soporan commented Aug 6, 2020

paul-soporan commented Aug 19, 2020

arcanis commented Aug 19, 2020

paul-soporan commented Aug 19, 2020 • edited Loading

paul-soporan commented Aug 20, 2020

arcanis commented Aug 23, 2020 • edited Loading

paul-soporan commented Jul 8, 2020 •

edited

Loading

eps1lon commented Jul 9, 2020 •

edited

Loading

paul-soporan commented Aug 19, 2020 •

edited

Loading

arcanis commented Aug 23, 2020 •

edited

Loading