Lockfile concerns #579
I really like the proposal in #422 to use a weighted SAT solver, assuming it runs fast enough on large dependency graphs. Then we could optimize for as much deduping as possible, or for using the latest packages as much as possible, and so on. It might end up being impractical, but modern SAT solvers run impressively fast and can find an optimal solution when a solution exists.
@ide the usual SAT solvers treat constraints as absolute: either two dependencies conflict or they don't. This is more or less Bundler's strategy. The strategy described above tries to unify as much as possible while still allowing duplicates to exist. It's also a back-of-the-envelope strategy (Cargo does something similar, but the ecosystem is much more reliant on semver and defaults to the …).
@ide your description implies that you know of existing SAT solvers that allow for weighted constraints. Links?
@ide I'd also like to add that the Cargo algorithm is a few hundred lines of code in total (including comments) and is performant enough in practice. The Bundler algorithm is a similar size, is written in Ruby, and is performant enough in practice (again: written in Ruby!). I don't think we need to wait to integrate an existing SAT solver library (unless someone is motivated to do the work and make sure it works reliably on the supported platforms) to do this improvement.
I'm excited about your proposal @wycats. Is this something you are planning to actively work on?
@wycats I was thinking about https://github.com/meteor/logic-solver (linked in that issue), which Meteor uses to compute its dependency graph. They've done the work of making it run on many systems via emscripten, I think. There's a method called …
@cpojer yes. I was working on it a bit last week, but decided to write up the proposal first to get feedback before sinking a bunch of time into it. If people are enthusiastic, I think there are a few steps we can take to land the change incrementally.
I don't really have anything useful to add and I'm probably not clever enough to help solve this, but I just wanted to mention that your description is fantastic and describes the problem very well 😄 For reference, here's how NuGet handles dependency resolution: https://docs.nuget.org/ndocs/consume-packages/dependency-resolution#dependency-resolution-rules. In general, it uses the lowest version that satisfies all version requirements (so in your example of …).
Just for reference, a basically-analogous algorithm to the one @wycats is describing is also used in Dart's pub, and my own take on it for Go is moving towards adoption as well.
@sdboyer <3 Some of my favorite conversations on this topic have been between the two of us, so I'm happy you recognized my description of the algorithm (it's a tricky algorithm to describe well).
Not sure if this quite matches the topic, but the "resolved" field doesn't seem appropriate for a lockfile. You do need to store the resolved version / file hash, but storing the absolute URL makes lockfiles generated on different machines (say, if a user is using a local Sinopia cache) have different resolved URLs depending on user settings. In the worst case, a lockfile could be generated entirely with localhost URLs, even if fetching from the public registry would produce the same result (same downloaded hash); then nobody would be able to install from the lockfile without also setting up a local cache server. The origin of the file is not important; what matters is that the hashes match.
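To illustrate that point, here is a minimal sketch (my own, with a made-up lockfile entry shape, not Yarn's actual format or code) of verifying a fetched tarball by hash alone, treating the resolved URL as nothing more than a hint:

```ts
import { createHash } from "crypto";

// Hypothetical lockfile entry: the integrity hash is authoritative; the
// resolved URL only records where this machine happened to fetch it from.
interface LockEntry {
  version: string;
  integrity: string;   // e.g. "sha1-<hex digest>", format assumed for this sketch
  resolved?: string;   // machine-specific, safe to ignore when verifying
}

// Verify a downloaded tarball against the lockfile, ignoring its origin.
function verifyTarball(entry: LockEntry, tarball: Buffer): boolean {
  const [algorithm, expected] = entry.integrity.split("-");
  const actual = createHash(algorithm).update(tarball).digest("hex");
  return actual === expected;
}
```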
Just saw this. I'm doing some work to implement a basic version of #422 that uses the aforementioned logic solver by Meteor. It seems like both your proposed solution and what I'm implementing might be trying to solve the same problem.
Possibly unimportant question: the algorithm will always output a single package that satisfies all constraints, but what is the advantage of having the graph of packages on top of this? At that point, couldn't you just have the list of installed exact packages? My understanding of the lockfile is just that it's a file recording the packages you already have installed, to compare with versions or ranges in the future, so definitely correct me if I'm wrong here.
Sorry, just to clarify, by "package duplication", do you just mean multiple versions of the same package?
Cool! A few questions off the top of my head. It seems like this could potentially result in a "non-optimal" configuration. In large repos you could have a lot of "valid" configurations, due to things like transitive dependencies or overlapping/conflicting dependencies at various parts of the dependency graph, so you could end up in a situation where you've optimistically chosen the first dependency at its "most recent" version, but multiple packages further down the line suffer as a result. A simple example would be something where you have three packages:
The top-level package.json requires … With the algorithm described above, it'd look at …

Note that the example is very contrived, but hopefully it gets my question across. You can extend this example with more packages with interwoven dependencies to simulate the backtracking checks. The idea here is that packages wouldn't ever consider backtracking, because they're already in a "valid" state: the dependencies are technically already satisfied by the existing set of activated packages, they just haven't explored the other options.

I'll talk about the approach that @yunxing and I discussed offline and was going forward with for #422; I think both approaches are trying to solve the same underlying problem. Note, though, that this was intended to only be applied during the …

With the SAT solver approach, the plan for a basic version is to model package version resolution as a logic problem and use the logic-solver to handle the actual solution-finding. If you take a look at the API for modeling the problem, you should be able to model it as a series of … The algorithm for setting up the logic problem would be as follows:
Theoretically, setting this up and telling the solver to solve the problem should result in a set of valid "choices". At this point, you can define the "best choice" heuristic however you want, by adding weights (as @ide mentions above) and running another solver on this smaller set of choices (for example, weighting more recent versions higher).

The nice part is that all of the complexity of generating and traversing the dependency graph no longer needs to be handled by us; the logic solver will just "handle" all of that for us. Because we've already fetched all of the information anyway, we just need to keep track of the final list of packages and their versions, and use the manifests we already fetched to install only those packages.

Note that I'm still working on this, so I don't know the performance implications yet. It looks like Meteor is using this within their package manager, with more complex logic. cc @yunxing, @jordwalke, correct me if I'm wrong here... Props also go to them (and maybe others that I don't know) for thinking of using the logic solver!
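For concreteness, here is a minimal sketch of what that kind of encoding might look like with meteor's logic-solver (the package names, versions, and dependency edges are invented for illustration, and the weighting/pricing pass described above is left out):

```ts
// One boolean variable per concrete package version, e.g. "once@1.3.3".
const Logic = require("logic-solver"); // assuming no bundled type definitions

const solver = new Logic.Solver();

// At most one version of each package name may be chosen
// (relaxing this constraint is one way to allow duplicates).
solver.require(Logic.atMostOne("once@1.3.0", "once@1.3.3"));
solver.require(Logic.atMostOne("rimraf@2.2.8", "rimraf@2.5.4"));

// The root's own dependencies must be satisfied by some acceptable version.
solver.require(Logic.or("once@1.3.0", "once@1.3.3"));
solver.require(Logic.or("rimraf@2.2.8", "rimraf@2.5.4"));

// Each version's own dependencies become implications: picking
// rimraf@2.5.4 forces some acceptable version of once to be picked too.
solver.require(Logic.implies("rimraf@2.5.4", Logic.or("once@1.3.0", "once@1.3.3")));

const solution = solver.solve();
if (solution) {
  console.log(solution.getTrueVars()); // one valid assignment of package versions
}
```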
Love this! Thanks for thinking about the problem. The proposed algorithm looks good to me; my only concern is the performance of the algorithm with thousands of dependencies (this is NP-complete!). But even then we can add a separate command to generate the lockfile first (so all subsequent installations can just read the lockfile). @dxu has already started working on the SAT solver approach. It may not be as flexible as the proposed algorithm, but it could be a great fit for "--flat" mode. I think we should share our plans here to avoid duplicated effort.
For reference, Bundler and CocoaPods’ resolver implemented in Ruby lives here.
@dxu these are some great thoughts! Lots to chew on. Some quick responses to a couple of points:
So yeah, I'm gonna make that argument 😸 But specifically in reference to having, or not having, "more out of date packages": the algorithm @wycats has described should be able to guarantee that, if a solution is found, the most recent version of each package is used that satisfies all mutual constraints. (Assuming highlander/no duplication; I haven't yet absorbed the implications of allowing duplicates into my thinking.)

If the goal were to find the solution with the "minimal" distance from selected to newest versions for the entire depgraph, that would require an extra layer where some sort of cost function (probably having a similar shape to meteor's version pricer) defines those distances; then we look for multiple solutions and pick the one with the minimum total cost. (Or some entirely different approach.) Either way, though, I think such an approach would necessarily have to explore the entire combinatorial version space to guarantee it had found the "optimal distance". That would eliminate one of the main escape valves a solver can use to minimize running time: we only have to find one solution, not all possible ones. That makes it worth making strategic choices about the order in which deps are visited, choices that are likely to (further) minimize the amount of the space we have to explore.

In a broader sense, though, I'm sorta on the fence about whether it's a Good Thing™ to keep as close to the latest release as possible. I mean, as a goal within dev groups, I think that's probably reasonable, and it's certainly the way most teams I've been on have operated. But that doesn't necessarily mean tooling should operate in the same way. Absent user intent to the contrary, it might be better to default in a conservative direction, one that instead focuses on minimizing change, and thereby minimizing risk. (That's the motivation behind preferred versions, which I'm still experimenting with.)
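To make the "distance" idea concrete, here is one hypothetical shape such a cost function could take (my own sketch, not meteor's actual version pricer): each selected version is priced by how many releases it sits behind the newest available version of its package, and a solver layer would then prefer the solution with the lowest total cost.

```ts
// Hypothetical pricing: how many releases behind "newest" each pick is.
// `available` maps a package name to its known versions, sorted oldest -> newest.
function solutionCost(
  picks: Map<string, string>,        // package name -> selected version
  available: Map<string, string[]>   // package name -> all known versions
): number {
  let cost = 0;
  for (const [name, version] of picks) {
    const versions = available.get(name) ?? [];
    const index = versions.indexOf(version);
    // Distance 0 means the newest version was picked; larger is more out of date.
    cost += index >= 0 ? versions.length - 1 - index : 0;
  }
  return cost;
}
```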
For what it's worth, I think this is extremely important. I have also found empirically that the longer it takes to find a solution, the more likely it is that the solution is conceptually wrong ("I found a version of Rails that is compatible with this version of ActiveMerchant. It's Rails 0.0.1, which didn't depend on ActiveSupport!")
Bundler and Cargo both implement "conservative updates", which optimizes for minimal change rather than newest versions. I strongly prefer that in the absence of a signal to the contrary (…).
Yeah, I should be mostly free. Perhaps we can coordinate through discord/email on a time that works for all of us. Quick thoughts:
This is true. With the logic solver approach, the heuristics/pricing becomes a more important problem than the solution generation, and most likely a problem that is easier to implement improperly. Probably the main advantage of the solver approach is that it offloads the complexity of solving the dependency graph to the SAT solver and lets us focus on the heuristics, which allows for more flexibility compared to a deterministic algorithm. Practically, this might be a very rare need. I'm guessing performance with the algorithm probably isn't an issue for 99% of cases, given that it also powers Bundler and Cargo.
A small update in case it's helpful for our upcoming conversation, or for anyone interested: I implemented a quick-and-dirty POC for the SAT solver, just for solving all combinations (without the pricing heuristics). You can find details on #422.
FWIW, the semver solver I prototyped on top of Z3 a while ago, when this came up in discussion for Cargo, allows for this; in general Z3 permits weighting and optimal-solving by user-defined metrics, as of at least whenever the "z3opt" branch was integrated. It's definitely on trunk these days. Other MaxSAT solvers include ahmaxsat, toysat, qmaxsat, maxino, maxhs, openwbo ...
For reference: rust-lang/cargo#2064
Relevant to this issue: https://research.swtch.com/version-sat
We face that problem as well, and wrote a post-process tool to identify and resolve duplicated packages: https://www.npmjs.com/package/yarn-tools.
The unreliable deduping is a real problem for us. We use a number of packages that misbehave if multiple versions run at the same time. Has there been any progress?
Because Yarn is silly and frequently adds duplicate entries for things into yarn.lock when it could satisfy everything the app needs with one version: whenever you make changes to package.json, run `yarn upgrade-and-dedupe` to make sure that yarn.lock doesn't add needless duplicates of things. If Yarn ever fixes this issue (yarnpkg/yarn#579), then this can go away.
Just wanted to keep people from thinking this is stale -- this is still a concern today!
Whoa! Notifications about a 2016 comment === blast from the past!
I wanted to open up this issue as a discussion item. I don't think there's anything blockery about this concern (more below), but I do think that the structure of the lockfile will become problematic in the long term.
First, here's a quick example of one of the consequences of the problem, from Yarn's own lockfile: it contains two dependencies on the same package (`wordwrap ~0.0.2` and `wordwrap =0.0.2`). Both of them are satisfied by `wordwrap 0.0.2`, but instead we get two different packages. This doesn't always happen; there is some code that tries to "dedup" packages.

But it's hardly the only example: the `rimraf ^2.2.8` dependency could unify with `rimraf ~2.2.6`, but it doesn't. Worse, several versions of `recast` could all unify to `recast =0.10.33`, but instead we get two copies of recast. Another example: several `once` patterns can all unify to `once 1.3.3`, but we get two copies.

Hopefully I've made my point 😉
The current yarn strategy successfully unifies dependencies opportunistically, but doesn't have any backtracking mechanism, which is necessary for the unification opportunities described above. It also doesn't provide any way for resolvers to participate in the necessary backtracking. This means that even if we successfully unify the graph we have now via post-processing, there are likely even better opportunities that we'll miss.
The major change in the lockfile format I propose is that the lockfile represents a graph of packages, where the entries are exact package identifiers. This is a change from the current format, where the lockfile is a list of patterns and associated packages.
The reason I think the current status isn't an urgent priority is that a list of seen patterns and associated packages is still deterministic, which is the number one priority of the lockfile. Lower priorities, like conservative updating (`yarn update some-pkg` changes the minimal necessary subgraph), depend on a more traditional lockfile, but we can work on those over time.

Ultimately, I think a simplified version of the Cargo strategy (the Cargo strategy supports a nice optional dependency mechanism called "features", which we don't support and which complicates the algorithm somewhat) is the right way to go. The Cargo strategy is a refined version of the Bundler strategy, which was also used (and further refined) by CocoaPods. The Cargo strategy adds support for package duplication, which is necessary (and desired) in the Node ecosystem.
This would mean version-bumping the Yarn lockfile, but the good news is that Yarn's lockfile is already versioned, so this should be fine.
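One way to picture the proposed change, expressed as illustrative TypeScript types rather than a concrete file format (this is only my reading of the proposal, not a specification):

```ts
// Today (roughly): the lockfile is keyed by the patterns that were seen,
// and each pattern records whatever package it happened to resolve to.
type CurrentLockfile = Map<string /* pattern, e.g. "once@~1.3.0" */, {
  version: string;
  resolved: string;
}>;

// Proposed (roughly): the lockfile is a graph keyed by exact package
// identifiers; patterns map onto nodes, and nodes record their own edges.
type ProposedLockfile = {
  patterns: Map<string /* "once@~1.3.0" */, string /* "once@1.3.3" */>;
  packages: Map<string /* "once@1.3.3" */, {
    resolved: string;
    dependencies: string[]; // exact identifiers of this package's deps
  }>;
};
```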
The Rough Strategy
Here's a rough description of the algorithm (it's slightly modified from Cargo's and Bundler's algorithm because of the amount of duplication expected as a matter of course in the npm ecosystem).
- Dependency resolution always starts with a top-level `package.json`, which has a list of dependencies.
- Each dependency (e.g. `once@~1.3.0`) expands to a set of possible packages, determined by querying all of the available registries. If a dependency describes a non-registry source (a git repo or tarball, for example), the dependency expands to a set of just one package (in Bundler, this is called a "pinned" dependency).
- The first step in the algorithm adds all of the dependencies in the top-level `package.json` to a `remaining dependencies` list as tuples of `(dependency, candidate-list)`.
- Initialize an `all packages` Map whose keys are package names and whose values are Sets of all available packages with that name.
- Repeatedly activate a dependency by popping a dependency from the list of remaining dependencies: restrict the Set in `all packages` with the name of the dependency to the set of packages that match the dependency.
- If a dependency cannot be unified with an existing instance of the same package, consider backtracking if the part of `all packages` still available contains packages that could satisfy the dependency.

Flat

If the `--flat` flag was passed to `yarn`, always backtrack if a dependency cannot be satisfied with an existing package with the same name in the set of activated packages. This will effectively turn the algorithm into Bundler's, which guarantees the "highlander rule" ("there can be only one" package per name in the graph).

Keep going until the set of remaining dependencies is empty. At this point, a complete graph of unique packages will exist, ready to be written out to a lockfile.
A note about backtracking: because dependency graphs with thousands of packages are common, it is not advisable to use recursion when implementing this algorithm in a language with limited stack depth. Instead, we would maintain a side stack (see the same algorithm in Cargo) that allows us to implement the recursive algorithm without blowing the stack.
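To make that shape concrete, here is a compressed TypeScript sketch of the strategy as I read it (not Yarn's or Cargo's actual code): an explicit choice stack records untried candidates and state snapshots so the algorithm can backtrack without recursion. The `Dep`/`Pkg` shapes and the in-memory `registry` map are simplifications, and the `--flat`/unification refinements are only noted in comments.

```ts
import * as semver from "semver";

interface Dep { name: string; range: string; }
interface Pkg { name: string; version: string; deps: Dep[]; }

// `registry` maps a package name to its known versions, sorted newest first.
// Returns the activated packages, or null if resolution fails outright.
function resolve(root: Dep[], registry: Map<string, Pkg[]>): Pkg[] | null {
  interface Frame {
    untried: Pkg[];     // candidate versions not yet tried for this dependency
    remaining: Dep[];   // snapshot of the work list taken before the choice
    activated: Pkg[];   // snapshot of the activations taken before the choice
  }
  const choices: Frame[] = [];  // explicit side stack instead of recursion
  let remaining: Dep[] = [...root];
  let activated: Pkg[] = [];

  while (true) {
    const dep = remaining.shift();
    if (!dep) return activated;  // work list drained: we have a full graph

    // First, try to unify with an already-activated copy of the package.
    if (activated.some(p => p.name === dep.name && semver.satisfies(p.version, dep.range))) {
      continue;
    }

    // Otherwise, expand the dependency into its candidate packages.
    let frame: Frame = {
      untried: (registry.get(dep.name) ?? []).filter(p => semver.satisfies(p.version, dep.range)),
      remaining: [...remaining],
      activated: [...activated],
    };

    // If this dependency has no viable candidates, unwind to the most recent
    // choice point that still has untried candidates. (A fuller version would
    // also backtrack here to unify rather than duplicate, and would always do
    // so when --flat is in effect.)
    while (frame.untried.length === 0) {
      const prev = choices.pop();
      if (!prev) return null;  // no choice points left: resolution fails
      frame = prev;
    }

    // Pick the newest untried candidate, remember the rest, restore state.
    const [choice, ...rest] = frame.untried;
    remaining = [...frame.remaining];
    activated = [...frame.activated];
    choices.push({ untried: rest, remaining: frame.remaining, activated: frame.activated });
    activated.push(choice);
    remaining.push(...choice.deps);
  }
}
```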