Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify dependency fetching #197

Merged
merged 2 commits into from
Aug 30, 2018
Merged

Unify dependency fetching #197

merged 2 commits into from
Aug 30, 2018

Conversation

schomatis
Copy link
Collaborator

@schomatis schomatis commented Aug 8, 2018

gxutil: unify dependency fetching in `InstallDeps`

The previous implementation spawned multiple goroutines to fetch packages but
for the dependencies of one package at a time. This meant that the last
goroutine that took longer to finish (probably because it was fetching the
biggest dependency) held the start of more goroutines to fetch the dependencies
of the packages already fetched.

Unify all the dependencies in a single queue (abstracted in the new
`DependencyQueue` structure) so as soon as new dependencies of a fetched package
are available new goroutines can be spawned to fetch them.

Raise the number of parallel goroutines fetching packages (now `maxGoroutines`)
from 2 to 20 now that this bigger number can be leveraged from this unification.

Remove the retry logic around `GetPackageTo` (as that function already contains
code with equivalent functionality) to simplify the fetching part of the
function.

Split `InstallDeps`

Split the `InstallDeps` function in two functions which represent clear separate
processes: fetching packages (`fetchDependencies`) and running the `post-
install` hook on them (`dependenciesPostInstall`).

The `fetchDependencies` function is the previous `installDeps` function (now
renamed) without the `maybeRunPostInstall` call, which was encapsulated in its
own `dependenciesPostInstall` function that replicates the same DFS traversal of
the dependency DAG to run the hook.

This commit permits a more clear understanding of the process of installing
dependencies which clears the road for the unification of the fetching functions
into one.

There is a count discrepancy bug in the `ProgMeter` introduced in this commit
but it will be fixed in the next commit when `fetchDependencies` is refactored
to eliminate the recursive implementation.

Closes #195.

@schomatis
Copy link
Collaborator Author

Unrelated test error fixed in #198, needs to rebase after merge.

@schomatis
Copy link
Collaborator Author

I have to finish the documentation and fix some minor details but this change already shows a considerable performance improvement. Before finishing this up I need a green light from @whyrusleeping to raise the ratelim (now called maxGoroutines) from the original value of 2 (now 20, but could be any number) as I'm not sure what was the original rationale for this low number, but raising it was the original motivation behind this entire PR.

@schomatis
Copy link
Collaborator Author

@Stebalien WDYT about raising the number of goroutines fetching dependencies?

gxutil/pm.go Outdated
if addedDep[dep.Hash] == false {
addedDep[dep.Hash] = true
depQueue = append(depQueue, dep)
fmt.Printf("Adding dependency %s (%s)\n", dep.Name, dep.Hash[:6])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for these print statements?

gxutil/get.go Outdated
@@ -59,6 +59,7 @@ func (pm *PM) GetPackageTo(hash, out string) (*Package, error) {
tries := 3
for i := 0; i < tries; i++ {
if err := pm.Shell().Get(hash, outtemp); err != nil {
panic(err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging?

@Stebalien
Copy link
Collaborator

WDYT about raising the number of goroutines fetching dependencies?

If it helps, do it.

@schomatis schomatis removed the request for review from whyrusleeping August 16, 2018 02:45
Split the `InstallDeps` function in two functions which represent clear separate
processes: fetching packages (`fetchDependencies`) and running the `post-
install` hook on them (`dependenciesPostInstall`).

The `fetchDependencies` function is the previous `installDeps` function (now
renamed) without the `maybeRunPostInstall` call, which was encapsulated in its
own `dependenciesPostInstall` function that replicates the same DFS traversal of
the dependency DAG to run the hook.

This commit permits a more clear understanding of the process of installing
dependencies which clears the road for the unification of the fetching functions
into one.

There is a count discrepancy bug in the `ProgMeter` introduced in this commit
but it will be fixed in the next commit when `fetchDependencies` is refactored
to eliminate the recursive implementation.
@schomatis schomatis requested a review from Stebalien August 17, 2018 00:56
@schomatis
Copy link
Collaborator Author

@Stebalien Ready for review.

@schomatis schomatis changed the title [WIP] Unify dependency fetching Unify dependency fetching Aug 19, 2018
Copy link
Collaborator

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice, clean patch. Thanks! The only thing we really need to change is waiting on the rest of the in-progress fetches to finish (so we don't write partial packages). Everything else is just my opinion.

gxutil/pm.go Outdated

// GetDependency pops the first dependency in the queue and
// returns it (or `nil` if the queue is empty).
func (dq *DependencyQueue) GetDependency() *Dependency {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Pop.

gxutil/pm.go Outdated
case err := <-fetchErrs:
Error("parallel fetch: %s", err)
return err
// The rest of the goroutines already running are guaranteed to have
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still wait. Otherwise, we'll quit the program while they're still running (and we'll end up with partial packages).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll fix that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking here that if I have an error I break out of the main for and do a separate one just to receive from the channels until no more goroutines are running (but do nothing else, don't move the progression meter). In this case I would return the first error received from the goroutines and the others would be discarded (just reported through the Error() log but not returned to the user). Is that ok?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

// structure of `fetchDependencies` but right now the `post-install` hook
// of the only sub-tool (`gx-go rewrite`) already does a parallel processing
// of its own, so there's little to gain here.
func (pm *PM) dependenciesPostInstall(pkg *Package, location string) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not do this while downloading? Not a big issue but I believe it would simplify things.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would simplify things.

Hmm, could you expand on how do you see it as a simplification? I actually split it because it made it easy to parallelize the fetching operations without having to think about the posterior rewrite process.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that we could do it in the same worker. That is:

  1. Download a package.
  2. Run post-install immediately.

That way we don't need to traverse the tree twice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other issue (I'm just remembering now, sorry) is that because you need the transitive dependencies when computing rewrites (as you we're discussing in whyrusleeping/gx-go#38 (comment)), the post-install hook needs a particular order in the DAG traversing (basically that you already fetched all the dependencies -directs and transitives- of that package before rewriting it) and that constraint invalidates the parallel fetch that's happening here (process the first dependency fetched -don't really care where it is in the DAG- and order the fetch of its dependencies) where the DAG is traversed in a somewhat stochastic manner.

(I should add this explanation to the commit message.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yeah, that makes sense.


// Keep spawning goroutines until the allowed maximum or until
// there are no new dependencies to fetch at the moment.
for activeGoroutines < maxGoroutines && depQueue.Len() > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I usually do this by spawning N goroutines up-front, having them select on a context cancellation and a work channel. You can then use a sync.WaitGroup to, e.g., wait for all the active goroutines to exit when the context is canceled. This also saves the GC some work.

However, the code is clean as-is so there's no need to do it this way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you point me to an example of that pattern in the code base to have a reference? My main motivation here is to have the code as simple as possible (the bottleneck here is, by far, the network I/O), so if it seems simpler than this I'll use that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called the worker pattern: https://snippets.aktagon.com/snippets/755-golang-worker-pattern-example, https://gobyexample.com/worker-pools

However, that really isn't necessary in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reference I'll keep it in mind.

The problem here is that the work (each fetch) is not independent of each other, I need the result of that work to decide if I give more work to the workers (new dependencies -of the already fetched ones- to be fetched) so I need some control mechanism to decide when to finish (because I need to ask the worker: are you actually fetching something right now? in that case I'll wait for that output, or if you're idle -all of them are idle- then I can close the channel and finish).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual solution to that is usually counting (count outstanding tasks). But really, the current solution is just fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yes, I was counting goroutines when I should have been counting job requests.

The previous implementation spawned multiple goroutines to fetch packages but
for the dependencies of one package at a time. This meant that the last
goroutine that took longer to finish (probably because it was fetching the
biggest dependency) held the start of more goroutines to fetch the dependencies
of the packages already fetched.

Unify all the dependencies in a single queue (abstracted in the new
`DependencyQueue` structure) so as soon as new dependencies of a fetched package
are available new goroutines can be spawned to fetch them.

Raise the number of parallel goroutines fetching packages (now `maxGoroutines`)
from 2 to 20 now that this bigger number can be leveraged from this unification.

Remove the retry logic around `GetPackageTo` (as that function already contains
code with equivalent functionality) to simplify the fetching part of the
function.
@schomatis
Copy link
Collaborator Author

@Stebalien Fixed. Thanks for catching the goroutines bug!

@whyrusleeping
Copy link
Owner

Oh, this is nice! Thanks @schomatis!

I hacked a towards making this a tiny bit cleaner in #206, but this is far cleaner. If @Stebalien is happy with the changes, i'd be down to merge this, then rebase #206 on top and make the 'cache fetch linking' stuff there use this new code.

@Stebalien
Copy link
Collaborator

Yes, this looks good now.

@Stebalien Stebalien merged commit c590921 into whyrusleeping:master Aug 30, 2018
@schomatis schomatis deleted the feat/install-deps/unify-fetch branch August 30, 2018 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use a single queue to install dependencies in installDeps
3 participants