
Search branch until finding something out of date in parallel #168

Closed
kendonB opened this issue Nov 23, 2017 · 10 comments
@kendonB
Contributor

kendonB commented Nov 23, 2017

I just ran into a case where a potential parallelisation wasn't realised.

The dependency structure was something like:

A -> B -> D -> E
A -> C -> F

A, B, C, and D are up to date. Drake checked A, then B and C, then D and F; it built F, then checked and built E.

I think the most efficient behavior would be:
A, B, C, and D are up to date. Drake checks A, then B and C, then D and F, then E, and then builds E and F in parallel.
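
For concreteness, here is a hypothetical drake_plan() that reproduces this dependency structure (the commands are placeholders, not the real project; drake infers the edges from which targets each command mentions):

```r
library(drake)

# Single-letter names just mirror the example; avoid `F`/`T` in real plans
# since they shadow R's shorthand for FALSE/TRUE.
plan <- drake_plan(
  A = 1,
  B = A + 1,  # A -> B
  C = A + 2,  # A -> C
  D = B + 1,  # B -> D
  E = D + 1,  # D -> E
  F = C + 1   # C -> F
)

# With staged scheduling, E waits for the whole D/F stage to finish.
make(plan, jobs = 2)
```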

@wlandau-lilly
Collaborator

I see your point, but I do not know how to solve this one. We would need a whole new parallel graph traversal algorithm, and then some of the most sensitive internals would need refactoring. It is a huge undertaking.

But it may be possible. Somehow, Make already knows how to do this, which means "Makefile" parallelism in drake should not have this problem. I submitted a Stack Overflow post to ask how exactly that works. In the meantime, if you can suggest a parallel graph traversal algorithm, that would help things along.
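
One family of candidates is Kahn's algorithm driven by worker completions: keep a count of unmet dependencies per target and hand a target to any idle worker the moment its count reaches zero. A rough sequential sketch of the bookkeeping (not drake code; all names here are made up):

```r
# `deps` maps each target to its direct dependencies.
ready_order <- function(deps) {
  deps_left <- vapply(deps, length, integer(1))
  dependents <- lapply(names(deps), function(target) {
    names(deps)[vapply(deps, function(d) target %in% d, logical(1))]
  })
  names(dependents) <- names(deps)
  queue <- names(deps_left)[deps_left == 0]
  build_order <- character(0)
  while (length(queue) > 0) {
    current <- queue[1]  # in a parallel version, an idle worker would claim this target
    queue <- queue[-1]
    build_order <- c(build_order, current)
    for (d in dependents[[current]]) {
      deps_left[[d]] <- deps_left[[d]] - 1
      if (deps_left[[d]] == 0) queue <- c(queue, d)
    }
  }
  build_order
}

deps <- list(A = character(0), B = "A", C = "A", D = "B", E = "D", F = "C")
ready_order(deps)
# "A" "B" "C" "D" "F" "E": E and F become ready independently,
# so with two workers they never have to share a stage.
```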

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 24, 2017

A new parallel graph traversal algorithm will solve, accelerate, clean up, and simplify a lot, including this issue. I sketched one out at https://github.com/wlandau-lilly/drake/tree/parallel_algo (see parallel_worker()). There is a fixed number of persistent workers that stay alive for the entire make(). Each worker completes its current target and then searches for the next unclaimed target whose dependencies are complete (a simplified sketch of this loop follows the list below). Issues:

  1. Race conditions. I try to make use of the "progress" storr namespace to implement locking, but sometimes I get errors or hanging jobs. I do not yet know why.
  2. Memory management. I have not decided on an alternative to prune_envir(). This might be addressed by #167 (Consider passing target arguments as promises?).
  3. Having the same persistent workers for the entire make() means long wall times. Users of resource managers such as SLURM may not be able to request wall times long enough, depending on the user privileges granted by their respective systems. On the other hand, reusing existing workers may remove much of the annoying overhead of starting and stopping jobs. The TORQUE example might even work on my botched local installation.
  4. "Makefile" parallelism: should it still spawn one job per target, or should it reuse the persistent workers? For the sake of the wall time issue, I lean towards the former.

@kendonB
Contributor Author

kendonB commented Nov 24, 2017

I don't think assuming persistent workers is a problem. Users can control wall times by building subsets of myplan$targets, as I have been doing, switching memory and wall-time configurations for different sets of targets.

However, you might have a longer-term ambition to build target-level future.batchtools configurations into myplan so the user does not have to make multiple calls to make(). I wouldn't try that feature until the dependent packages are a bit more mature.
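
For illustration, the manual pattern looks roughly like this (the plan name, target groupings, and resource choices are made up):

```r
library(drake)

# Hypothetical target groupings; memory and wall-time requests are changed
# in the HPC backend between calls (with a future.batchtools setup, one
# would also switch future::plan() here).
cheap  <- c("A", "B", "C")
costly <- c("D", "E", "F")

# First pass: short wall time, little memory, many workers.
make(my_plan, targets = cheap, jobs = 8)

# Reconfigure the scheduler/template, then build the expensive targets.
make(my_plan, targets = costly, jobs = 2)
```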

@kendonB
Contributor Author

kendonB commented Nov 24, 2017

I just had a closer look at this, and I think that building on your original approach is the way to go. I'm imagining running a job with hundreds of workers all fumbling to get going on the same targets. It'll be a mess. File locking surely isn't a perfect strategy.

I can't imagine that this approach can work with many workers :/

Trimming up-to-date targets from the graph at each parallel stage seems like it could work and wouldn't be too complicated to implement?
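
A rough igraph sketch of the trimming idea (not drake internals): remove the up-to-date targets before computing the next stage, so only outdated work defines the stages.

```r
library(igraph)

next_stage <- function(graph, up_to_date) {
  pruned <- delete_vertices(graph, intersect(up_to_date, V(graph)$name))
  # A target is ready when none of its remaining dependencies are left in the graph.
  V(pruned)$name[degree(pruned, mode = "in") == 0]
}

g <- graph_from_literal(A -+ B, A -+ C, B -+ D, D -+ E, C -+ F)
next_stage(g, up_to_date = c("A", "B", "C", "D"))
# "E" "F": both land in the same stage after trimming.
```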

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 24, 2017

I hope you are right. It is definitely easier to stick with parallel stages and one job per target, and in the short term, it is much safer. Plus, with a fixed set of permanent workers, your idea of target-level hpc configurations would be impossible.

I just feel bad that drake's graph traversal algorithm is less aggressive than that of Make itself. With lapply()-like backends, drake needs to finish all the jobs in a stage before starting any of the next targets. Make, on the other hand, can just fork new jobs on the fly whenever a worker finishes and the graph opens up. It may be that drake's non-"Makefile" backends are just more constrained than Make and there is nothing we can do about it. But it could also be that I naively dove into the first algorithm that came to mind without actively searching for enough other possibilities. Maybe future without future_lapply() could be more aggressive; I am not sure.
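
For example, a hedged sketch of what bare future() scheduling could look like (not a proposal for drake's internals; deps and build() are stand-ins): each target launches as its own future the moment its dependencies resolve, which mimics Make's on-the-fly forking.

```r
library(future)
plan(multisession, workers = 2)

deps  <- list(A = character(0), B = "A", C = "A", D = "B", E = "D", F = "C")
build <- function(target) { Sys.sleep(1); target }  # placeholder for a real command

futures <- list()
done <- character(0)
while (length(done) < length(deps)) {
  # Launch every target whose dependencies are done and which is not yet running.
  ready <- names(deps)[vapply(deps, function(d) all(d %in% done), logical(1))]
  for (tgt in setdiff(ready, c(done, names(futures)))) {
    futures[[tgt]] <- future(build(tgt))
  }
  # Harvest finished futures; each completion opens up downstream targets.
  for (tgt in names(futures)) {
    if (resolved(futures[[tgt]])) {
      value(futures[[tgt]])
      done <- c(done, tgt)
      futures[[tgt]] <- NULL
    }
  }
  Sys.sleep(0.1)  # a real scheduler would block more gracefully
}
```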

So we can stick to the original intent of this issue. My company's hpc systems will be down this weekend, and I am spending most of my time with family right now anyway, so my own work will be more delayed than usual.

@kendonB
Contributor Author

kendonB commented Nov 24, 2017

future without future_lapply can likely achieve aggressive parallelism like you describe. I'm still wrapping my head around that package, but it seems like this is exactly the use case they were thinking of. However, many users (at least me) would want to be able to switch plans between sets of targets. All the best to you and your family over the holiday period, Will! Stop working!

wlandau-lilly added a commit that referenced this issue Nov 24, 2017
wlandau-lilly added a commit that referenced this issue Nov 25, 2017
This is part of the solution to #168: if imports and targets are processed together, the full power of parallelism is taken away from the targets. Also, the way parallelism happens is now consistent for all parallel backends.
@wlandau-lilly
Collaborator

Thanks for the encouragement, Kendon! I am taking a lot of time off.

Part of the solution to this issue will be to always process all the imports before all the targets, no matter which parallel backend is used. This will simplify, fortify, and standardize the parallel execution routines, and it solves a related problem: with local parallelism, imported functions are sometimes lined up with true targets, which detracts from the parallelism of the targets themselves.

Also, this issue requires a deep update to max_useful_jobs() and rate_limiting_times(), which eventually led me to #170.

@kendonB
Contributor Author

kendonB commented Nov 25, 2017 via email

@wlandau-lilly
Collaborator

Absolutely. In fact, unnecessary checking is removed elsewhere so drake relies more heavily on last-minute checking in each stage.

@wlandau-lilly
Collaborator

Fixed, unit tested, and mentioned in the vignettes and README.
