
Huge overhead in adjacent_vertices in a moderately sized project #435

Closed
kendonB opened this issue Jun 29, 2018 · 12 comments

kendonB (Contributor) commented Jun 29, 2018

The plan I'm running is quite simple: I have a 20,000-element list of smallish spatial data objects, and I'm simply assigning each of the 20,000 elements to its own target.

There's a bottleneck in:
run_loop ->
build_check_store ->
prune_envir ->
downstream_deps <- nonfile_target_dependencies(targets = downstream, ....) ->
dependencies ->
adjacent_vertices ->
res <- lapply(res, function(x) create_vs(graph, x + 1))

In this example, the vast majority of the computation time seems to be taken up by this lapply() in adjacent_vertices(). I wonder if it's possible to move this work to an earlier stage so it doesn't depend on that inefficient R loop?
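For anyone who wants to reproduce the profile, a minimal sketch using only base R's Rprof(); here `plan` is a placeholder for a real drake plan with many targets:

suppressPackageStartupMessages(library(drake))
# `plan` stands in for a large drake plan, e.g. ~20000 targets.
Rprof("make.prof")
make(plan)
Rprof(NULL)
# adjacent_vertices() and create_vs() should dominate the top entries.
head(summaryRprof("make.prof")$by.total)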

kendonB (Contributor, Author) commented Jun 29, 2018

Ironically, if I change pruning_strategy from speed to memory, my plan runs much faster (maybe 4-5x).
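For reference, the switch is just the following, with plan as a placeholder ("speed" was the default at the time):

make(plan, pruning_strategy = "memory")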

wlandau (Member) commented Jun 29, 2018

Since adjacent_vertices() is part of igraph, the solution will not be easy. This brings up and clarifies some thoughts I have had in the back of my mind about streamlining drake's internals. I will work on #440 first.

wlandau (Member) commented Jul 4, 2018

Looking at the traceback, I am less sure that #440 will solve this. Environments cannot be subsetted with vectors of indices, so we would still end up calling the equivalent of adjacent_vertices(), either vectorized or via an lapply()-like function, over the large downstream argument to prune_envir(). Any parallelism we add would not be available inside build_check_store() because it is already being applied in parallel.

A different tack: rethink the pruning strategies.

  • speed: just load the current target's dependencies and do not worry about anything downstream. Candidate for the new default.
  • lookahead: Look at all the downstream targets and unload anything they won't need. This is the current default, and it is currently named "speed".
  • memory: for each target, load the dependencies and unload everything else.

I am not actually sure we need lookahead pruning, and it certainly complicates the code, but we should probably keep it for now in case some projects need it. A rough sketch of the three behaviors is below.
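To make the proposal concrete, here is a sketch of what each strategy would keep in memory before building a target. This is illustrative only, not drake's real internals; deps_of() and downstream_of() are hypothetical helpers:

# Hypothetical sketch: prune `envir` before building `target`.
prune_envir_sketch <- function(target, envir, graph, strategy) {
  deps <- deps_of(target, graph)  # hypothetical: direct dependencies
  keep <- switch(strategy,
    speed = ls(envir),  # load deps as needed, never unload anything
    lookahead = union(deps, deps_of(downstream_of(target, graph), graph)),
    memory = deps  # only the current target's direct dependencies
  )
  rm(list = setdiff(ls(envir), keep), envir = envir)
  # ...then load whichever of `deps` are not already in `envir`.
}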

kendonB (Contributor, Author) commented Jul 4, 2018

The loop that's the bottleneck also depends on igraph_opt("return.vs.es") being TRUE. Are you sure you actually need this option?

wlandau (Member) commented Jul 4, 2018

Wow, things really speed up if igraph_opt("return.vs.es") is FALSE!

suppressPackageStartupMessages(library(igraph))
n <- 20000
g <- sample_k_regular(n, k = n / 4, directed = TRUE)
v <- seq_len(n / 2)
igraph_options(return.vs.es = TRUE)
bench::mark(adjacent_vertices(g, v = v))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 10
#>   expression        min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>          <bch:> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 adjacent_vert…   1.9s  1.9s   1.9s  1.9s     0.526    3.73GB    18     1
#> # ... with 1 more variable: total_time <bch:tm>
igraph_options(return.vs.es = FALSE)
bench::mark(adjacent_vertices(g, v = v))
#> # A tibble: 1 x 10
#>   expression        min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>          <bch:> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 adjacent_vert…  191ms 192ms  192ms 192ms      5.21     382MB     0     3
#> # ... with 1 more variable: total_time <bch:tm>

@gaborcsardi, will the return.vs.es option stick around for the long term, or will it eventually be deprecated? From the igraph_options help file:

return.vs.es

Whether functions that return a set or sequence of vertices/edges should return formal vertex/edge sequence objects. This option was introduced in igraph version 1.0.0 and defaults to TRUE. If your package requires the old behavior, you can set it to FALSE in the .onLoad function of your package, without affecting other packages.
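Per that help text, the package-level fix would be a one-liner in drake's .onLoad() (a sketch):

.onLoad <- function(libname, pkgname) {
  # Return plain integer indices instead of formal vertex/edge sequences.
  igraph::igraph_options(return.vs.es = FALSE)
}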

gaborcsardi commented:

It will not be removed.

wlandau (Member) commented Jul 4, 2018

Fantastic! @kendonB, would you try 5cfd7e9?

wlandau (Member) commented Jul 6, 2018

pruning_strategy now has three choices, all documented in ?config and ?make (usage examples after the list). In case igraph_options(return.vs.es = FALSE) is not enough, make(pruning_strategy = "speed") avoids dependency lookups as much as possible. Even without #440, we have covered all our bases.

  • "lookahead" (default): keep loaded targets in memory until they are no longer needed as dependencies in downstream build steps. Then, unload them from the environment. This step avoids keeping unneeded data in memory and minimizes expensive reads from the cache. However, it requires looking ahead in the dependency graph, which could add overhead for every target of projects with lots of targets.
  • "speed": Once a target is loaded in memory, just keep it there. Maximizes speed, but hogs memory.
  • "memory": For each target, unload everything from memory except the target's direct dependencies. Conserves memory, but sacrifices speed because each new target needs to reload any previously unloaded targets from the cache.

idavydov commented:

Hi, not sure if that's the same issue or not.

I have a large plan (~5k targets). With the default make(), it takes a couple of minutes before the first job is submitted.

However, if I use:

make(myplan,
     parallelism = "clustermq", jobs = 25,
     verbose = 4,
     lazy_load = TRUE, pruning_strategy = "memory", garbage_collection = TRUE,
     caching = "worker")

it takes hours before clustermq starts spawning the workers.

The log looks like this:

cache /...
analyze environment
analyze 99 imports: a, b, c, d...
construct graph
import a
import b
import c
(and after all the imports it just stops)

The bottleneck seems to be somewhere in:

"mcfork" "FUN" "lapply" "parallel::mclapply" "safe_mclapply" "lightly_parallelize_atomic" "lightly_parallelize" "eval" "eval" "%>%" "bdg_analyze_imports" "force" "memo_expr" "build_drake_graph" "drake_config" "make"

It seems that pruning_strategy = 'speed' does not help much.

Is there anything that can be done?

idavydov commented Oct 18, 2018

It seems that decreasing the number of jobs improves things dramatically.
Could it be that drake forks as many processes as jobs and overloads the host? Is there a way around this?

wlandau (Member) commented Oct 18, 2018

Wow, that really is severe! What about fewer jobs for local processing? When you write jobs = c(imports = 4, targets = 25), 4 jobs will also be used for preprocessing tasks such as graph construction (so maybe "local" and "remote" might have been better names than "imports" and "targets"). jobs = 25 is the same as jobs = c(imports = 25, targets = 25), so I think this could indeed be overwhelming the host.
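Concretely, with your plan from above:

make(myplan, parallelism = "clustermq",
     jobs = c(imports = 4, targets = 25))  # 4 local jobs for preprocessing,
                                           # 25 workers for building targets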

I am not sure where else to look for the bottleneck. The code in the traceback you mentioned actually runs before any of the imports are processed.

In any case, I am very interested in decreasing overhead for large numbers of targets.

idavydov commented Oct 18, 2018

Wow, decreasing the number of jobs for imports helped a lot indeed.
Thanks!
