HPC profiling #937

Closed
2 tasks done
wlandau opened this issue Jul 10, 2019 · 5 comments

@wlandau
Member

wlandau commented Jul 10, 2019


Description

I have a workflow on an SGE cluster where the command of each target takes 5-6 seconds to run but the total processing time is > 2 min. I think it is about time we did a profiling study on HPC. We can use Rprof() on both the master and a worker and visualize the results with profvis or pprof.
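
To make the idea concrete, here is a minimal sketch of the master-side half of this, assuming profvis is installed and `my_plan` is a placeholder for whatever workflow we end up profiling:

library(drake)
library(profvis)

rprof_file <- tempfile(fileext = ".out")
Rprof(filename = rprof_file)     # start sampling the master process
make(my_plan)                    # my_plan is a placeholder for the real workflow
Rprof(NULL)                      # stop sampling
profvis(prof_input = rprof_file) # interactive flame graph of the master session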

@brendanf
Contributor

I can also run your test case on my SLURM cluster, in case the cluster scheduler makes a difference.

@wlandau
Member Author

wlandau commented Jul 12, 2019

Awesome! I will let you know when the time comes.

wlandau changed the title from "HPC profiling studies" to "HPC profiling" on Jul 28, 2019
@wlandau
Member Author

wlandau commented Aug 10, 2019

Just uploaded a test case to https://github.com/wlandau/drake-examples/tree/master/hpc-profiling. Here are baseline results (without HPC). As expected, the bottlenecks are reading and writing the datasets (320 MB each). The next step is to try the same script on a cluster and show a flame graph.

library(dplyr)
library(drake)
library(jointprof) # remotes::install_github("r-prof/jointprof")
library(profile)

# Just throw around some medium-ish data.
create_plan <- function() {
  plan <- drake_plan(
    data1 = target(x, transform = map(i = !!seq_len(8))),
    data2 = target(data1, transform = map(data1, .id = FALSE)),
    data3 = target(data2, transform = map(data2, .id = FALSE)),
    data4 = target(data3, transform = map(data3, .id = FALSE)),
    data5 = target(bind_rows(data4), transform = combine(data4))
  )
  plan$format <- "fst"
  plan
}

# Profiling.
run_profiling <- function(
  path = "profile.proto",
  n = 2e7,
  parallelism = "loop",
  jobs = 1L
) {
  unlink(path)
  unlink(".drake", recursive = TRUE, force = TRUE)
  on.exit(unlink(".drake", recursive = TRUE, force = TRUE))
  x <- data.frame(x = runif(n), y = runif(n)) # about 320 MB at n = 2e7
  plan <- create_plan()
  rprof_file <- tempfile()
  Rprof(filename = rprof_file)
  options(
    clustermq.scheduler = "sge",
    clustermq.template = "sge_clustermq.tmpl"
  )
  make(
    plan,
    parallelism = parallelism,
    jobs = jobs,
    memory_strategy = "preclean",
    garbage_collection = TRUE
  )
  Rprof(NULL)
  data <- read_rprof(rprof_file)
  write_pprof(data, path)
}

# Visualize the results.
vis_pprof <- function(path, host = "localhost") {
  server <- sprintf("%s:%s", host, random_port())
  message("local pprof server: http://", server)
  system2(find_pprof(), c("-http", server, shQuote(path)))
}

# Pick a random port in the ephemeral range (49152-65535) for the pprof server.
random_port <- function(from = 49152L, to = 65535L) {
  sample(seq.int(from = from, to = to, by = 1L), size = 1)
}

# Run stuff.
path <- "profile.proto"
run_profiling(path, n = 2e7)
vis_pprof(path)

[Screenshot: pprof flame graph of the baseline (non-HPC) run]
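
For the worker side, one rough option (not drake API, just a sketch) is to wrap Rprof() around the body of a target's command so each clustermq worker writes its own profile file, which we can then read back with profvis or read_rprof()/write_pprof() as above. The file name below is illustrative:

library(drake)

plan <- drake_plan(
  data1 = {
    Rprof(filename = "worker_data1.Rprof") # lands in the worker's working directory
    out <- data.frame(x = runif(1e6), y = runif(1e6))
    Rprof(NULL)
    out
  }
)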

@wlandau
Member Author

wlandau commented Aug 11, 2019

Here's what the flame graph looks like on an HPC system (with, admittedly, n = 1e6 instead of 2e7). Apparently, not much time is spent sending data to sockets. Saving and loading the data is still the bottleneck, and I see no surprises given that we are using the "preclean" memory strategy. Maybe we need to test drive HPC with lots of small targets.

[Screenshot: pprof flame graph of the HPC run (n = 1e6)]
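
As a cheaper cross-check on where the time goes, drake records per-target timings in the cache, so build_times() can separate the time spent inside commands from the overall wall-clock time of make(). A quick sketch, assuming the .drake cache from the run above is still present:

library(drake)

times <- build_times() # one row per cached target, with elapsed/user/system columns
times[order(as.numeric(times$elapsed), decreasing = TRUE), ] # slowest targets first
sum(as.numeric(times$elapsed)) # seconds inside commands; the rest is scheduling and storage overhead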

@wlandau
Member Author

wlandau commented Aug 11, 2019

I just tried a larger number of small targets on SGE, and there is no clear bottleneck. I think the biggest source of overhead is still reading from and writing to the cache. This makes me question how much #933 (comment), mschubert/clustermq#147, and mschubert/clustermq#154 will really help drake workflows.

With #971, it should be possible to spoof the cache so nothing is saved at all, which should increase speed. I may implement this if someone requests it. Related: #575.

library(dplyr)
library(drake)
library(jointprof) # remotes::install_github("r-prof/jointprof")
library(profile)

create_plan <- function() {
  plan <- drake_plan(
    data1 = target(x, transform = map(i = !!seq_len(1e2))),
    data2 = target(data1, transform = map(data1, .id = FALSE)),
    data3 = target(data2, transform = map(data2, .id = FALSE))
  )
  plan
}

# Profiling.
run_profiling <- function(
  path = "profile.proto",
  n = 2e7,
  parallelism = "loop",
  jobs = 1L
) {
  unlink(path)
  clean(destroy = TRUE)
  x <- data.frame(x = runif(n), y = runif(n)) # dataset passed to every data1_* target
  plan <- create_plan()
  rprof_file <- tempfile()
  Rprof(filename = rprof_file)
  options(
    clustermq.scheduler = "sge",
    clustermq.template = "sge_clustermq.tmpl"
  )
  make(
    plan,
    parallelism = parallelism,
    jobs = jobs,
    cache = storr::storr_environment(),
    history = FALSE,
    console_log_file = "drake.log"
  )
  Rprof(NULL)
  data <- read_rprof(rprof_file)
  write_pprof(data, path)
  clean(destroy = TRUE)
}

# Visualize the results.
vis_pprof <- function(path, host = "localhost") {
  server <- sprintf("%s:%s", host, random_port())
  message("local pprof server: http://", server)
  system2(find_pprof(), c("-http", server, shQuote(path)))
}

# Pick a random port in the ephemeral range (49152-65535) for the pprof server.
random_port <- function(from = 49152L, to = 65535L) {
  sample(seq.int(from = from, to = to, by = 1L), size = 1)
}

# Run stuff.
path <- "profile.proto"
run_profiling(path, n = 10, parallelism = "clustermq", jobs = 16)
vis_pprof(path)

[Screenshot: pprof flame graph of the SGE run with many small targets]
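
For a quick text-only look without the pprof toolchain, base R's summaryRprof() can be pointed at the same Rprof output. A sketch, assuming the profile was written to a persistent file (here "master.Rprof" is a placeholder name) instead of a tempfile:

prof_summary <- summaryRprof("master.Rprof")
head(prof_summary$by.total, 10) # functions ranked by total time on the sampling stack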
