HPC profiling #937

Closed
2 tasks done
wlandau opened this issue Jul 10, 2019 · 5 comments

@wlandau
Member

wlandau commented Jul 10, 2019


Description

I have a workflow on an SGE cluster where the command of each target takes 5-6 seconds to run but the total processing time is > 2 min. I think it is about time we did a profiling study on HPC. We can use Rprof() on both the master and a worker and visualize the results with profvis or pprof.
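
To make the idea concrete, here is a minimal sketch of the master-side half of this, assuming profvis is installed and `my_plan` is a placeholder for whatever workflow we end up profiling:

library(drake)
library(profvis)

rprof_file <- tempfile(fileext = ".out")
Rprof(filename = rprof_file)     # start sampling the master process
make(my_plan)                    # my_plan is a placeholder for the real workflow
Rprof(NULL)                      # stop sampling
profvis(prof_input = rprof_file) # interactive flame graph of the master session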

@brendanf
Contributor

I can also run your test case on my SLURM cluster, in case the cluster scheduler makes a difference.

@wlandau
Member Author

wlandau commented Jul 12, 2019

Awesome! I will let you know when the time comes.

wlandau changed the title from "HPC profiling studies" to "HPC profiling" on Jul 28, 2019
@wlandau
Member Author

wlandau commented Aug 10, 2019

Just uploaded a test case to https://github.com/wlandau/drake-examples/tree/master/hpc-profiling. Here are baseline results (without HPC). As expected, the bottlenecks are reading and writing the datasets (320 MB each). The next step is to try the same script on a cluster and show a flame graph.

library(dplyr)
library(drake)
library(jointprof) # remotes::install_github("r-prof/jointprof")
library(profile)

# Just throw around some medium-ish data.
create_plan <- function() {
  plan <- drake_plan(
    data1 = target(x, transform = map(i = !!seq_len(8))),
    data2 = target(data1, transform = map(data1, .id = FALSE)),
    data3 = target(data2, transform = map(data2, .id = FALSE)),
    data4 = target(data3, transform = map(data3, .id = FALSE)),
    data5 = target(bind_rows(data4), transform = combine(data4))
  )
  plan$format <- "fst"
  plan
}

# Profiling.
run_profiling <- function(
  path = "profile.proto",
  n = 2e7,
  parallelism = "loop",
  jobs = 1L
) {
  unlink(path)
  unlink(".drake", recursive = TRUE, force = TRUE)
  on.exit(unlink(".drake", recursive = TRUE, force = TRUE))
  x <- data.frame(x = runif(n), y = runif(n)) # about 320 MB at n = 2e7
  plan <- create_plan()
  rprof_file <- tempfile()
  Rprof(filename = rprof_file)
  options(
    clustermq.scheduler = "sge",
    clustermq.template = "sge_clustermq.tmpl"
  )
  make(
    plan,
    parallelism = parallelism,
    jobs = jobs,
    memory_strategy = "preclean",
    garbage_collection = TRUE
  )
  Rprof(NULL)
  data <- read_rprof(rprof_file)
  write_pprof(data, path)
}

# Visualize the results.
vis_pprof <- function(path, host = "localhost") {
  server <- sprintf("%s:%s", host, random_port())
  message("local pprof server: http://", server)
  system2(find_pprof(), c("-http", server, shQuote(path)))
}

# Pick a random port in the ephemeral range (49152-65535) for the pprof server.
random_port <- function(from = 49152L, to = 65535L) {
  sample(seq.int(from = from, to = to, by = 1L), size = 1)
}

# Run stuff.
path <- "profile.proto"
run_profiling(path, n = 2e7)
vis_pprof(path)

[Screenshot: pprof flame graph of the baseline (non-HPC) run]
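
For the worker side, one rough option (not drake API, just a sketch) is to wrap Rprof() around the body of a target's command so each clustermq worker writes its own profile file, which we can then read back with profvis or read_rprof()/write_pprof() as above. The file name below is illustrative:

library(drake)

plan <- drake_plan(
  data1 = {
    Rprof(filename = "worker_data1.Rprof") # lands in the worker's working directory
    out <- data.frame(x = runif(1e6), y = runif(1e6))
    Rprof(NULL)
    out
  }
)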

@wlandau
Member Author

wlandau commented Aug 11, 2019

Here's what the flame graph looks like on an HPC system (with, admittedly, n = 1e6 instead of 2e7). Apparently, not much time is spent sending data to sockets. Saving and loading the data is still the bottleneck, and I see no surprises given that we are using the "preclean" memory strategy. Maybe we need to test drive HPC with lots of small targets.

[Screenshot: pprof flame graph of the HPC run (n = 1e6)]
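
As a cheaper cross-check on where the time goes, drake records per-target timings in the cache, so build_times() can separate the time spent inside commands from the overall wall-clock time of make(). A quick sketch, assuming the .drake cache from the run above is still present:

library(drake)

times <- build_times() # one row per cached target, with elapsed/user/system columns
times[order(as.numeric(times$elapsed), decreasing = TRUE), ] # slowest targets first
sum(as.numeric(times$elapsed)) # seconds inside commands; the rest is scheduling and storage overhead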

@wlandau
Member Author

wlandau commented Aug 11, 2019

I just tried a larger number of small targets on SGE, and there is no clear bottleneck. I think the biggest source of overhead is still reading from and writing to the cache. This makes me question how much #933 (comment), mschubert/clustermq#147, and mschubert/clustermq#154 will really help drake workflows.

With #971, it should be possible to spoof the cache so nothing is saved at all, which should increase speed. I may implement this if someone requests it. Related: #575.

library(dplyr)
library(drake)
library(jointprof) # remotes::install_github("r-prof/jointprof")
library(profile)

create_plan <- function() {
  plan <- drake_plan(
    data1 = target(x, transform = map(i = !!seq_len(1e2))),
    data2 = target(data1, transform = map(data1, .id = FALSE)),
    data3 = target(data2, transform = map(data2, .id = FALSE))
  )
  plan
}

# Profiling.
run_profiling <- function(
  path = "profile.proto",
  n = 2e7,
  parallelism = "loop",
  jobs = 1L
) {
  unlink(path)
  clean(destroy = TRUE)
  x <- data.frame(x = runif(n), y = runif(n)) # dataset passed to every data1_* target
  plan <- create_plan()
  rprof_file <- tempfile()
  Rprof(filename = rprof_file)
  options(
    clustermq.scheduler = "sge",
    clustermq.template = "sge_clustermq.tmpl"
  )
  make(
    plan,
    parallelism = parallelism,
    jobs = jobs,
    cache = storr::storr_environment(),
    history = FALSE,
    console_log_file = "drake.log"
  )
  Rprof(NULL)
  data <- read_rprof(rprof_file)
  write_pprof(data, path)
  clean(destroy = TRUE)
}

# Visualize the results.
vis_pprof <- function(path, host = "localhost") {
  server <- sprintf("%s:%s", host, random_port())
  message("local pprof server: http://", server)
  system2(find_pprof(), c("-http", server, shQuote(path)))
}

# Pick a random port in the ephemeral range (49152-65535) for the pprof server.
random_port <- function(from = 49152L, to = 65535L) {
  sample(seq.int(from = from, to = to, by = 1L), size = 1)
}

# Run stuff.
path <- "profile.proto"
run_profiling(path, n = 10, parallelism = "clustermq", jobs = 16)
vis_pprof(path)

[Screenshot: pprof flame graph of the SGE run with many small targets]
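
For a quick text-only look without the pprof toolchain, base R's summaryRprof() can be pointed at the same Rprof output. A sketch, assuming the profile was written to a persistent file (here "master.Rprof" is a placeholder name) instead of a tempfile:

prof_summary <- summaryRprof("master.Rprof")
head(prof_summary$by.total, 10) # functions ranked by total time on the sampling stack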
