Provenance #918

wlandau · 2019-06-22T01:55:36Z

Prework

Read and abide by drake's code of conduct.
Search for duplicates among the existing issues, both open and closed.

Description

I think drake can achieve something very similar to MLflow tracking because storr is resistant to actually deleting data, even after drake::clean() (which just removes hashes from key files). All we need to do is store data hashes in the target metadata. Then, we can look at both current and past targets and distinguish between them. In addition, we can borrow from drake's existing code analysis magic to automatically detect parameter settings from commands in the drake_plan() (let's just stick to scalars for now). Users would not need to manually declare "I want to track tuning parameters alpha and beta".
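To make the parameter-detection idea concrete, here is a minimal base R sketch. It is not drake's actual code analysis: `scalar_args()` and the example command `fit_model(...)` are hypothetical names invented for illustration. The sketch walks a quoted command and collects named arguments whose values are scalar literals.

```r
# Hypothetical helper, not drake's implementation: collect named scalar
# literal arguments from a quoted command, including nested calls.
# Limitation: it does not handle empty arguments such as x[, 1].
scalar_args <- function(expr) {
  out <- list()
  walk <- function(e) {
    if (!is.call(e)) {
      return(invisible(NULL))
    }
    args <- as.list(e)[-1] # drop the function name, keep the arguments
    nms <- names(args)
    if (is.null(nms)) {
      nms <- rep("", length(args))
    }
    for (i in seq_along(args)) {
      a <- args[[i]]
      if (nzchar(nms[i]) && is.atomic(a) && length(a) == 1L) {
        out[[nms[i]]] <<- a # named scalar literal: record it
      } else {
        walk(a) # otherwise descend into nested calls
      }
    }
  }
  walk(expr)
  out
}

# A made-up command of the kind one might write in a plan.
command <- quote(fit_model(data, alpha = 0.5, beta = 2L, method = "lasso"))
params <- scalar_args(command)
str(params)
```

Because the values are read from the unevaluated expression, nothing has to be declared up front; the trade-off is that only literals are found, not parameters computed at runtime.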


wlandau · 2019-06-22T01:56:12Z

We should also show the date and time that each target was built.

wlandau · 2019-06-22T19:46:48Z

Branch: https://github.com/ropensci/drake/tree/918. storr alone does not solve the problem, so we are using txtq to log target history in a threadsafe manner.
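The general shape of a target history log can be sketched in base R as follows. This is an illustration only, not how drake or txtq work internally: `hash_object()`, `record_build()`, and `read_history()` are hypothetical helpers, the log is a plain tab-separated file, and there is no file locking, so unlike txtq it is not safe for concurrent writers.

```r
# Hypothetical sketch of an append-only history log (no locking,
# so not threadsafe; txtq handles that part in the real feature).
hash_object <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(x, NULL), path) # serialize the value to bytes
  unname(tools::md5sum(path))        # MD5 of those bytes
}

record_build <- function(log_file, target, value) {
  entry <- data.frame(
    target = target,
    hash = hash_object(value),
    built = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    stringsAsFactors = FALSE
  )
  is_new <- !file.exists(log_file)
  write.table(
    entry, log_file,
    append = !is_new, col.names = is_new,
    row.names = FALSE, sep = "\t", quote = FALSE
  )
}

read_history <- function(log_file) {
  history <- read.delim(log_file, colClasses = "character")
  # The most recent row for each target is the current one.
  history$current <- !duplicated(history$target, fromLast = TRUE)
  history
}

log_file <- tempfile(fileext = ".tsv")
record_build(log_file, "model", c(alpha = 0.1))
record_build(log_file, "model", c(alpha = 0.2))
history <- read_history(log_file)
print(history)
```

Rows are only ever appended, so past builds stay visible alongside the current one, each with its data hash and build time.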

drake will disable history by default. Users will need to manually call make(history = TRUE). Reasons:

Currently, txtq is in Suggests in the DESCRIPTION file. If we tracked history by default, we would have to move txtq to Imports.
txtq uses file locking to avoid race conditions, so it creates a parallel bottleneck. Should not be a big deal, but I believe performance is more important than history.
Even in serial execution, there is a ~15% performance penalty for thousands of small targets. Below is a flame graph using the overhead example with 4096 targets, most with 64 dependencies each.

wlandau · 2019-06-26T12:55:17Z

Changing my mind about #918 (comment). drake now tracks history by default: fe279e6. Reasons:

History is one of those things you never know you need until it is too late, and it's such an important issue that a tiny performance hit is worth it.
txtq is a small package, so it is not such a big deal to depend on it. After "Remove superfluous dependencies" (wlandau/txtq#12), even less so.
txtq is not the only atomic operation in drake's parallel computing. Each parallel backend has a master process, after all.
push() in txtq is already much faster: see "Speed up push()" (wlandau/txtq#11).

wlandau added type: new feature status: priority topic: api topic: reproducibility labels Jun 22, 2019

wlandau self-assigned this Jun 22, 2019

wlandau mentioned this issue Jun 22, 2019

Efficient track of parameters/ artifacts for different models #891

Closed

wlandau mentioned this issue Jun 23, 2019

History and provenance #920

Merged

4 tasks

wlandau closed this as completed in 505884f Jun 23, 2019

wlandau mentioned this issue Jun 25, 2019

Speed up push() wlandau/txtq#11

Closed
