Node sampling history of strategy is erased when nodes fail #1127
Labels
- bug: this issue is about reporting and resolving a suspected bug
- doing: issue implementation in progress
- product backlog: the product owner adds an entry to the product backlog, NEEDS EXPLICIT APPROVAL FROM THE PO
When nodes fail during an experiment, they are also removed from the sampling history of the strategy class.
This should not happen: the purpose of the history is to keep track of every node the strategy selected, not only the nodes that did not fail.
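The expected behaviour can be illustrated with a minimal sketch. It is not Fed-BioMed code: the `SamplingHistory` class and its methods are hypothetical names used only to show the idea that a failure is recorded separately and never rewrites the record of which nodes were selected.

```python
from collections import defaultdict


class SamplingHistory:
    """Hypothetical stand-in for a strategy's sampling history (not Fed-BioMed code)."""

    def __init__(self):
        # round index -> every node id sampled for that round
        self.selected = defaultdict(list)
        # round index -> node ids that failed during that round
        self.failed = defaultdict(list)

    def record_selection(self, round_, node_ids):
        self.selected[round_] = list(node_ids)

    def record_failure(self, round_, node_id):
        # The failure is tracked on its own; the selection record is left untouched.
        if node_id not in self.failed[round_]:
            self.failed[round_].append(node_id)


history = SamplingHistory()
history.record_selection(0, ["node_a", "node_b"])
history.record_failure(0, "node_a")
history.record_failure(0, "node_b")
```

Even with both nodes failing, `history.selected[0]` still lists both node ids, which is the property this issue asks for.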
Steps to reproduce
1. On `develop`, commit 850f73d.
2. Based on the 101 notebook, add `assert False` in the `training_step` of the training plan. This simulates a node failure.
3. `exp.run`
4. `exp.strategy()._sampling_node_history` is `{0: []}`, while it should contain the ids of the selected nodes.

Furthermore, the experiment fails with `FB401: aggregation crashes or returns an error. Aggregation aborted due to sum of the weights is equal to 0 {}. Sample sizes received from nodes might be corrupted.` However, the fact that no nodes succeeded should be caught before the aggregation is even attempted.
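One way to catch the "no node succeeded" case early is a guard that runs before aggregation. The sketch below is only an illustration under stated assumptions: `NoSuccessfulNodeError` and `check_replies_before_aggregation` are hypothetical names, and the shape of `replies` (node id mapped to reported sample size) is assumed, not taken from Fed-BioMed.

```python
class NoSuccessfulNodeError(RuntimeError):
    """Hypothetical error raised when every sampled node failed."""


def check_replies_before_aggregation(selected_nodes, replies):
    """Return the replies that can be aggregated, or raise if there are none.

    `replies` maps node id -> sample size reported by that node (assumed shape).
    """
    usable = {
        node: size
        for node, size in replies.items()
        if node in selected_nodes and size > 0
    }
    if not usable:
        # Fail with an explicit message instead of letting the aggregator
        # divide by a total weight of zero.
        raise NoSuccessfulNodeError(
            f"None of the {len(selected_nodes)} selected node(s) returned a usable reply; "
            "aggregation skipped."
        )
    return usable


try:
    check_replies_before_aggregation(["node_a", "node_b"], {})
    outcome = "aggregated"
except NoSuccessfulNodeError:
    outcome = "aborted before aggregation"
```

With such a check, the failure is reported as "all nodes failed" rather than as a zero-weight aggregation error.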