Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node sampling history of strategy is erased when nodes fail #1127

Open
sharkovsky opened this issue Jun 6, 2024 · 3 comments
Open

Node sampling history of strategy is erased when nodes fail #1127

sharkovsky opened this issue Jun 6, 2024 · 3 comments
Assignees
Labels
bug this issue is about reporting and resolving a suspected bug doing issue implementation in progress product backlog the product owner adds an entry to the product backlog, NEEDS EXPLICIT APPROVAL FROM THE PO

Comments

@sharkovsky
Copy link
Member

When nodes fail during an experiment, they are also removed from the sampling history of the strategy class.
This should not happen, because the purpose of the history is to keep track of which nodes had been selected by the strategy, not only those that didn't fail.

Steps to reproduce

on develop commit 850f73d
Based on 101 notebook

  1. insert assert False in the training_step of the training plan. This simulates a node failure.
  2. execute exp.run
  3. the output of exp.strategy()._sampling_node_history is {0: []}, while it should contain the ids of the selected nodes.

Furthermore, the experiment fails with FB401: aggregation crashes or returns an error. Aggregation aborted due to sum of the weights is equal to 0 {}. Sample sizes received from nodes might be corrupted..
However, the fact that no nodes succeeded should be caught before the aggregation is even attempted.

@sharkovsky sharkovsky added bug this issue is about reporting and resolving a suspected bug doing issue implementation in progress sprint backlog the development team adds an entry to the sprint backlog labels Jun 6, 2024
@sharkovsky sharkovsky self-assigned this Jun 6, 2024
@sharkovsky
Copy link
Member Author

@lchambon and myself will look at this together

@mvesin mvesin added product backlog the product owner adds an entry to the product backlog, NEEDS EXPLICIT APPROVAL FROM THE PO and removed sprint backlog the development team adds an entry to the sprint backlog labels Aug 2, 2024
@manos-mark
Copy link

manos-mark commented Aug 26, 2024

I think I have the same problem in a real-world experiment. I got the error

Cannot instantiate training plan object

and now I can't reverse it. Even if I changed the training plan to a previous "working" version.
I work with VPN settings and I tried to restart the node and the researcher but there is no difference.

@mvesin
Copy link
Member

mvesin commented Aug 26, 2024

Hi @manos-mark
I'm not sure to understand how your issue is related to the previous one. Could you give a few info about:

  • when does the error occur: aggregation on the researcher ? other ?
  • exp.strategy()._sampling_node_history ?
  • exp.all_federation_nodes() ?

Thank you for reporting !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug this issue is about reporting and resolving a suspected bug doing issue implementation in progress product backlog the product owner adds an entry to the product backlog, NEEDS EXPLICIT APPROVAL FROM THE PO
Projects
None yet
Development

No branches or pull requests

3 participants