Automatic cluster state export upon test timeout #5437
Labels
enhancement
Improve existing functionality or make things work better
flaky test
Intermittent failures on CI.
We are frequently bothered by unit tests timing out. Most of the timeout are not easily reproducible and are hard to debug. In the past, these timeouts could often be linked to a known issue, deadlocking the cluster. Often this deadlock can be investigated by inspecting the cluster state.
For a manual extraction of the cluster state, I once wrote #5068 which is trying to create a clean, serializable representation of the cluster.
Upon test timeout, this cluster dump could be persisted and archived as an artifact of the GH actions runner.
For instance, if
gen_cluster
fixture is used with a client, a timeout exception could be handled and the state is persisted, seedistributed/distributed/utils_test.py
Lines 954 to 964 in 6a0217e
The text was updated successfully, but these errors were encountered: