This repository has been archived by the owner on Mar 31, 2022. It is now read-only.

resume repairs automatically after errors #106

Open
marcusb opened this issue Jun 29, 2015 · 2 comments


@marcusb
Member

marcusb commented Jun 29, 2015

The repair often ends up in the ERROR state if nodes are down or restarted. Sometimes the message is "Exception: null". After this happens, the repair must be resumed manually with spreaper. It would be preferable for it to resume automatically, perhaps after some delay.

@rzvoncek
Contributor

It looks like the "Exception: null" happened when one of the JMX calls failed (for mundane reasons).

#107 adds an extra check for this and also automatically resumes a run that is in the ERROR state.
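
For illustration only, here is one plausible way a message like "Exception: null" can come about, together with a defensive check. This is an assumption, not something confirmed from the Reaper code or from #107: in Java, `Throwable.getMessage()` returns `null` when an exception is created without a detail message, so concatenating it into a string yields the literal text "null". The class `ExceptionDescriptions` and the method `describe` are hypothetical names made up for this sketch.

```java
// Hypothetical helper for illustration; not taken from Reaper or from #107.
final class ExceptionDescriptions {
    private ExceptionDescriptions() {}

    /**
     * Builds a log-friendly description of a failure.
     * Falls back to the exception type (and the cause, if any) when the
     * exception carries no detail message, so the result never ends in "null".
     */
    static String describe(Throwable t) {
        String type = t.getClass().getName();
        String message = t.getMessage();
        if (message != null && !message.isEmpty()) {
            return type + ": " + message;
        }
        // No detail message: report the type, plus the cause when there is one.
        Throwable cause = t.getCause();
        return cause == null ? type : type + " caused by " + cause;
    }
}
```

With this helper, `describe(new IllegalStateException())` gives the exception's class name instead of a bare "null", which makes it easier to tell which call actually failed.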

@Bj0rnen
Contributor

Bj0rnen commented Jul 15, 2015

We tweaked our approach to this. We agreed that ERROR should mean nothing other than "unrecoverable error", so we simply don't set the repair run to that state unless the failure is known to be unrecoverable (a repair segment mismatch with the cluster topology is the only known case for now). The run now keeps retrying when it is hit by exceptions that we don't handle anywhere.

Hopefully that doesn't become a problem in and of itself. At least it's better than retrying when we already know that it's not going to work.
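
The policy described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Reaper implementation: `RepairRetrySketch`, `RepairStep`, `UnrecoverableRepairException`, and the `retryDelayMillis` and `maxAttempts` parameters are all hypothetical. The attempt cap in particular exists only to keep the sketch bounded; the comment above describes retrying without a stated limit.

```java
// Hypothetical sketch of "ERROR only for known unrecoverable failures".
final class RepairRetrySketch {
    private RepairRetrySketch() {}

    enum RunState { RUNNING, DONE, ERROR }

    /** Hypothetical marker for failures known to be unrecoverable,
     *  e.g. a repair segment mismatch with the cluster topology. */
    static final class UnrecoverableRepairException extends RuntimeException {
        UnrecoverableRepairException(String message) { super(message); }
    }

    /** Hypothetical stand-in for one attempt at advancing the repair run.
     *  Returns true once the run has finished. */
    interface RepairStep {
        boolean attempt() throws Exception;
    }

    static RunState run(RepairStep step, long retryDelayMillis, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                if (step.attempt()) {
                    return RunState.DONE;
                }
            } catch (UnrecoverableRepairException e) {
                // The only path that moves the run to ERROR.
                return RunState.ERROR;
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop without marking ERROR.
                Thread.currentThread().interrupt();
                return RunState.RUNNING;
            } catch (Exception e) {
                // Unknown failure (e.g. a failed JMX call): fall through and retry.
            }
            if (!pause(retryDelayMillis)) {
                break;
            }
        }
        // Attempts exhausted or interrupted: the run is still not in ERROR.
        return RunState.RUNNING;
    }

    /** Sleeps for the given delay; returns false if the thread was interrupted. */
    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The point of the design is that an unexpected exception never changes the run state by itself, so a node restart only delays the run instead of forcing a manual resume with spreaper.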

Bj0rnen pushed a commit that referenced this issue Jun 21, 2017
…original 'reaper_ui' so we can actually use webpack dev server and hot reload when working on UI. (#106)