Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to check file contents causes node to crash during start up [JIRA: RIAK-2162] #708

Open
jenafermiller opened this issue Sep 3, 2015 · 2 comments

Comments

@jenafermiller
Copy link

In lines 1364-1375 of gen_leader.erl, line 1366 verifies that the file needed to start up the replication leader exists and line 1371 reads the file. However, if the file is empty, corrupted, or otherwise contains bytes that do not form a valid external term format line 1372 will throw an exception. This exception causes the gen_leader start up process to be killed which, in turn, causes the riak_repl process to fail to start, and eventually the entire node is taken down during start up.

To fix this issue, the current version of line 1372 could be replaced with the following line so that any non-integer value or error is ignored. This will prevent Riak from starting when the disk is full, when the data directory is read only, etc. however will not ignore file read errors or file write errors.

Incarn = case catch binary_to_term(Bin) of
            I when is_integer(I) -> I;
            _ -> 1
         end,
@Basho-JIRA Basho-JIRA changed the title Failure to check file contents causes node to crash during start up Failure to check file contents causes node to crash during start up [JIRA: RIAK-2162] Sep 3, 2015
@jonmeredith
Copy link
Contributor

The diagnosis makes sense. In the second clause above you should probably handle it the same way as a filesystem error.

           ok = file:write_file(Name,term_to_binary(1)),
           0

I haven't read the rest of the gen_leader code for a very long time, so have no idea how safe just punting and picking incarnation 0 is. This fix would be 'as-good' (or 'as-bad') as the default behavior and tolerate restart after out of disk space.

@jonmeredith
Copy link
Contributor

Read the code a little more carefully - you're proposing that inline which would work fine. Only difference is to match the other could should return 0 rather than 1 I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants