You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In lines 1364-1375 of gen_leader.erl, line 1366 verifies that the file needed to start up the replication leader exists and line 1371 reads the file. However, if the file is empty, corrupted, or otherwise contains bytes that do not form a valid external term format line 1372 will throw an exception. This exception causes the gen_leader start up process to be killed which, in turn, causes the riak_repl process to fail to start, and eventually the entire node is taken down during start up.
To fix this issue, the current version of line 1372 could be replaced with the following line so that any non-integer value or error is ignored. This will prevent Riak from starting when the disk is full, when the data directory is read only, etc. however will not ignore file read errors or file write errors.
Incarn = case catch binary_to_term(Bin) of
I when is_integer(I) -> I;
_ -> 1
end,
The text was updated successfully, but these errors were encountered:
Basho-JIRA
changed the title
Failure to check file contents causes node to crash during start up
Failure to check file contents causes node to crash during start up [JIRA: RIAK-2162]
Sep 3, 2015
The diagnosis makes sense. In the second clause above you should probably handle it the same way as a filesystem error.
ok = file:write_file(Name,term_to_binary(1)),
0
I haven't read the rest of the gen_leader code for a very long time, so have no idea how safe just punting and picking incarnation 0 is. This fix would be 'as-good' (or 'as-bad') as the default behavior and tolerate restart after out of disk space.
Read the code a little more carefully - you're proposing that inline which would work fine. Only difference is to match the other could should return 0 rather than 1 I think.
In lines 1364-1375 of
gen_leader.erl
, line 1366 verifies that the file needed to start up the replication leader exists and line 1371 reads the file. However, if the file is empty, corrupted, or otherwise contains bytes that do not form a valid external term format line 1372 will throw an exception. This exception causes thegen_leader
start up process to be killed which, in turn, causes theriak_repl
process to fail to start, and eventually the entire node is taken down during start up.To fix this issue, the current version of line 1372 could be replaced with the following line so that any non-integer value or error is ignored. This will prevent Riak from starting when the disk is full, when the data directory is read only, etc. however will not ignore file read errors or file write errors.
The text was updated successfully, but these errors were encountered: