skip count can become extremely elevated [JIRA: RIAK-2076] #705

jenafermiller · 2015-08-05T17:56:02Z

The skip count can become extremely high in one specific high-volume situation.

The setup is a set of clusters with cascading replication enabled and realtime replication running between them.
Puts are normally made to cluster A, so the realtime queue that goes from cluster C to cluster B will normally see those puts as skips, because B is already in the routed clusters list. The skip count is not used until something is delivered.
The first time something is put to cluster C, it is replicated to B without a problem. However, after a significant number of puts have been sent A -> B -> C with no objects taking the reverse path (C -> B -> A), the next object sent from C -> B will find that an extremely large skip count has built up.
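The accumulation described above can be sketched as a toy model. This is only an illustration, not the riak_repl code: the module name `skip_model`, the `consumer` record, and all function names are hypothetical, and the reset of the counter on delivery is an assumption made to keep the model simple.

```erlang
-module(skip_model).
-export([new/0, skip/1, deliver/1, demo/1]).

%% Hypothetical consumer state for the C -> B queue: only the running
%% skip count is modelled here.
-record(consumer, {skips = 0 :: non_neg_integer()}).

%% A fresh consumer with nothing skipped yet.
new() -> #consumer{}.

%% An object that already passed through B is skipped, not delivered;
%% the counter grows by one.
skip(#consumer{skips = S} = C) -> C#consumer{skips = S + 1}.

%% The next real delivery is the first point at which the count is used.
%% Returns the accumulated count and (by assumption) a reset consumer.
deliver(#consumer{skips = S} = C) -> {S, C#consumer{skips = 0}}.

%% N puts travel A -> B -> C, then one object is put to C and sent to B.
demo(N) ->
    C1 = lists:foldl(fun(_, C) -> skip(C) end, new(), lists:seq(1, N)),
    {Carried, C2} = deliver(C1),
    {Carried, C2#consumer.skips}.
```

Calling `skip_model:demo(1000).` returns `{1000, 0}`: the single delivery sees every skip that accumulated since the previous delivery, which is why long one-directional traffic can lead to a very large count.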

We saw a customer with a skip count of just over 73 million objects. In our attempts to reproduce the issue, we were able to get skip counts into the single millions, but this was not enough to create the latency observed by the customer.

A restart cleared the skip count and the latency returned to normal in the customer cluster.

nerophon · 2017-03-21T10:43:45Z

This has been seen at another customer, this time without cascading replication. The customer saw the issue in Riak KV 2.0.6. The cause of the elevated skip count in this case is not known, but symptoms did manifest when live traffic was switched from cluster A to cluster B.

More information on the specifics can be found in Zendesk:

https://basho.zendesk.com/agent/tickets/14507

As in the ticket, the engineer involved provides a couple of redbug snippets to assist in reproducing the issue:

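    %% Trace calls to ordsets:add_element (with return values) in every
    %% realtime sink connection process; stop after 1 message or 120 s.
    redbug:start(
      ["ordsets:add_element->return"],
      [{procs, [Pid || {undefined, Pid, worker, [riak_repl2_rtsink_conn]} <-
                           supervisor:which_children(riak_repl2_rtsink_conn_sup)]},
       {msgs, 1},
       {time, 120000}]
    ).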

Or... to monitor messages sent and received:

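    %% Trace messages sent and received by every realtime sink connection
    %% process; stop after 1 message or 120 s.
    redbug:start(
      ['receive', 'send'],
      [{procs, [Pid || {undefined, Pid, worker, [riak_repl2_rtsink_conn]} <-
                           supervisor:which_children(riak_repl2_rtsink_conn_sup)]},
       {msgs, 1},
       {time, 120000}]
    ).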

Unfortunately tracing is not permitted in the customer's production environment, and efforts to reproduce the issue in a test environment have so far failed.

Note that in this case the workaround for the customer was not a restart, but rather a kill of the process involved:

    erlang:exit(list_to_pid("<Rogue process PID goes here>"), kill).

Basho-JIRA changed the title ~~skip count can become extremely elevated~~ skip count can become extremely elevated [JIRA: RIAK-2076] Aug 5, 2015

Basho-JIRA added the JIRA: To Do label Aug 5, 2015

nerophon mentioned this issue Mar 15, 2017

skip count can become extremely elevated basho/riak_ee-issues#31

Open
