When a leveled_cdb process is in the delete_pending state, it will call delete_confirmed back to the inker (after a timeout) to see if it can be removed. The file will then be removed if and only if there are no snapshots requiring access to it.
In the ledger/penciller, by contrast, the leveled_sst files use a cast to the penciller for confirm_delete.
In a production system, some timeouts have been observed when leveled_cdb attempts to confirm_delete. This call uses the default 5s timeout. In all three cases where we have seen this, the leveled_cdb process had a key_check message in its message queue when it hit the timeout.
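To make the call/cast difference concrete, here is a minimal sketch using a generic gen_server. It is not leveled's code: the module name, the message shapes and the artificially busy server are all assumptions for illustration. The call uses an explicit 50ms timeout to keep the demo short; `gen_server:call/2` would wait the default 5000ms.

```erlang
-module(confirm_delete_sketch).
-behaviour(gen_server).

-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Hypothetical stand-in for a busy inker: every request takes BusyMs to serve.
init(BusyMs) ->
    {ok, BusyMs}.

handle_call({confirm_delete, _File}, _From, BusyMs) ->
    timer:sleep(BusyMs),
    {reply, true, BusyMs}.

handle_cast({confirm_delete, _File, _CdbPid}, BusyMs) ->
    timer:sleep(BusyMs),
    {noreply, BusyMs}.

demo() ->
    {ok, Inker} = gen_server:start(?MODULE, 200, []),
    %% Synchronous: the caller blocks and exits with {timeout, _} if no
    %% reply arrives in time (50ms here, 5000ms by default).
    CallResult =
        try gen_server:call(Inker, {confirm_delete, "file.cdb"}, 50) of
            Reply -> {replied, Reply}
        catch
            exit:{timeout, _} -> timed_out
        end,
    %% Asynchronous: the cast returns ok immediately, however busy the
    %% server is, so the sender can never time out on it.
    CastResult = gen_server:cast(Inker, {confirm_delete, "file.cdb", self()}),
    gen_server:stop(Inker),
    {CallResult, CastResult}.
```

With a server that needs 200ms per request, `confirm_delete_sketch:demo()` returns `{timed_out, ok}`: the call gives up, while the cast is simply queued.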
key_check messages are generated by inkers, but this can't be a simple deadlock: given that the leveled_cdb process is in the delete_pending state, only a snapshot inker should be sending a key_check message.
Is it a more complicated 3-way deadlock, e.g. leveled_cdb calls leveled_inker (confirm_delete), leveled_inker is waiting on the leveled_inker snapshot, and the leveled_inker snapshot is waiting on a key_check call to leveled_cdb? It is not obvious what message between inker and snapshot would cause this.
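The hypothesised cycle can be modelled with three plain processes that each make a blocking request to the next one and serve nothing while they wait. This is only a sketch of the hypothesis, not a reproduction: the module, the helper names and in particular the `wait_on_snapshot` request are invented for illustration, since no real inker-to-snapshot message that would block has been identified.

```erlang
-module(call_cycle_sketch).
-export([demo/0]).

%% Blocking request in the style of a synchronous call: send, then wait
%% only for the matching reply. Incoming requests are not served meanwhile.
sync_call(To, Request, TimeoutMs) ->
    Ref = make_ref(),
    To ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Reply} -> {ok, Reply}
    after TimeoutMs ->
        timeout
    end.

node_proc(Report, Request) ->
    receive {go, Next} -> ok end,
    Result = sync_call(Next, Request, 100),
    %% Only after giving up do we look at what was queued behind us.
    Queued = receive {request, _, _, Q} -> Q after 0 -> none end,
    Report ! {self(), Result, Queued}.

demo() ->
    Self = self(),
    Cdb = spawn(fun() -> node_proc(Self, confirm_delete) end),
    %% wait_on_snapshot is a hypothetical request, see the note above.
    Inker = spawn(fun() -> node_proc(Self, wait_on_snapshot) end),
    Snap = spawn(fun() -> node_proc(Self, key_check) end),
    Cdb ! {go, Inker},
    Inker ! {go, Snap},
    Snap ! {go, Cdb},
    [receive
         {P, Result, Queued} -> {Result, Queued}
     after 1000 ->
         no_report
     end || P <- [Cdb, Inker, Snap]].
```

All three processes time out, and the process standing in for leveled_cdb finds a `key_check` request in its queue afterwards, which is the pattern a cycle of this shape would leave behind.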
If this issue is caused by a deadlock, then changing the timeout cannot resolve it. In that case the answer, as in leveled_sst, is to make delete_confirmed an async message.
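A rough sketch of what an asynchronous confirmation could look like, using plain processes rather than leveled's actual behaviours. Everything here is an assumption for illustration: the module, the message names, the retry interval and the placeholder `missing` reply to key_check.

```erlang
-module(async_delete_sketch).
-export([demo/0]).

%% Hypothetical manager (standing in for the inker). It answers a
%% confirm_delete request only once no snapshot needs the file.
manager(Snapshots) ->
    receive
        {confirm_delete, FilePid} when Snapshots =:= 0 ->
            FilePid ! delete_confirmed,
            manager(Snapshots);
        {confirm_delete, _FilePid} ->
            %% Still in use: send nothing, the file process will ask again.
            manager(Snapshots);
        release_snapshot ->
            manager(max(0, Snapshots - 1));
        stop ->
            ok
    end.

%% Hypothetical file process in delete_pending: it asks without blocking,
%% then keeps serving its mailbox until the confirmation arrives.
delete_pending(Manager, RetryMs, Owner) ->
    Manager ! {confirm_delete, self()},
    pending_loop(Manager, RetryMs, Owner).

pending_loop(Manager, RetryMs, Owner) ->
    receive
        delete_confirmed ->
            Owner ! {deleted, self()};
        {key_check, From, _Key} ->
            %% Placeholder reply; a real file would look the key up.
            From ! {key_check_reply, missing},
            pending_loop(Manager, RetryMs, Owner)
    after RetryMs ->
        delete_pending(Manager, RetryMs, Owner)
    end.

demo() ->
    Manager = spawn(fun() -> manager(1) end),
    Owner = self(),
    File = spawn(fun() -> delete_pending(Manager, 20, Owner) end),
    %% key_check is still served while the delete is pending.
    File ! {key_check, self(), <<"k1">>},
    KeyCheck = receive {key_check_reply, R} -> R after 1000 -> no_reply end,
    Manager ! release_snapshot,
    Deleted = receive {deleted, File} -> true after 1000 -> false end,
    Manager ! stop,
    {KeyCheck, Deleted}.
```

Because the file process never blocks on the manager, it cannot time out on the request or take part in a wait cycle; the cost is that it has to retry until the confirmation arrives.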
It doesn't appear that a simple deadlock can occur. The only messages between inker and snapshot are to register and release the snapshot, and neither can occur concurrently with a snapshot sending a key_check message. A non-snapshot inker should not send a key_check message to a leveled_cdb in the delete_pending state.
Is this just an unfortunate coincidence, where 5s was not long enough for confirm_delete because the inker was busy?