Inker delete_confirmed call not cast #341

Closed
martinsumner opened this issue Jul 26, 2021 · 1 comment

Comments

@martinsumner
Owner

When a leveled_cdb process is in the delete_pending state, it calls delete_confirmed back to the inker (after a timeout) to check whether the file can be removed. The file is removed if and only if no snapshot still requires access to it.

In the ledger/penciller, by contrast, the leveled_sst files use a cast to the penciller for confirm_delete.

In a production system, some timeouts have been observed when leveled_cdb attempts to confirm_delete. This call uses the default 5s timeout. In all three cases where we have seen this, the leveled_cdb process had a key_check message in its message queue when it hit the timeout.

key_check messages are generated by inkers. However, this can't be a simple deadlock: given that the leveled_cdb process is in the delete_pending state, only a snapshot inker should be sending it a key_check message.

Is it a more complicated 3-way deadlock? For example: leveled_cdb calls leveled_inker (confirm_delete), leveled_inker is waiting on a leveled_inker snapshot, and that snapshot is waiting on a key_check call to leveled_cdb. It is not obvious what message between inker and snapshot would cause this.
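To make the hypothesised 3-way deadlock concrete, here is a minimal sketch in Python (not leveled code) that models "who is blocked waiting on whom" as a wait-for graph and checks it for a cycle. The process labels, including `inker_snapshot`, and the `find_cycle` helper are illustrative assumptions. The inker-to-snapshot edge is exactly the one for which no causing message has been identified.

```python
def find_cycle(waits_for):
    """Return the processes forming a wait cycle, or None.

    waits_for maps a blocked process to the single process it is
    waiting on (a process in a synchronous call waits on one peer).
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        node = waits_for.get(start)
        while node is not None:
            if node == start:
                return path          # walked back to where we began
            if node in seen:
                break                # a cycle, but not through start
            seen.add(node)
            path.append(node)
            node = waits_for.get(node)
    return None


# Hypothesised scenario from the text (labels are illustrative).
hypothesis = {
    "leveled_cdb": "leveled_inker",      # confirm_delete call
    "leveled_inker": "inker_snapshot",   # unknown message (assumed)
    "inker_snapshot": "leveled_cdb",     # key_check call
}

# Same scenario without the unexplained inker -> snapshot wait.
without_unknown_edge = {
    "leveled_cdb": "leveled_inker",
    "inker_snapshot": "leveled_cdb",
}
```

With all three edges present `find_cycle` reports a cycle through all three processes; without the inker-to-snapshot edge there is none, which is why identifying that message matters for the deadlock theory.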

If this issue is caused by a deadlock, then changing the timeout cannot resolve it. In that case the answer, as in leveled_sst, is to make delete_confirmed an async message.
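The difference between the two messaging styles can be sketched with a toy model. This is a Python simulation using threads and queues, not Erlang and not the leveled implementation; the `Inker` class, its `busy_for` parameter, and the helper functions are all hypothetical. A call blocks the caller until a reply arrives or the timeout fires, while a cast only enqueues the request and lets the reply come back later as an ordinary message.

```python
import queue
import threading
import time


class Inker:
    """Toy stand-in for the inker: handles one message at a time."""

    def __init__(self, busy_for=0.0):
        self.mailbox = queue.Queue()
        self.busy_for = busy_for  # simulated work before each reply
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            msg, reply_to = self.mailbox.get()
            time.sleep(self.busy_for)
            # Assumption for this model: the delete is always allowed.
            reply_to.put(("delete_confirmed", msg))

    def call(self, msg, timeout):
        """Synchronous request: block until a reply or the timeout."""
        reply = queue.Queue()
        self.mailbox.put((msg, reply))
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(f"no reply to {msg!r} within {timeout}s")

    def cast(self, msg, reply_to):
        """Asynchronous request: enqueue and return immediately."""
        self.mailbox.put((msg, reply_to))


def confirm_with_call(inker, timeout):
    # The caller is stuck here for up to `timeout` seconds.
    try:
        return inker.call("confirm_delete", timeout)
    except TimeoutError:
        return "timeout"


def confirm_with_cast(inker, own_mailbox):
    # The caller stays free to handle other messages (e.g. key_check).
    inker.cast("confirm_delete", own_mailbox)
    return "sent"
```

In this model a busy inker makes `confirm_with_call` time out, whereas `confirm_with_cast` returns at once and the confirmation simply arrives in the caller's mailbox once the inker gets to it. The cast form also cannot take part in a wait cycle, because the sender never blocks.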

It doesn't appear that a simple deadlock can occur. The only messages between inker and snapshot are to register and release the snapshot, which cannot occur concurrently with a snapshot sending a key_check message. A non-snapshot inker should not send a key_check message to a leveled_cdb in the delete_pending state.

Is this just an unfortunate coincidence, where 5s was not long enough for confirm_delete because the inker was busy?

@martinsumner
Owner Author

#342
