Issue with Opened File Descriptors when creating a checkpoint #1116
Comments
After checking further, I noticed that file descriptors (FDs) are left open when a checkpoint is created. Running lsof -a -p <beam.smp PID> showed many entries marked as deleted that still had an open FD.
On the validators with problems, I had around 7500 open file descriptors (OFD). On a validator without any problems, I had around 450 OFD.
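The lsof check above can be approximated with a short script. This is a minimal sketch, assuming a Linux host where /proc/<pid>/fd is readable (same user as the process, or root). The function name count_fds is hypothetical and not part of the validator or rocksdb. It counts all open descriptors for a process and how many point at files that have been deleted.

```python
import os
import sys


def count_fds(pid):
    """Return (total, deleted) open file descriptor counts for a process.

    Assumes Linux: each entry in /proc/<pid>/fd is a symlink to the open
    file, and the kernel appends " (deleted)" to the link target once the
    file has been unlinked.
    """
    fd_dir = f"/proc/{pid}/fd"
    total = 0
    deleted = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed while we were scanning.
            continue
        total += 1
        if target.endswith(" (deleted)"):
            deleted += 1
    return total, deleted


if __name__ == "__main__":
    # Pass the beam.smp PID as the first argument; defaults to this process.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    total, deleted = count_fds(pid)
    print(f"pid={pid} open_fds={total} deleted={deleted}")
```

Running it periodically against the beam.smp PID would show whether the deleted count keeps growing after each checkpoint.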
Did raising the max file descriptor limit stop this from happening? I'm not saying there isn't a potential leak, but rocksdb can have a lot of file descriptors open while it compacts its database.
It stopped the validator from going offline, but the issue still occurs. I can still see OFD marked as deleted. If I restart the validators, those descriptors are cleared. I don't know exactly how rocksdb works, but I've seen several reports online about this behaviour. I think that when checkpoints are created and files are renamed, the OFD for those files are not released and remain open until you restart the validator. Maybe facebook/rocksdb#8041 can help. From what I can see, all OFD marked as deleted are related to checkpoints. I can't see any other OFD marked like that.
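The restart behaviour described above is consistent with general POSIX semantics: a file that has been unlinked stays allocated, and its descriptor stays open, until the holding process closes it or exits. The following is a minimal sketch of that behaviour only. It does not show where rocksdb or the validator holds the descriptors, and the function name is hypothetical.

```python
import os
import tempfile


def unlinked_file_still_readable():
    """Show that an open descriptor outlives the deletion of its file.

    Returns (path_exists_after_unlink, data_read_through_fd).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"checkpoint data")
        os.unlink(path)                  # the directory entry is gone...
        exists = os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)          # ...but the descriptor still works
    finally:
        os.close(fd)                     # only now is the file released
    return exists, data


if __name__ == "__main__":
    print(unlinked_file_still_readable())
```

While such a descriptor is open, lsof lists the file as deleted, which matches the entries seen on the affected validators.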
I had 5 validators go offline. After checking the logs, I saw that the number of open file descriptors had exceeded the 60k ulimit I had set on the validator.
Below are the logs that I found on the validators. The logs are just from one validator but all 5 had the same type of logs.
2021-10-06 09:27:34.457 [warning] <0.2165.0>@blockchain_txn:separate_res:388 BUG: unexpected txn validation error format blockchain_txn_poc_receipts_v1 : {'EXIT',{{db_open,"IO error: While open a file for random read: /var/data/checkpoints/1041899/ledger.db-1633512434492386608/ledger.db/6469986.sst: No file descriptors available"},[{blockchain_ledger_v1,open_db_,6,[{file,"blockchain_ledger_v1.erl"},{line,4295}]},{blockchain_ledger_v1,new,3,[{file,"blockchain_ledger_v1.erl"},{line,406}]},{blockchain_ledger_v1,context_snapshot,1,[{file,"blockchain_ledger_v1.erl"},{line,814}]},{blockchain,'-fold_blocks/5-fun-1-',4,[{file,"blockchain.erl"},{line,589}]},{lists,foldl,3,[{file,"lists.erl"},{line,1267}]},{blockchain,ledger_at,3,[{file,"blockchain.erl"},{line,539}]},{blockchain_txn_poc_receipts_v1,check_is_valid_poc,2,[{file,"blockchain_txn_poc_receipts_v1.erl"},{line,274}]},{blockchain_txn_poc_receipts_v1,is_valid,2,[{file,"blockchain_txn_poc_receipts_v1.erl"},{line,189}]}]}} / type=poc_receipts_v1 hash="1JH5e1LAm2FzzWjjjAVHcq2v1EiEQCh28ocmAtkaNt9rsgq6Xe" challenger="shambolic-ivory-mandrill" onion="12kowuYhG3Qi3gVkCV6P3SBnaDFoRtiQtta5ifsxQxLwStWoC7B" path:
2021-10-06 09:27:31.552 [warning] <0.2172.0>@blockchain_ledger_v1:has_snapshot:922 couldn't find checkpoint dir? for 1041900
2021-10-06 09:27:31.552 [warning] <0.2172.0>@blockchain_ledger_v1:has_snapshot:922 couldn't find checkpoint dir? for 1041900
2021-10-06 09:27:31.552 [warning] <0.2172.0>@blockchain_ledger_v1:has_snapshot:922 couldn't find checkpoint dir? for 1041899
2021-10-06 09:27:31.965 [warning] <0.2173.0>@blockchain_ledger_v1:has_snapshot:922 couldn't find checkpoint dir? for 1041899
2021-10-06 09:29:39.628 [info] <0.2531.0>@blockchain_ledger_v1:context_snapshot:824 renamed checkpoint from "/var/data/checkpoints/1041896/ledger.db-1633512578873078916/ledger.db" to "/var/data/checkpoints/1041896/ledger.db"
2021-10-06 09:29:39.629 [info] <0.2531.0>@blockchain_ledger_v1:has_snapshot:892 loading checkpoint from disk with ledger mode delayed
Looking at the graphs of open file descriptors, I noticed a spike on all validators that showed this behaviour. I think that in some circumstances the file descriptors are not being closed, so the open file limit is eventually reached and the validator goes offline.
Below are the graphs showing the spike in OFD.
https://ibb.co/SPqfDqz
https://ibb.co/f8YHZJz
https://ibb.co/jhdsJxC
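To watch how close a process is to its open file limit before it goes offline, the descriptor count can be compared against the limit the kernel reports for that process. This is a minimal sketch, assuming Linux and read access to /proc/<pid>. The function name fd_usage is hypothetical.

```python
import os
import sys


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a process.

    open_fds is the number of entries in /proc/<pid>/fd. soft_limit is the
    soft "Max open files" value from /proc/<pid>/limits, or None if the
    kernel reports it as unlimited.
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Fields: "Max", "open", "files", <soft>, <hard>, <units>
                soft = line.split()[3]
                if soft != "unlimited":
                    soft_limit = int(soft)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    # Pass the beam.smp PID as the first argument; defaults to this process.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    open_fds, soft_limit = fd_usage(pid)
    print(f"pid={pid} open_fds={open_fds} soft_limit={soft_limit}")
```

When inspecting the current process, the count includes the descriptor that os.listdir briefly opens for the directory scan, so it can be one higher than the steady-state value.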