Conversation
LGTM, minor nitpicks
ethstore/src/dir/disk.rs
Outdated
fn last_modification_date(&self) -> Result<u64, Error> {
    use std::time::UNIX_EPOCH;
    let duration = fs::metadata(&self.path)?.modified()?.duration_since(UNIX_EPOCH).unwrap();
Use expect instead of unwrap here?
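For illustration, the suggested change could look roughly like this; the Ok(duration.as_secs()) tail and the expect message are assumptions, not code from the PR, and the function is assumed to live in the same impl as the original:

fn last_modification_date(&self) -> Result<u64, Error> {
    use std::time::UNIX_EPOCH;
    // expect() with a message documents why the failure is treated as impossible,
    // unlike a bare unwrap().
    let duration = fs::metadata(&self.path)?
        .modified()?
        .duration_since(UNIX_EPOCH)
        .expect("modification time of an existing file is after UNIX_EPOCH; qed");
    Ok(duration.as_secs())
}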
return Ok(())
let mut last_timestamp = self.timestamp.lock();
let now = Instant::now();
if (now - last_timestamp.last_checked) > Duration::from_secs(REFRESH_TIME_SEC) {
I would rather store last_timestamp.valid_till to avoid subtraction (and possible overflow in case the time is adjusted by the system):
if last_timestamp.valid_till < now {
}
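A minimal sketch of that approach, assuming the timestamp struct gains a valid_till: Instant field (the field name and the refresh body are illustrative):

let mut last_timestamp = self.timestamp.lock();
let now = Instant::now();
// Compare against a precomputed deadline instead of subtracting Instants.
if last_timestamp.valid_till < now {
    // ... refresh the cache here ...
    last_timestamp.valid_till = now + Duration::from_secs(REFRESH_TIME_SEC);
}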
Instant is monotonic. It is not affected by clock changes and is guaranteed to never decrease.
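A tiny standalone example of that guarantee:

fn main() {
    use std::time::Instant;

    let earlier = Instant::now();
    let later = Instant::now();
    // Instant is monotonic: `later` is never before `earlier`, even if the
    // system clock is adjusted, so this subtraction cannot underflow.
    let elapsed = later - earlier;
    println!("elapsed: {:?}", elapsed);
}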
Fair enough.
ethstore/src/ethstore.rs
Outdated
.cloned()
.ok_or(Error::InvalidAccount)
let cache = self.cache.read();
let mut r = cache.range((::std::collections::Bound::Included(*address), ::std::collections::Bound::Included(*address)));
use ::std::collections::Bound at the top of the function?
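A sketch of that suggestion; the function name, signature, cache type (a BTreeMap keyed by address), and the map over the range iterator are assumptions:

fn get_account(&self, address: &Address) -> Result<SafeAccount, Error> {
    // Import once at the top of the function instead of spelling out the full path twice.
    use std::collections::Bound;

    let cache = self.cache.read();
    let mut r = cache.range((Bound::Included(*address), Bound::Included(*address)));
    r.next()
        .map(|(_, account)| account)
        .cloned()
        .ok_or(Error::InvalidAccount)
}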
This fixes issue #6787 and is now finding/keeping peers.
let mut last_timestamp = self.timestamp.lock();
let now = Instant::now();
if (now - last_timestamp.last_checked) > Duration::from_secs(REFRESH_TIME_SEC) {
    let dir_hash = Some(self.dir.unique_repr()?);
This appears to be creating a disk IO read problem, because every 5 seconds you end up looping over and hashing all files in the keystore. If that is a large number of files (150,000 in my case), it eventually causes a total server crash.
Is there a better way to detect changes in the keystore dir? Possibly only detect a change when a new key is added via the personal_newAccount... calls, or a simple file count?
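For comparison, a simple file-count heuristic could look something like this (a sketch only, not what the PR does; it would also miss in-place edits to existing key files):

use std::fs;
use std::io;
use std::path::Path;

// Count directory entries as a cheap "did anything change?" heuristic.
fn key_file_count(dir: &Path) -> io::Result<usize> {
    Ok(fs::read_dir(dir)?.count())
}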
unique_repr is changed to return the directory's last modification date in this PR, so there is no more hashing here.
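Roughly, the idea is the following sketch; the exact signature of unique_repr here is an assumption:

// Instead of hashing every key file, unique_repr now just reuses the
// directory's last modification date (see last_modification_date above).
fn unique_repr(&self) -> Result<u64, Error> {
    self.last_modification_date()
}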
@AdvancedStyle are you saying your server crashes with the changes from this PR?
Yes, I'm doing more testing and will provide more information shortly.
Basically, once I compile/run latest master with this PR it fixes the peer connections, but then crashes the server (with a noticeably huge amount of disk reading).
But I'll do a bit more testing and see if I can gather some actual error messages from the server crash.
Ok, I've been using iostat / iotop / tracefile (https://gitlab.com/ole.tange/tangetools/tree/master/tracefile) to monitor disk IO, and here are the results:
With latest master the iostat util is quite low (less than 5%); however, tracefile shows that parity is constantly accessing the keystore files. That is not causing high disk util and not causing the server crash, it is only causing peers to drop/timeout and syncing to stop.
After PR #6827 the iostat util goes through the roof (50-100% disk util all the time):
iostat output sample:
avg-cpu: %user %nice %system %iowait %steal %idle
23.14 0.00 20.48 13.56 0.27 42.55
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 5339.00 0.00 49176.00 0.00 18.42 0.75 0.14 0.14 0.00 0.09 49.60
avg-cpu: %user %nice %system %iowait %steal %idle
20.21 0.00 17.32 14.70 0.26 47.51
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 17.00 5192.00 3.00 46776.00 80.00 18.04 0.74 0.14 0.14 2.67 0.11 55.60
avg-cpu: %user %nice %system %iowait %steal %idle
25.07 0.00 15.04 10.55 0.53 48.81
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 28.00 4083.00 2.00 38940.00 120.00 19.12 0.60 0.15 0.15 2.00 0.10 42.40
avg-cpu: %user %nice %system %iowait %steal %idle
22.13 0.00 18.67 15.73 0.27 43.20
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 10.00 3744.00 4.00 32000.00 56.00 17.11 0.83 0.22 0.21 1.00 0.17 64.00
avg-cpu: %user %nice %system %iowait %steal %idle
24.42 0.00 15.32 32.21 0.26 27.79
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 13.00 2987.00 4.00 28780.00 68.00 19.29 2.07 0.70 0.70 1.00 0.27 80.80
But the strange thing is that tracefile (the strace script) is not really seeing a lot of file access by the process (parity is no longer accessing the keystore files individually); it only appears to be accessing the keystore dir, so why all the disk IO?
Sample of tracefile output:
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484066.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484067.log
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484068.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484066.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484068.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484069.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484069.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484066.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484064.sst
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484070.log
/sys/devices/system/cpu/online
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484071.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484071.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484072.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484072.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484071.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484068.sst
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum
The VM then hangs / crashes after a few minutes with hung_task_timeout_secs errors:
INFO: task kworker/[...] blocked for more than 120 seconds.
Not tainted [...]
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
And a bunch of similar hung tasks.
Conclusion:
For some reason the modified-time check on the keystore dir might be creating a lot of read IO?
The IO must be just the database IO. Does the machine have an HDD or an SSD? How exactly does it crash? Is there a backtrace printed?
SSD. The server is running in a KVM virtual machine (Debian host, Ubuntu 16 guest).
The entire virtual machine just locks up: SSH is non-responsive and the virt-manager console is non-responsive. Then it dumps a series of error messages in the virt-manager console similar to:
INFO: task kworker/[...] blocked for more than 120 seconds.
Not tainted [...]
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Note: I'm running parity with these params:
screen -d -m parity --cache-size 4096 --allow-ips public --jsonrpc-apis web3,eth,net,parity,traces,rpc,personal --geth --keys-path /home/ethereum/.ethereum/keystore
I've decided to clone the VM and try to delete/resync the chain from 0 to see if that helps. So far it seems to be doing OK (no IO read problems so far); will post back shortly.
Closes #6456, #6787