regression with 2.3.3 and 2.2.10: huge disk reads #10361
I can confirm I am seeing this as well. I went from zero iowait to upwards of 24%. This is on a hardware RAID-10 of Samsung Pro SSDs.
Same here on 3 archive+trace nodes. This seems to affect performance in general in a very bad way, and now, due to the RPC attack issue, we can't downgrade either.
Can people here mention which version they are coming from? Specifically whether they are updating from pre-2.2.9/2.3.2 or only a single patch number.
@joshua-mir In our case, at least 3 of 4 nodes went from v2.2.7 to v2.2.10 (v2.2.7 also had some kind of performance drop compared to v2.1.x, but I did not have a chance to document it and it was not nearly this big).
@joshua-mir I only saw the issue going from 2.2.9 to 2.2.10 and from 2.3.2 to 2.3.3. In my case, when the volume's BurstBalance hits 0% and IOPS is limited to a very low baseline, this immediately results in peers being dropped (I guess parity-ethereum isn't replying fast enough to incoming packets from peers because it is stuck in iowait for long periods).
Mine was from 2.2.9 to 2.2.10 as well.
I'm experiencing the same issues. Read operations between 2.2.9 and 2.2.10 are significantly higher on average; read ops over 3 days are below. Block propagation times are suffering as a result: 2.2.9 averages around 500-600 ms while 2.2.10 averages around 1.3 s on the same specs. This is something that worries me for later versions; I really don't want to worry about the integrity and performance of our Parity boxes on every new version I upgrade to, especially as 2.2.10 was for a critical security fix.
Could we remove the …? This issue currently prevents us from upgrading beyond 2.2.9/2.3.2, and I guess the soon-to-be-released 2.3.5-stable and 2.4.0-beta won't have a fix for this issue.
Any update on this? |
We're seeing some encouraging signs with v2.3.8, can anyone else take a look? It looks almost the same as v2.2.7.
I was waiting to hear something like this. Will test ASAP. Thanks @tzapu
@tzapu Can't confirm. Still heavy IO with 2.3.8.
k, uploaded. sha256sums:
Note: these binaries are master (2.5.0) without the specified commits (or any dependent commits). I haven't checked whether they work properly in normal use cases, to be honest, only that they run. You will likely have the best experience with "…".
If anyone is interested in testing: you can download the binaries from my Dropbox. But please check the sha256sums posted above by @joshua-mir before executing binaries downloaded from a stranger on the internet! Testing right now, will report back on Monday.
@c0deright Can you use my script and do some RPC port load testing as well? It requires GNU parallel to be installed: "apt install parallel" or "yum install parallel". Maybe do 10000 requests or so. This would emulate more of what a pool does.
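For reference, a minimal sketch of this kind of RPC load test, assuming GNU parallel and curl are available and parity's HTTP JSON-RPC is on the default port 8545; the endpoint, request count and concurrency are placeholders, not the original script:

```bash
#!/usr/bin/env bash
# Fire many eth_blockNumber requests at the JSON-RPC port in parallel,
# roughly emulating pool-style load (not the original script).
RPC_URL="http://127.0.0.1:8545"  # assumed default parity HTTP RPC endpoint
REQUESTS=10000                   # total number of requests
JOBS=50                          # concurrent workers

rpc_call() {
  curl -s -o /dev/null -X POST "$1" \
    -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
}
export -f rpc_call

# -N0: run the command once per input line without passing the line as an argument.
seq "$REQUESTS" | parallel -j "$JOBS" -N0 rpc_call "$RPC_URL"
```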
Can't do it right now. I'm away from the keyboard until Monday and just have the 4 binaries running for ~18 hours each on that t3.medium instance. Maybe afterwards, when one binary sticks out, I can do your test.
Wow, so the binary I thought would be the most stable is the most unstable.
Let's not jump to conclusions too easily. I still think that peers connected to my node might play a role here. When e.g. a peer requests old blocks that are not in my node's cache, it has to read them from disk, right? So maybe that was the case here, where one or more peers requested old blocks frequently. That's why I'm retesting … From the old graphs in this thread one can see that the read spike did in fact start right after one of the affected parity versions was started, not several hours into normal operation. So this might really be a stupid coincidence caused by a peer after 3 hours of runtime.
As I said, I did retest … I think it's safe to say that all 4 binaries @joshua-mir supplied don't show this behaviour and … Right now I'm testing …
Thanks for all that work regardless 🙏 |
I just looked at the network graphs: there is no out-of-the-ordinary traffic anomaly that correlates with the performance degradation caused by heavy disk reads. If it's really caused by peers, then it's not because we're sending huge amounts of blocks read from disk to our peers.
I used a modified version (diff) of your script to run on my test machine against parity-2.4.4-beta:
Results:
I've rolled v2.4.5-stable out to our Parity instances and it's looking a lot better on my end as well (presuming 2.4.5 has the fix in?). I'm seeing around 500 IOPS on average compared to around 1200 IOPS before. This is only the first 5 hours of running the version, but on the previous bad version I saw high disk usage by this point. My only complaint, which has always been the case with Parity, is that the read size per operation seems quite inefficient: each read is around 20 KB, when even an SSD with a small block size can typically handle around 256 KB per operation. If reads were batched a bit better, you could really reduce the cost of running a full node.
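As a rough way to see the per-operation read size discussed above, the extended device statistics can be watched while parity runs; a generic sketch assuming sysstat is installed and the database sits on /dev/nvme0n1 (adjust the device):

```bash
# Extended per-device stats every 5 seconds:
#   r/s      - read requests per second
#   rkB/s    - kilobytes read per second
#   rareq-sz - average read request size in kB (older sysstat shows avgrq-sz
#              in 512-byte sectors instead)
iostat -xd 5 /dev/nvme0n1
```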
Yeah same. Thanks for linking. |
@c0deright I'm sure. I did raise an eyebrow at it, considering you were testing 2.4.4-beta. This is from the latest docker …
@joshua-mir How is this possible when in fact there isn't even a 2.4.5 tag, let alone a 2.4.5 release?
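For what it's worth, the version reported by a published Docker image can be checked directly; a small sketch, assuming the parity/parity image on Docker Hub and that its entrypoint accepts --version (the tag is an example):

```bash
# Pull the current stable tag and print the version string the binary reports.
docker pull parity/parity:stable
docker run --rm parity/parity:stable --version
```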
@c0deright #10561 has a change to our CI pipeline which must have accidentally pushed to Docker Hub - I'll let our DevOps team know.
Edit: that PR is probably unrelated.
Edit 2: this is expected - stable and beta are always ahead of releases on Docker. @jleeh
confirming 2.4 is stable now. b6a764d |
For me this issue is fixed with v2.3.9-stable and v2.4.4-beta. Does anybody else see heavy disk IO with these versions? |
v2.3.9-stable has been running for ~43 hours now, and in the morning there was a 2-hour window where read operations per second increased by a factor of 8 and then went back to normal again. The logfile doesn't seem to indicate anything at all besides 3 reorgs at the time read ops began to increase. I still think this issue is fixed, but there still seems to be an issue with potentially peer-induced read spikes.
So you are seeing IO spiking on reorgs, maybe? I'll close this for now; if we see evidence that the issue hasn't been resolved in ordinary cases I'll reopen, or perhaps open a new issue if we can confirm this is reorg-related (although high IO on a reorg is probably expected).
I can't confirm whether it has anything to do with reorgs, just that there were 3 in the log at the time. It might be completely unrelated. I'll test some more with v2.3.9-stable and then v2.4.5-stable, and might then upgrade prod servers in a week or two. Thanks for helping debug this, @joshua-mir
I see this issue again with 2.6.0-beta. parity v2.6.0 is constantly reading from disk even with only 5 peers connected. I'm comparing disk IO to v2.2.9, which we're still running in prod because of #10371. Running parity on AWS EBS volumes of type gp2 is almost impossible because it drains the IO credits so fast. I have v2.6.0 now configured to only connect to my private nodes (…).
Edit: Seems to be … Upgrading the VM from 4GB of RAM to 8GB fixes the massive reads.
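For illustration, a hedged sketch of limiting parity to a fixed peer set and enlarging its cache; --reserved-peers, --reserved-only, --max-peers and --cache-size are existing parity-ethereum CLI options, but the file path and values here are placeholders rather than the exact configuration used above:

```bash
# enode:// URLs of the private nodes, one per line (placeholder path)
RESERVED=/etc/parity/reserved-peers.txt

# --reserved-peers/--reserved-only restrict connections to the listed enodes;
# --cache-size (in MiB) enlarges the in-memory database cache to cut disk reads.
parity \
  --reserved-peers "$RESERVED" \
  --reserved-only \
  --max-peers 5 \
  --cache-size 4096
```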
Yesterday I upgraded my stable boxes from 2.2.9 to 2.2.10 and my test box from 2.3.2-beta to 2.3.3-beta.
IO seems to be much higher than with 2.2.9 and 2.3.2, because a couple of hours later all the EBS gp2 volumes the EC2 instances use had exhausted their BurstBalance credits.
From the monitoring I can see that versions 2.2.10 and 2.3.3 do many more disk reads than the previous versions. Switching back to 2.2.9 and 2.3.2 resolved the issue: I let the systems run for several hours with no heavy IO.
Switching back to 2.3.3 and 2.2.10: heavy disk reads were happening again.
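For anyone reproducing this, a minimal sketch of tracking the gp2 credit drain from the command line, assuming the AWS CLI is configured and vol-0123456789abcdef0 is a placeholder volume ID:

```bash
# Average EBS BurstBalance (percent) in 5-minute buckets over the last 6 hours.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```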