Quorum gets OOM killed, strange log output #678
Comments
Will check if the issue still occurs with Quorum 2.2.3...
I notice the messages in your log relating to sync failure and "impossible reorg". It's also worth verifying that the genesis on that node matches the other nodes.
Hi! I have upgraded to 2.2.3. The i/o timeout errors still occur. It looks like they come from the RPC API. I'm trying to verify whether anyone in our team might be running scripts that could cause the instance to crash. Perhaps it is related to ethereum/go-ethereum#17016. I verified that the genesis JSON is the same across all nodes:
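(For reference, one way to do that check is to compare the genesis block hash each node reports over JSON-RPC. A minimal sketch; the node URLs are placeholders and the `requests` library is assumed:)

```python
# Compare the genesis block hash across several nodes over JSON-RPC.
# The node URLs below are hypothetical placeholders.
import requests

NODES = ["http://node1:22000", "http://node2:22000", "http://node3:22000"]

def genesis_hash(url):
    payload = {"jsonrpc": "2.0", "id": 1,
               "method": "eth_getBlockByNumber", "params": ["0x0", False]}
    return requests.post(url, json=payload, timeout=10).json()["result"]["hash"]

hashes = {url: genesis_hash(url) for url in NODES}
print(hashes)
print("genesis matches" if len(set(hashes.values())) == 1 else "genesis MISMATCH")
```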
Our nginx in front of geth prints this on a regular basis, which seems to match the i/o errors. We are still trying to find out what request is being sent.
One of our developers says it might be this call; he mentioned he reads 500k blocks at a time in a loop.
While trying to find out what the issue might be, we got this error (we haven't modified the instance).
Do you recall what activity was occurring when you got that particular error?
Could there be an underlying issue? Perhaps a disk issue?
Hi @SatpalSandhu61 and @fixanoid, we are pretty sure the error is caused by the RPC API. We disabled the RPC ingress, which made the errors go away, and the instance continues to run with normal memory usage. We were also able to isolate which exact requests cause the instance to print the i/o timeout errors. My colleague is sending a series of 1000 requests at a time for collecting individual block data:
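(The exact request bodies aren't reproduced in this excerpt. Based on the description, each batch is presumably a JSON-RPC batch of `eth_getBlockByNumber` calls, roughly of this shape; whether full transaction objects are requested is an assumption:)

```json
[
  {"jsonrpc": "2.0", "id": 1,    "method": "eth_getBlockByNumber", "params": ["0x1", true]},
  {"jsonrpc": "2.0", "id": 2,    "method": "eth_getBlockByNumber", "params": ["0x2", true]},
  ...
  {"jsonrpc": "2.0", "id": 1000, "method": "eth_getBlockByNumber", "params": ["0x3e8", true]}
]
```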
I've tried running that, directly on my Mac though rather than in Docker, as it's easier for me to monitor the memory usage. I can see the memory usage for that geth instance does jump up, from around 640 MB to 1.33 GB. However, it does stabilize there, so I don't think there's a memory leak. @fixanoid, if you want to try this, here is a script I created to generate the large RPC call:
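(The original script is not included in this excerpt. As a rough stand-in, a Python sketch that generates one such large batch call could look like this; the endpoint, block range, and use of the `requests` library are assumptions:)

```python
# Hypothetical stand-in for a script that generates one large batch RPC call.
# Endpoint and block range are placeholders.
import requests

RPC_URL = "http://localhost:22000"

def fetch_blocks(start, count, full_tx=True):
    # Build one JSON-RPC batch of eth_getBlockByNumber calls and post it.
    batch = [{"jsonrpc": "2.0", "id": i,
              "method": "eth_getBlockByNumber",
              "params": [hex(start + i), full_tx]}
             for i in range(count)]
    resp = requests.post(RPC_URL, json=batch, timeout=120)
    resp.raise_for_status()
    return resp.json()

# One batch of 1000 blocks, as discussed in the thread.
blocks = fetch_blocks(0, 1000)
print(len(blocks), "responses received")
```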
I run this against a local network where I have around 10,000 blocks.
Thanks! I tested the same and I wasn't able to kill the instance either. In more detail: first, I disabled the problematic ingress route so that the client causing the issue no longer makes the instance crash, and I verified that the problematic instance runs for two hours without crashing. Next, I extended @SatpalSandhu61's script into a Python 2 script that reads all blocks of the chain in 1000-block chunks:
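(The actual Python 2 script is also not included in this excerpt; a rough Python 3 equivalent that walks the whole chain in 1000-block chunks, with a placeholder endpoint, might look like this:)

```python
# Rough sketch: read every block of the chain in 1000-block chunks via batch RPC.
# Placeholder endpoint; the script described in the thread was Python 2.
import requests

RPC_URL = "http://localhost:22000"
CHUNK = 1000

def rpc(method, params):
    r = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                     "method": method, "params": params},
                      timeout=120)
    r.raise_for_status()
    return r.json()["result"]

latest = int(rpc("eth_blockNumber", []), 16)

for start in range(0, latest + 1, CHUNK):
    end = min(start + CHUNK, latest + 1)
    batch = [{"jsonrpc": "2.0", "id": n, "method": "eth_getBlockByNumber",
              "params": [hex(n), True]} for n in range(start, end)]
    resp = requests.post(RPC_URL, json=batch, timeout=120)
    resp.raise_for_status()
    print("fetched blocks", start, "to", end - 1)
```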
I left it running for a while on that chain. As a last step, I will try to write a simple Node.js proxy application to put in between, so I can log the exact requests and convert them into a replayable script.
Hi @MitchK, I was able to reproduce the i/o timeout by making the large batch request described above.
I disabled the timeout in the code for testing, and everything works fine. Is it possible to test it on your side by specifying a large value for the HTTP timeouts, and then tune the values depending on how long the RPC call takes?
Hi @jbhurat. Now, no more error messages show up and the request I mentioned above goes through (between 60 and 70 seconds per request). Still, memory is an issue for us. The graph below shows what happens with a single batch request; the response body of such a batch is also large. Is this intended behavior, such as caching of chain data? If so, is that something we could control/configure? I found this flag, but it is already set to 128 MB by default, which is low enough.
What version of Quorum are you using? The issue description mentioned version 2.2.1, which should have a default cache size of 1024.
Hi @jbhurat! Sorry about that, I quickly checked against my local setup. Yes, this is correct, it is 1024. I can try to lower it and see if it makes any difference.
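(Assuming the flag in question is geth's `--cache` option, lowering it for a test run would look something like this; the value and the rest of the command line are placeholders:)

```sh
# Hypothetical test run with a smaller internal cache (the default discussed above is 1024 MB).
geth --datadir /path/to/datadir --cache 128 ...
```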
Hi @MitchK, how did your test go? Did you get any better results?
Hi @MitchK, closing the ticket for now, as there is no activity. Please reopen or create a new issue if you need any further assistance.
System information
Quorum version: 2.2.1 (edit: upgraded to 2.2.3)
OS & Version: Linux (Docker container, Alpine Linux)
Available memory: 3GiB
Expected behaviour
The Quorum instance should keep running and not use more than 1 GiB of RAM. See the memory profile of one of our healthy instances:
Actual behaviour
Memory usage climbs gradually, and after approximately 3-6 minutes the Quorum Docker container gets OOM killed without any error message. Perhaps a memory leak?
Steps to reproduce the behaviour
It is very hard to reproduce the issue. We run 30+ Quorum test environment instances internally with the exact same configuration; only the one mentioned below keeps failing with these issues.
Geth startup command:
Backtrace
Logs from the last 6 minutes (truncated)