Docker daemon for ERDDAP hosted on AWS keeps crashing #69
Comments
maybe live restore?? |
Okay, testing
I will check back in a few weeks to see if this fixes the issue. Luckily we have plenty of checks hitting this server, so we will know quickly when it breaks. |
To confirm the change was accepted:
|
Do you have access to the docker daemon logs? Also what are the docker and kernel versions? |
I have access to
|
Live Restore seems to be working. From status:
I'll keep this open until 2 months have passed without the daemon crashing. |
Boo... looks like it crashed again.
$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Restarted with:
/usr/local/erddap-gold-standard$ sudo systemctl start docker
/usr/local/erddap-gold-standard$ docker-compose restart
Restarting erddap_gold_standard ... done
/usr/local/erddap-gold-standard$ docker info | grep Live
Live Restore Enabled: true |
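For reference, Docker's live restore option is normally switched on by setting `"live-restore": true` in `/etc/docker/daemon.json` and reloading the daemon. The thread does not show how it was configured on this host, so the sketch below only wraps the `docker info | grep Live` check from the transcript above in a reusable function. The name `live_restore_enabled` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the `docker info` text on stdin
# reports that live restore is enabled.
live_restore_enabled() {
    grep -q 'Live Restore Enabled: true'
}

# Usage (requires a reachable Docker daemon):
#   docker info 2>/dev/null | live_restore_enabled && echo "live restore is on"
```

Note that live restore is meant to keep containers running while the daemon is down. It does not stop the daemon itself from crashing, which matches what was observed here.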
Same frequency as before, sooner, or later? We need to inspect the logs here to see if we can understand what is going on. |
much later - almost 3 months vs a few weeks. I looked at the logs and they are gobbledygook to me 😵 |
Well, maybe that is a (small) win. I never looked into ERDDAP logs, we should probably ask for help here from the experts (Ben, Chris, Shane). |
and it went down again. ~1.5 months |
and down again - that was at least a month. |
We should check the logs, try to investigate further on what may be going on. |
Do you have metrics on the memory usage, number of open files, etc on this instance through time? |
Ummm, I'm going to say no. Is this something I could use https://github.com/callumrollo/erddaplogs for? |
There may be useful hints in the ERDDAP logs (max memory usage etc.), but I was referring more to host-level metrics on memory, number of open files, system load, etc. Typically these are collected by an agent running on the host that sends its metrics somewhere for analysis. Probably the easiest since it's the AWS solution: Others: |
See also @rmendels comments here:
|
We may be closer to understanding how to prevent this (unfortunately there is no one simple answer). It has to do both with Java's new memory model, in which a lot of non-heap memory gets used if a lot of child threads are started (often 5GB-10GB more), and with how the OS behaves, which can (and will) start caching all of the file requests. We were seeing this on our system but do not at present, mostly because we added more memory, and it turns out you need a lot. Since then our memory use has stayed pretty constant. And yes, in order to give any advice we need the metrics above, as well as, if possible, the cache memory usage. |
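If no monitoring agent is installed, a rough stand-in for the host-level metrics requested in this thread (available memory, page cache, open file handles, load) is to sample `/proc` on a schedule. The following is a minimal sketch for a Linux host, not a replacement for a proper agent. The function name, the CSV layout, and the log path in the usage comment are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Print one CSV row of host-level metrics (Linux only, reads /proc):
#   timestamp,MemAvailable_kB,Cached_kB,open_file_handles,load_1min
# The optional arguments override the /proc paths.
sample_host_metrics() {
    local meminfo="${1:-/proc/meminfo}"
    local file_nr="${2:-/proc/sys/fs/file-nr}"
    local loadavg="${3:-/proc/loadavg}"
    local avail cached open_files load1

    avail=$(awk '$1 == "MemAvailable:" {print $2}' "$meminfo")
    cached=$(awk '$1 == "Cached:" {print $2}' "$meminfo")
    open_files=$(awk '{print $1}' "$file_nr")   # first field: allocated handles
    load1=$(awk '{print $1}' "$loadavg")        # first field: 1-minute load

    printf '%s,%s,%s,%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$avail" "$cached" "$open_files" "$load1"
}

# Usage, e.g. called once a minute from a cron-run script (hypothetical path):
#   sample_host_metrics >> /var/log/host-metrics.csv
```

A few weeks of these rows leading up to a crash would show whether available memory shrinks while the cache grows, which is the pattern described above.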
@MatthewBiddle, if running on a Linux system, check |
Also check uptime on the box. Memory exhaustion can in some cases lead to a system seizing up entirely, requiring a restart. If the Docker daemon is not enabled on system startup and the host exhausts memory and has to be restarted, the daemon would not start on its own. If you have However, if you do have the memory exhaustion going on, it would be much simpler to reproduce with the datasets you have vs some larger systems such as the IOOS Glider DAC. |
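The exact commands suggested in the two comments above were lost from this page, so the following is not a reconstruction of them. It is a sketch of one way to check the three things mentioned: when the host last booted, whether the Docker unit starts at boot, and whether the kernel log shows out-of-memory kills. It assumes a systemd-based Linux host, and `count_oom_events` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count kernel log lines on stdin that look like
# OOM-killer activity. Always exits 0, printing 0 when nothing matches.
count_oom_events() {
    grep -ciE 'out of memory|oom-kill' || true
}

# Usage on a systemd host:
#   uptime -s                      # when the host last booted
#   systemctl is-enabled docker    # "enabled" means the daemon starts at boot
#   journalctl -k --no-pager | count_oom_events
```

`journalctl -k` only covers the current boot, so if the host restarted after the crash, the relevant messages would be in an earlier boot's journal, provided persistent journaling is turned on.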
@benjwadams @srstsavage I have twice requested from you the information @srstsavage points to. Without that information we can't be of much help. It turns out that for a heavily used ERDDAP with a lot of files or a lot of aggregated files, the required memory can be quite high (not heap, but total memory). There are some settings that can help, but without that information to see what is filling up and how, there is not much more we can do. As I said, our ERDDAP is now running with no memory problems, but it needs a lot of memory, and swapping must be avoided. |
And it's down again. Looking at the ERDDAP email logs leading up to the point at which it crashed (2024-10-12), I don't see a memory issue happening:
$ grep "OS info" emailLog2024-10*
emailLog2024-10-01.txt:OS info: totalCPULoad=0.071428165 processCPULoad=0.0012237673 totalMemory=7834MB freeMemory=624MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-02.txt:OS info: totalCPULoad=0.06577647 processCPULoad=0.0012330246 totalMemory=7834MB freeMemory=622MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-03.txt:OS info: totalCPULoad=0.06445922 processCPULoad=0.0011969458 totalMemory=7834MB freeMemory=649MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-04.txt:OS info: totalCPULoad=0.0764798 processCPULoad=0.0013039889 totalMemory=7834MB freeMemory=590MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-05.txt:OS info: totalCPULoad=0.06200041 processCPULoad=0.0013341357 totalMemory=7834MB freeMemory=532MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-06.txt:OS info: totalCPULoad=0.17571415 processCPULoad=0.0012315342 totalMemory=7834MB freeMemory=592MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-07.txt:OS info: totalCPULoad=0.059477385 processCPULoad=0.0012016778 totalMemory=7834MB freeMemory=588MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-08.txt:OS info: totalCPULoad=0.069818884 processCPULoad=0.0012112252 totalMemory=7834MB freeMemory=538MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-09.txt:OS info: totalCPULoad=0.06342337 processCPULoad=0.0011966674 totalMemory=7834MB freeMemory=590MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-10.txt:OS info: totalCPULoad=0.07082052 processCPULoad=0.0012226121 totalMemory=7834MB freeMemory=545MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-11.txt:OS info: totalCPULoad=0.06763653 processCPULoad=0.0011976592 totalMemory=7834MB freeMemory=560MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-12.txt:OS info: totalCPULoad=0.06363476 processCPULoad=0.0012607691 totalMemory=7834MB freeMemory=700MB totalSwapSpace=0MB freeSwapSpace=0MB
I'm looking at |
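To make the trend in the `OS info` lines easier to scan, the `freeMemory` value can be pulled out per log file. This is a small sketch that assumes the line layout shown in the `grep` output above. `free_memory_by_day` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Read `grep "OS info"` output on stdin and print "<log file> <freeMemory in MB>"
# for every line that carries a freeMemory= field.
free_memory_by_day() {
    awk '{
        src = $1
        sub(/:OS$/, "", src)          # "emailLog2024-10-01.txt:OS" -> file name
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^freeMemory=/) {
                v = $i
                sub(/^freeMemory=/, "", v)
                sub(/MB$/, "", v)
                print src, v
            }
        }
    }'
}

# Usage:
#   grep "OS info" emailLog2024-10* | free_memory_by_day
```

The listing shows one sample per daily log, so a short-lived spike between reports would not appear in it. That is one reason the host-level metrics requested earlier in the thread are still worth collecting.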
When running this erddap-gold-standard on AWS, every few weeks the docker daemon for the erddap-gold-standard docker deployment crashes.
It's a simple fix to get it up and running again using:
I'm curious if other folks have experienced this before with an ERDDAP deployed using Docker on AWS??
I've discussed this with @patrick-tripp, and the current workaround would be to set up a cron job that checks the URL and restarts Docker if the check fails.
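A minimal sketch of that cron workaround might look like the following. The restart commands are the ones used elsewhere in this thread. The function names, the 30-second timeout, the example URL, and the file paths in the usage comment are placeholders, and the script is assumed to run with enough privileges to start Docker (for example from root's crontab, where `sudo` is redundant).

```shell
#!/usr/bin/env bash
# Watchdog sketch: probe a URL and restart Docker plus the compose stack
# when the probe fails. All names and defaults here are illustrative.

# Return 0 when the URL answers with a successful HTTP status within 30 s.
url_is_up() {
    curl --silent --fail --max-time 30 --output /dev/null "$1"
}

# Restart sequence taken from the commands used in this thread.
restart_stack() {
    sudo systemctl start docker &&
        (cd /usr/local/erddap-gold-standard && docker-compose restart)
}

# Probe the URL; on failure, print a log line and run the restart.
watchdog() {
    local url="$1"
    if url_is_up "$url"; then
        return 0
    fi
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $url unreachable, restarting docker"
    restart_stack
}

# Usage: end the script with a call such as
#   watchdog "https://example.org/erddap/index.html"
# and schedule it, e.g. every 10 minutes:
#   */10 * * * * /usr/local/bin/erddap-watchdog.sh >> /var/log/erddap-watchdog.log 2>&1
```

This only masks the crash. The logged timestamps would at least give an exact record of when each outage happened, which helps when lining them up against memory metrics.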
cc: @mwengren, @ocefpaf, @patrick-tripp.