Docker daemon for ERDDAP hosted on AWS keeps crashing #69

Open
MathewBiddle opened this issue Apr 17, 2024 · 22 comments

Comments

@MathewBiddle
Contributor

When running this erddap-gold-standard deployment on AWS, the Docker daemon for the deployment crashes every few weeks.

$ docker-compose restart
ERROR: Couldn't connect to Docker daemon at http+docker://localhost - is it running?
If it's at a non-standard location, specify the URL with the DOCKER_HOST environment variable.

Getting it up and running again is a simple fix:

$ sudo systemctl start docker
$ docker-compose restart

I'm curious if other folks have experienced this before with an ERDDAP deployed using Docker on AWS?

I've discussed this with @patrick-tripp, and the current workaround would be to set up a cronjob that checks the URL and, if it fails, restarts Docker; a minimal sketch is at the end of this comment.

cc: @mwengren, @ocefpaf, @patrick-tripp.
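
A minimal sketch of that workaround, assuming ERDDAP answers on http://localhost/erddap/status.html and the compose file lives in /usr/local/erddap-gold-standard (the script path and the 15-minute interval are arbitrary choices, not anything this repo ships):

#!/bin/bash
# /usr/local/bin/check_erddap.sh (hypothetical path)
# If the status page stops answering, bring the Docker daemon and the stack back up.
if ! curl -sf --max-time 30 http://localhost/erddap/status.html > /dev/null; then
    systemctl start docker
    cd /usr/local/erddap-gold-standard && docker-compose restart
fi

# root crontab entry, checking every 15 minutes
*/15 * * * * /usr/local/bin/check_erddap.sh >> /var/log/check_erddap.log 2>&1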

@MathewBiddle
Contributor Author

Maybe live restore?

https://docs.docker.com/config/containers/live-restore/

@MathewBiddle
Contributor Author

Okay, testing live-restore:

$ more /etc/docker/daemon.json
{
        "live-restore": true
}
$ sudo systemctl start docker
$ docker ps
CONTAINER ID   IMAGE                                    COMMAND                  CREATED      STATUS         PORTS                                                                            NAMES
ec3b94b319fe   axiom/docker-erddap:2.23-jdk17-openjdk   "/entrypoint.sh cata…"   7 days ago   Up 5 seconds   0.0.0.0:80->8080/tcp, :::80->8080/tcp, 0.0.0.0:443->8443/tcp, :::443->8443/tcp   erddap_gold_standard

I will check back in a few weeks to see if this fixes the issue. Luckily we have plenty of checks hitting this server, so we will know quickly when it breaks.

@MathewBiddle
Contributor Author

To confirm the change was accepted:

$ docker info | grep Live
 Live Restore Enabled: true

@srstsavage
Contributor

Do you have access to the docker daemon logs? Also what are the docker and kernel versions?

@MathewBiddle
Contributor Author

Do you have access to the docker daemon logs?

I have access to /var/log which has a few messages files. I think those are the logs as documented here.
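
For what it's worth, on a systemd host the daemon's own logs can also be pulled straight from the journal (assuming the stock docker.service unit name); a quick check might look like:

$ sudo journalctl -u docker.service --since "1 week ago" --no-pager | tail -n 50
$ sudo grep dockerd /var/log/messages | tail -n 50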

Also what are the docker and kernel versions?

$ docker --version
Docker version 20.10.25, build b82b9f3
$ uname -sr
Linux 5.10.210-201.852.amzn2.x86_64

@MathewBiddle
Contributor Author

Live Restore seems to be working. From the ERDDAP status page:

Current time is 2024-05-06T15:44:10+00:00
Startup was at  2024-04-17T13:24:27+00:00

I'll keep this open until 2 months have passed without the daemon crashing.

@MathewBiddle
Contributor Author

Boo... looks like it crashed again.

$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Restarted with:

/usr/local/erddap-gold-standard$ sudo systemctl start docker
/usr/local/erddap-gold-standard$ docker-compose restart
Restarting erddap_gold_standard ... done
/usr/local/erddap-gold-standard$ docker info | grep Live
 Live Restore Enabled: true

@ocefpaf
Member

ocefpaf commented Jun 11, 2024

Boo... looks like it crashed again.

Same frequency as before, sooner, or later? We need to inspect the logs here to see if we can understand what is going on.

@MathewBiddle
Contributor Author

much later - almost 3 months vs a few weeks. I looked at the logs and they are gobbledygook to me 😵

@ocefpaf
Member

ocefpaf commented Jun 11, 2024

much later - almost 3 months vs a few weeks. I looked at the logs and they are gobbledygook to me 😵

Well, maybe that is a (small) win. I never looked into ERDDAP logs; we should probably ask the experts here for help (Ben, Chris, Shane).

@MathewBiddle
Contributor Author

And it went down again, after ~1.5 months.

@MathewBiddle
Contributor Author

And down again; that one lasted at least a month.

@ocefpaf
Member

ocefpaf commented Aug 23, 2024

We should check the logs and try to investigate further into what may be going on.

@srstsavage
Contributor

Do you have metrics on the memory usage, number of open files, etc. on this instance over time?

@MathewBiddle
Contributor Author

Ummm, I'm going to say no. Is this something I could use https://github.com/callumrollo/erddaplogs for?

@srstsavage
Contributor

There may be useful hints in the ERDDAP logs (max memory usage, etc.), but I was referring more to host-level metrics on memory, number of open files, system load, etc. Typically this is collected by an agent running on the host that sends its metrics somewhere for analysis.

Probably the easiest, since it's the AWS solution:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html

Others:
https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/
https://www.netdata.cloud/
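
A lighter-weight alternative, assuming an Amazon Linux 2 host where sysstat is available from the standard repos, is to let sysstat record host metrics on its default schedule (a sketch, not a recommendation over the agents linked above):

$ sudo yum install -y sysstat
$ sudo systemctl enable --now sysstat
$ sar -r 1 5      # memory utilization, five 1-second samples
$ sar -S 1 5      # swap utilization
$ sar -q 1 5      # run queue and load average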

@srstsavage
Contributor

See also @rmendels comments here:

ERDDAP/erddap#185 (comment)

[...] monitor the usage of the following:

  • heap space use
  • metaspace use
  • total java memory use (and the total memory available)
  • swap space use (and total swap space available)
  • number of threads running under java (I find the number given by, say, visualvm is not as good as that given by btop)

A detailed time series isn't as important as likely maximum values of each and some idea of how much they fluctuate. Particularly for total memory use, you need to have the ERDDAP completely loaded and running for a bit to get a feel for the total java memory and number of threads.
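
A rough way to spot-check those numbers from the host, assuming the JDK's jcmd/jstat tools are present in the axiom/docker-erddap image and pgrep is available on the host (a sketch only; <pid> is a placeholder for the id jcmd reports):

$ docker exec erddap_gold_standard jcmd                  # list the JVM and its in-container pid
$ docker exec erddap_gold_standard jstat -gc <pid> 5000  # heap and metaspace usage, sampled every 5 s
$ ps -o pid,rss,nlwp -p $(pgrep -f org.apache.catalina)  # total resident memory (kB) and thread count, from the host
$ free -m                                                # total memory and swap on the box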

@rmendels

We may be closer to understanding how to prevent this (unfortunately there is no one simple answer). It has to do both with Java's new memory model, in that a lot of non-heap memory gets used if a lot of child threads are started (often 5GB-10GB more), and with how the OS behaves, which can (and will) start caching all of the file requests. We were seeing this on our system but do not presently, mostly because we added more memory, and it turns out you need a lot. Since then our memory use has stayed pretty constant. And yes, in order to give any advice we need the metrics above, as well as, if possible, the cache memory usage.
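
For the cache usage piece, the kernel's page cache can be read directly on the host; for example:

$ free -m                                               # "buff/cache" is the page cache built up from file reads
$ grep -E 'MemAvailable|Buffers|^Cached' /proc/meminfo  # same numbers in kB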

@benjwadams

@MathewBiddle, if running on a Linux system, check the journalctl logs; you may see log entries from OOMKiller killing Docker or the Java process, which could be a case of the ERDDAP memory issue I've reported and that @srstsavage has also posted about in ERDDAP/erddap#185.
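
A quick way to check for that, assuming a systemd journal and a readable kernel ring buffer:

$ sudo journalctl -k --since "1 week ago" | grep -iE 'out of memory|oom|killed process'
$ sudo dmesg -T | grep -iE 'out of memory|oom'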

@benjwadams

Also check the uptime on the box. Memory exhaustion can in some cases lead to a system seizing up entirely, requiring a restart. If you don't have the Docker daemon enabled on system startup and memory is exhausted, it would not come back up after a reboot. If you have sar on your system, it can report historical memory usage over time without setting up other tools or CloudWatch. Unfortunately, last time I checked, systemd doesn't report logs from before the last startup without extra configuration, so the above may not work as posted.
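
If sysstat is collecting (see the note earlier in this thread), historical memory for a given day can be pulled from the daily files; a sketch, assuming the default /var/log/sa location and saDD day-of-month naming:

$ sar -r -f /var/log/sa/sa12   # memory usage recorded on the 12th
$ sar -S -f /var/log/sa/sa12   # swap usage from the same day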

However, if you do have memory exhaustion going on, it would be much simpler to reproduce with the datasets you have than with some larger systems such as the IOOS Glider DAC.

@rmendels

@benjwadams @srstsavage I have twice requested from you the information @srstsavage points to. Without that information we can't be of much help. It turns out that for a heavily used ERDDAP with a lot of files or a lot of aggregated files, the required memory can be quite high (not heap, but total memory). There are some settings that can help, but without that information to see what is filling up and how, there is not much more we can do. As I said, our ERDDAP is now running with no memory problems, but it needs a lot of memory, and swapping must be avoided.

@MathewBiddle
Contributor Author

And it's down again.

Looking at the ERDDAP email logs leading up to the point at which it crashed (2024-10-12), I don't see a memory issue happening:

$ grep "OS info" emailLog2024-10*
emailLog2024-10-01.txt:OS info: totalCPULoad=0.071428165 processCPULoad=0.0012237673 totalMemory=7834MB freeMemory=624MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-02.txt:OS info: totalCPULoad=0.06577647 processCPULoad=0.0012330246 totalMemory=7834MB freeMemory=622MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-03.txt:OS info: totalCPULoad=0.06445922 processCPULoad=0.0011969458 totalMemory=7834MB freeMemory=649MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-04.txt:OS info: totalCPULoad=0.0764798 processCPULoad=0.0013039889 totalMemory=7834MB freeMemory=590MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-05.txt:OS info: totalCPULoad=0.06200041 processCPULoad=0.0013341357 totalMemory=7834MB freeMemory=532MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-06.txt:OS info: totalCPULoad=0.17571415 processCPULoad=0.0012315342 totalMemory=7834MB freeMemory=592MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-07.txt:OS info: totalCPULoad=0.059477385 processCPULoad=0.0012016778 totalMemory=7834MB freeMemory=588MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-08.txt:OS info: totalCPULoad=0.069818884 processCPULoad=0.0012112252 totalMemory=7834MB freeMemory=538MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-09.txt:OS info: totalCPULoad=0.06342337 processCPULoad=0.0011966674 totalMemory=7834MB freeMemory=590MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-10.txt:OS info: totalCPULoad=0.07082052 processCPULoad=0.0012226121 totalMemory=7834MB freeMemory=545MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-11.txt:OS info: totalCPULoad=0.06763653 processCPULoad=0.0011976592 totalMemory=7834MB freeMemory=560MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-12.txt:OS info: totalCPULoad=0.06363476 processCPULoad=0.0012607691 totalMemory=7834MB freeMemory=700MB totalSwapSpace=0MB freeSwapSpace=0MB

I'm looking at journalctl logs but not seeing anything for OOMKiller.
