[Bug]: opensearch_node container does not restart on error #90

Closed
tuotempo opened this issue Sep 5, 2022 · 12 comments
Labels
bug Something isn't working

Comments

@tuotempo

tuotempo commented Sep 5, 2022

Describe the bug

I am running an OpenSearch Docker container that sometimes exits with code 0 even though the logs show errors.

To reproduce

I created the container with the following Java options in Env:
"OPENSEARCH_JAVA_OPTS=-Xms8m -Xmx8m"

so that the java.lang.OutOfMemoryError: Java heap space is triggered almost immediately after the container starts.

Expected behavior

I have set the container's restart policy to on-failure, so I expect the container to restart automatically in those situations.
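
Editorial note: Docker's on-failure policy restarts a container only when it exits with a non-zero code, so an exit code of 0 after a fatal error is treated as a clean stop. A minimal sketch for checking this on a given container, assuming hypothetical helper names `restart_verdict` and `container_exit_info` (only `docker inspect --format` and its template fields come from Docker itself):

```shell
#!/bin/sh
# Hypothetical helper: say what the on-failure policy does with an exit code.
restart_verdict() {
    # $1: exit code Docker recorded for the container
    if [ "$1" -ne 0 ]; then
        echo "restart"
    else
        echo "no-restart"
    fi
}

# Hypothetical helper: print "<policy name> <last exit code>" for a container.
container_exit_info() {
    # $1: container name or ID
    docker inspect \
        --format '{{.HostConfig.RestartPolicy.Name}} {{.State.ExitCode}}' "$1"
}
```

For the container in this report, `container_exit_info opensearch_node` would be expected to print `on-failure 0`, and `restart_verdict 0` prints `no-restart`, which matches the behavior described.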

Screenshots

No response

Host / Environment

OS: Ubuntu
Version 18.04.4 LTS

Additional context

output of docker ps -a --no-trunc:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5460ce72093cadb1c49331f54c7cc7c0d041174f668aeda157efc199e6938fef opensearchproject/opensearch:1.3.3 "./opensearch-docker-entrypoint.sh opensearch" 7 minutes ago Exited (0) 6 minutes ago

Referring to issue 2143 (which is closed but not solved), the problem is still present in version 1.3.3.

Relevant log output

[2022-07-26T13:31:22,818][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [lb-logs-test] fatal error in thread [main], exiting
java.lang.OutOfMemoryError: Java heap space
    at java.util.jar.JarFile.lambda$entries$0(JarFile.java:531) ~[?:?]
    at java.util.jar.JarFile$$Lambda$218/0x00000001001bb840.apply(Unknown Source) ~[?:?]
    at java.util.zip.ZipFile.getZipEntry(ZipFile.java:676) ~[?:?]
    at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:531) ~[?:?]
    at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:519) ~[?:?]
    at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:495) ~[?:?]
    at org.opensearch.bootstrap.JarHell.checkJarHell(JarHell.java:208) ~[opensearch-core-1.3.3.jar:1.3.3]
    at org.opensearch.plugins.PluginsService.checkBundleJarHell(PluginsService.java:676) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.plugins.PluginsService.loadBundles(PluginsService.java:528) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.plugins.PluginsService.<init>(PluginsService.java:193) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.node.Node.<init>(Node.java:396) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.node.Node.<init>(Node.java:319) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:412) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:178) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:169) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:100) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-1.3.3.jar:1.3.3]
    at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-1.3.3.jar:1.3.3]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:135) ~[opensearch-1.3.3.jar:1.3.3]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:101) ~[opensearch-1.3.3.jar:1.3.3]
Killing performance analyzer process 11
OpenSearch exited with code 127
Performance analyzer exited with code 143
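
Editorial note: the log reports that OpenSearch exited with code 127, yet `docker ps` shows the container as Exited (0), so the entrypoint's own exit status is not the one reported for the OpenSearch process. The sketch below is not the project's entrypoint script; it is a hypothetical illustration (the function name `run_and_propagate` and the stand-in commands are assumptions) of a wrapper that starts a main process and a helper process and returns the main process's status:

```shell
#!/bin/sh
# Hypothetical sketch: run a main process plus a helper, then return the
# main process's exit status so a restart policy can act on it.
run_and_propagate() {
    main_cmd=$1      # stand-in for the OpenSearch process
    helper_cmd=$2    # stand-in for the performance analyzer process

    sh -c "$helper_cmd" &
    helper_pid=$!
    sh -c "$main_cmd" &
    main_pid=$!

    # Record the main process's status before doing any cleanup.
    main_status=0
    wait "$main_pid" || main_status=$?

    # Stop the helper; its own exit status is deliberately ignored.
    kill "$helper_pid" 2>/dev/null || true
    wait "$helper_pid" 2>/dev/null || true

    echo "main process exited with code $main_status"
    return "$main_status"
}
```

An entrypoint built this way would end with something like `run_and_propagate "$main" "$helper"; exit $?`, so that a fatal error in the main process yields a non-zero container exit code.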

@tuotempo added the bug and untriaged labels Sep 5, 2022
@dblock transferred this issue from opensearch-project/opensearch-build Sep 5, 2022
@prudhvigodithi removed the untriaged label Sep 20, 2022

@prudhvigodithi
Member

Hey @tuotempo, can you please share the full docker run command you used?
Also, can you try this sample compose file: https://opensearch.org/samples/docker-compose.yml?

@tuotempo
Author

tuotempo commented Sep 29, 2022

Hello,
here is the docker run command I used:

docker run -d -p '9200:9200' -p '9300:9300' -p '9600:9600' --name opensearch_node --restart on-failure -h opensearchnode \
--network opensearch_net \
--privileged \
-v /etc/opensearch/esnode.pem:/usr/share/opensearch/config/esnode.pem \
-v /etc/opensearch/esnode-key.pem:/usr/share/opensearch/config/esnode-key.pem \
-v /etc/opensearch/root-ca.pem:/usr/share/opensearch/config/root-ca.pem \
-v /etc/opensearch/kirk.pem:/usr/share/opensearch/config/kirk.pem \
-v /etc/opensearch/kirk-key.pem:/usr/share/opensearch/config/kirk-key.pem \
-v /etc/opensearch/opensearch.yml:/usr/share/opensearch/config/opensearch.yml \
-v /etc/opensearch/plugin.security_conf.yml:/usr/share/opensearch/plugins/opensearch-security/securityconfig/config.yml \
-v /etc/opensearch/jvm.options:/usr/share/opensearch/config/jvm.options \
-v /opt/opensearch:/usr/share/opensearch/data \
-e "OPENSEARCH_JAVA_OPTS=-Xms4g -Xmx4g" \
-e "DISABLE_INSTALL_DEMO_CONFIG=true" \
opensearchproject/opensearch:1.3.3

Beforehand, I created a network with docker network create opensearch_net.

I also tried your sample with docker compose up -d: the containers start, but opensearch-node1 and opensearch-node2 exit after about one minute, while opensearch-dashboards keeps running. Here is the log I get:

ERROR: [1] bootstrap checks failed
[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
ERROR: OpenSearch did not exit normally - check the logs at /usr/share/opensearch/logs/opensearch-cluster.log
[2022-09-29T13:28:46,639][INFO ][o.o.s.a.r.AuditMessageRouter] [opensearch-node1] Closing AuditMessageRouter
[2022-09-29T13:28:46,640][INFO ][o.o.s.a.s.SinkProvider   ] [opensearch-node1] Closing InternalOpenSearchSink
[2022-09-29T13:28:46,641][INFO ][o.o.s.a.s.SinkProvider   ] [opensearch-node1] Closing DebugSink
[2022-09-29T13:28:46,647][INFO ][o.o.n.Node               ] [opensearch-node1] stopping ...
[2022-09-29T13:28:46,666][INFO ][o.o.n.Node               ] [opensearch-node1] stopped
[2022-09-29T13:28:46,667][INFO ][o.o.n.Node               ] [opensearch-node1] closing ...
[2022-09-29T13:28:46,682][INFO ][o.o.s.a.i.AuditLogImpl   ] [opensearch-node1] Closing AuditLogImpl
[2022-09-29T13:28:46,687][INFO ][o.o.n.Node               ] [opensearch-node1] closed
Killing performance analyzer process 103
OpenSearch exited with code 78
Performance analyzer exited with code 143
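
Editorial note: the exit codes in these logs follow common conventions. A status above 128 conventionally means the process was terminated by signal number (status - 128), so 143 corresponds to signal 15 (SIGTERM), which is consistent with the performance analyzer being killed by the entrypoint. The helper below is a hypothetical sketch (the name `describe_exit_code` is an assumption) that applies this arithmetic:

```shell
#!/bin/sh
# Hypothetical helper: give a conventional reading of a process exit status.
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ]; then
        # Shell convention: 128 + N means "terminated by signal N".
        sig=$((code - 128))
        echo "killed by signal $sig (SIG$(kill -l "$sig" 2>/dev/null))"
    else
        echo "application error $code"
    fi
}
```

For the values above, `describe_exit_code 143` reports signal 15, while 78 is an ordinary non-zero application status, here emitted after the failed bootstrap check.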

And the two containers remain stopped.

Regards.

@peterzhuamazon
Member

peterzhuamazon commented Sep 29, 2022

Hi @tuotempo,

[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

That means you need to increase the max map count value:

sudo sysctl -w vm.max_map_count=262144
ulimit -n 65535
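
Editorial note: a setting applied with `sysctl -w` lasts only until the next reboot, and `ulimit -n` affects only the current shell session, so it can be useful to check the live value before starting the containers. A minimal sketch, assuming a Linux host and a hypothetical helper name `map_count_ok`:

```shell
#!/bin/sh
REQUIRED_MAP_COUNT=262144   # minimum reported by the bootstrap check

# Hypothetical helper: succeed if the given value meets the minimum.
map_count_ok() {
    [ "$1" -ge "$REQUIRED_MAP_COUNT" ]
}

# Read the live value on Linux; fall back to 0 if the file is absent.
current=$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)
if map_count_ok "$current"; then
    echo "vm.max_map_count=$current is sufficient"
else
    echo "vm.max_map_count=$current is too low; need at least $REQUIRED_MAP_COUNT"
fi
```

To keep the setting across reboots, the line `vm.max_map_count=262144` is usually added to /etc/sysctl.conf or a file under /etc/sysctl.d.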

You can follow this readme for more information on the Docker settings:
https://github.com/opensearch-project/opensearch-build/blob/main/scripts/README.md#run-deployment-script

Thanks.

@tuotempo
Author

tuotempo commented Oct 5, 2022

Hi @peterzhuamazon,
thanks for the hint. I can confirm that with it, the opensearch-node1 and opensearch-node2 containers no longer exit.

But, as stated in my first post, the issue is that my opensearch-node container sometimes stops, and the on-failure restart policy has no effect because the exit code is 0. As stated in issue 2143, the problem may be in the terminateProcesses function.

Thank you very much for your support.

@tuotempo
Author

Hello @prudhvigodithi,
is there any news about this issue?

Thanks again.

@prudhvigodithi
Member

prudhvigodithi commented Oct 19, 2022

Hey @tuotempo, it would be great if you could help us find the exact cause of the termination. Is it still the JVM?
Also check opensearch-project/opensearch-js#304 for a retry implementation; it might be helpful here.
Thank you.
@bbarani @dblock

@peterzhuamazon
Member

peterzhuamazon commented Oct 19, 2022

terminateProcesses works this way by design: if there is a failure, it kills both the OpenSearch (OS) and Performance Analyzer (PA) processes and stops the container,
so that a replacement can be brought up by Kube or Docker Swarm.

@tuotempo
Author

Hello,
sorry for the delay. I have created a fork to test some modifications; you can see it here. I have basically set different traps for different signals and added a cleanup function that simply prints which signal was involved. Do you think this approach makes sense?
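
Editorial note: the fork itself is not reproduced in this thread, so the following is only a hypothetical sketch of the approach described, with assumed names `cleanup` and `install_traps`: one trap per signal, each calling a handler that prints the signal that fired.

```shell
#!/bin/sh
# Hypothetical handler: report which signal triggered the cleanup.
# A real entrypoint would also stop its child processes and choose an
# exit status here; this sketch only prints, as described above.
cleanup() {
    echo "cleanup triggered by SIG$1"
}

# Install one trap per signal so the handler knows which one fired.
install_traps() {
    for sig in TERM INT HUP; do
        # The signal name is expanded now, on purpose, so each trap
        # passes its own name to cleanup.
        trap "cleanup $sig" "$sig"
    done
}
```

After `install_traps`, sending SIGTERM to the script prints `cleanup triggered by SIGTERM`.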

@prudhvigodithi the cause of termination was an OutOfMemoryError, as stated before, and it happens randomly on my production systems.

@peterzhuamazon Regarding terminateProcesses being by design: I am not using Kube or Docker Swarm, just a plain container, so it would be useful to have the restart policy working.

Thanks again for your support.

@peterzhuamazon
Member

peterzhuamazon commented Nov 17, 2022

Hi, we are in the process of removing this dependency now.

Thanks.

@jordarlu
Contributor

jordarlu commented Feb 14, 2023

Hello @tuotempo, just want to update you that the activity mentioned above, opensearch-project/opensearch-build#2876, was completed some time back. We would love to get your feedback on whether it resolved the issue (by disabling the PA in your case/env), or whether you still see the problem. Thanks!

@tuotempo
Author

Hi @jordarlu, I can confirm that with PA disabled, the exit code of the OpenSearch process is correctly propagated to Docker and the restart policy works as expected.

@jordarlu
Contributor

Thanks a lot for the prompt reply @tuotempo, I really appreciate it! I will close this issue now; please keep letting us know if you find any other issues. Thank you.
