Crawler is crashing in some container environments #2946
Comments
Perhaps another solution would be to add … Also, what happens to Selenium when we disable /dev/shm usage?
Do you mean users would need to run …?
There is a pretty good discussion on the issues the flag can cause:
Selenium will not change any behavior; the behavior will be changed only in Chrome (via the `--disable-dev-shm-usage` command-line flag).
Chrome will try to write to /dev/shm if no data and temp directory is provided. Without proper permissions, it will crash. Furthermore, this directory needs to have enough space available (it is backed by RAM). There is a discussion about similar issues at this link:
When you disable it, you are basically telling Chrome to write temporary data (e.g., profile, cache) to disk (/tmp, or a directory passed as a parameter), not to shared RAM. In most container environments, you can't write to shared RAM by default. Since Haystack is already a RAM-heavy framework when doing NLP, it may be better to stop the Crawler from using RAM and the GPU, so other NLP tasks on the same machine won't be affected; Chrome will use the disk instead. The downsides of disabling /dev/shm writing apply to testing scenarios (see the quote below) such as Selenium Grid, where many processes and tabs run in a dedicated testing environment, so disk I/O may increase. In the Haystack scenario, using RAM or GPU is, from my perspective, not the preferred option.
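As a quick sanity check, you can inspect /dev/shm from Python before blaming anything else. This is a Linux-only sketch (note that Docker defaults /dev/shm to 64 MiB unless started with a larger `--shm-size`):

```python
import os
import shutil

# /dev/shm is a RAM-backed tmpfs; Chrome needs it writable and big enough.
writable = os.access("/dev/shm", os.W_OK)
total, used, free = shutil.disk_usage("/dev/shm")
print(f"writable={writable}, free={free / 2**20:.0f} MiB")
```

If this prints `writable=False` or only a few MiB free, Chrome crashes without `--disable-dev-shm-usage` are expected.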
Regarding the GPU option, you can see the Puppeteer devs' comment in the same referenced issue:
Yet in the same referenced issue, many other issues referencing it are shown, and most projects have opted to disable shared-memory writing. But there are also problems when writing to disk in heavy testing scenarios, which makes me think about the safest default option, plus the possibility of user-defined options. 🤔 What about allowing the user to pass a list of parameters to be sent to Chrome and, if none are sent, using a predefined set of options, which would be the safest for a scenario where we can't:
And we definitely know that someone using Haystack will be doing NLP, so the chances of RAM/GPU trouble are higher. One last point: GPU instances are expensive in the cloud; there is no need to let a simple Crawler utility use one.
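To make the proposal concrete, here is a minimal sketch of what such an interface could look like. The `chrome_options` parameter and the `make_chrome_driver` helper are hypothetical illustrations, not the actual Haystack API; the real defaults would be chosen in the PR:

```python
from selenium import webdriver

# Hypothetical safe defaults for locked-down container environments.
SAFE_DEFAULT_OPTIONS = ["--headless", "--disable-dev-shm-usage", "--disable-gpu"]

def make_chrome_driver(chrome_options=None):
    """Build a Chrome WebDriver, falling back to container-safe defaults."""
    options = webdriver.ChromeOptions()
    for arg in (chrome_options if chrome_options is not None else SAFE_DEFAULT_OPTIONS):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)

# Users with special requirements can override the defaults entirely, e.g.:
# driver = make_chrome_driver(["--headless", "--user-data-dir=/tmp/chrome-profile"])
```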
Hey @sjrl and @danielbichuetti - As far as I can tell, this issue is related to PR #2921, which has been closed. Since the PR was merged, I will close this issue. Let me know if that's wrong and I will re-open it.
@TuanaCelik Hi! This issue is not related to #2921. I found it after that, while doing tests on Azure ML notebooks, where it won't work by default. Furthermore, it won't work in other Docker container environments due to shm access not being disabled. This issue is about how the Chrome driver is set up in the Crawler init (the options being sent to the Chrome process command line).
@danielbichuetti - Got it, thanks for the context. I've re-opened it now :)
@danielbichuetti I agree this sounds like the best idea. Would you be willing to go ahead and open a PR for this? And for the predefined set of options, I think making it clear in the documentation which ones are being used would be very helpful.
@sjrl Sure! I will open the PR. Thank you!
Describe the bug
Crawler is crashing when running on some containers, e.g., Azure ML and OpenShift.
Error message
A WebDriverException is raised: the Chrome driver detects that Chrome stopped responding and reports that it probably crashed.
Expected behavior
Crawler should run without issues in containers.
Investigation
After some investigation, this appears to be the same error that occurred in the Google Colab environment. Without the proper flags, Chrome will try to use /dev/shm, and in most container images there is no permission for that (unless you set it up).
Looking at the Selenium repository where they store their Dockerfiles shows their "recommended" setup for a container image. Some important files:
The base image is the most important regarding crashes in general. Disabling shm, as in the Colab code, makes it unnecessary to set up shm permissions in some environments.
Furthermore, I should note that Chrome sometimes crashes if you don't disable audio and pulseaudio is not installed (this has been tested on an AWS Lambda container). So maybe the Haystack Dockerfile should include it, or the Crawler should disable audio.
When using NVIDIA CUDA images, disabling Chrome's GPU usage also reduced random crashes.
As a side note, using the Chrome option to spawn a single process saves some memory.
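Putting these observations together, a driver configured along the following lines behaved stably in my tests. This is an illustrative sketch of the flags discussed above, not the exact contents of the test branch:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-dev-shm-usage")  # temp data goes to disk, not /dev/shm
options.add_argument("--mute-audio")             # avoid crashes where pulseaudio is absent
options.add_argument("--disable-gpu")            # avoid GPU usage in CUDA images
options.add_argument("--single-process")         # saves some memory
driver = webdriver.Chrome(options=options)
```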
I've made a test branch; running it on Azure ML, OpenShift, EKS and AKS increased stability, with no Crawler crashes. Note: this is just a quick test branch, not the one intended for a PR (if any).
Perhaps the Azure ML settings should be made the default, since Haystack doesn't allow the user to configure Selenium WebDriver options. Setting the safest options would probably be the recommended approach.
To Reproduce
Create a Dockerfile with Haystack (Crawler enabled), use it to build an environment in Azure ML, and try to run the Crawler using Jobs.
FAQ Check
System: