Crawler is crashing in some container environments #2946

Closed
1 task done
danielbichuetti opened this issue Aug 2, 2022 · 9 comments · Fixed by #3043
Labels
topic:crawler type:bug Something isn't working

Comments

@danielbichuetti
Contributor

danielbichuetti commented Aug 2, 2022

Describe the bug
The Crawler crashes when running in some containers, e.g., Azure ML and OpenShift.

Error message
A WebDriverException is raised: the Chrome driver detects that Chrome stopped responding and reports that it probably crashed.

Expected behavior
The Crawler should run without issues in containers.

Investigation
After some investigation, this appears to be a similar error to the one that occurred in the Google Colab environment. Without the proper flags, Chrome will try to use /dev/shm, and most container images don't grant permission for that (unless you set it up explicitly).
The Selenium repository where they store their Dockerfiles shows their "recommended" setup for a container image. Some important files:

The base image is the most important factor regarding crashes in general. Disabling shm, as in the Colab code, makes it unnecessary to set up permissions on shm in some environments.
Furthermore, I should note that Chrome sometimes crashes if you don't disable audio and pulseaudio is not installed (this has been tested on an AWS Lambda container). So maybe the Haystack Dockerfile should include it, or the Crawler should disable audio.
When using NVIDIA CUDA images, disabling Chrome's GPU usage also reduced random crashes.
As a side note, using the Chrome option to spawn a single process saves some memory.
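A minimal sketch of passing the flags discussed above to the Chrome WebDriver via Selenium (illustrative only; this exact set is not Haystack's current default, and other audio-related switches exist besides --mute-audio):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")               # no display available in containers
options.add_argument("--disable-dev-shm-usage")  # don't write temporary data to /dev/shm
options.add_argument("--disable-gpu")            # fewer random crashes on CUDA images
options.add_argument("--mute-audio")             # avoid audio crashes when pulseaudio is missing
options.add_argument("--single-process")         # saves some memory

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```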

I've made a test branch; running it on Azure ML, OpenShift, EKS, and AKS showed increased stability, with no Crawler crashes. Note: this is just a quick test branch, not the one intended for a PR (if there is one).

Perhaps the Azure ML settings should be made the default, since Haystack doesn't allow the user to configure Selenium WebDriver options. Setting the safest options would possibly be the recommended approach.

To Reproduce
Create a Dockerfile with Haystack (Crawler enabled) and use it to build an environment in Azure ML. Try to run the Crawler using Jobs.

FAQ Check

System:

  • OS: Linux
  • GPU/CPU: NVIDIA V100
  • Haystack version (commit or version number): 1.6.1rc
@sjrl
Contributor

sjrl commented Aug 2, 2022

Perhaps the Azure ML settings should be made the default, since Haystack doesn't allow the user to configure Selenium WebDriver options. Setting the safest options would possibly be the recommended approach.

Perhaps another solution would be to add options to the initialization of the Crawler so the user could configure Selenium WebDriver options. I agree we should find a reasonable set of default options, but it is unlikely we can find the "safest" options for all scenarios.
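For illustration, the Crawler's init could accept something like this (the webdriver_options parameter name is hypothetical, just to show the idea):

```python
from haystack.nodes import Crawler

# Hypothetical parameter: extra Chrome flags supplied at construction time.
crawler = Crawler(
    output_dir="crawled_files",
    webdriver_options=["--disable-dev-shm-usage", "--disable-gpu"],  # user-chosen flags
)
```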

Also what happens to Selenium when we disable /dev/shm?

@sjrl
Contributor

sjrl commented Aug 2, 2022

Without the proper flags, Chrome will try to use /dev/shm, and most container images don't grant permission for that (unless you set it up explicitly).

Do you mean users would need to run chmod +x /dev/shm before using the Crawler? Or is there also an issue of permissions where that chmod command could be rejected? We could consider adding a try-except to the Crawler that runs that chmod command and, if it fails, passes the option --disable-dev-shm-usage.
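A rough sketch of that fallback (the mode and path handling here are assumptions, not an existing Haystack mechanism):

```python
import os
from selenium.webdriver.chrome.options import Options

options = Options()
try:
    # Try to make /dev/shm writable; in most containers this requires root
    # and raises PermissionError.
    if not os.access("/dev/shm", os.W_OK):
        os.chmod("/dev/shm", 0o1777)
except OSError:
    # Fall back to letting Chrome write its temporary data to disk instead.
    options.add_argument("--disable-dev-shm-usage")
```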

@sjrl
Contributor

sjrl commented Aug 2, 2022

There is a pretty good discussion of the issues the Chrome webdriver has in Docker environments in general in puppeteer/puppeteer#1834. It seems like there are some cons to having --disable-dev-shm-usage on by default, making me think allowing users to choose what options to pass is the way to go. And to add documentation warning about the common issues that one might face when running the Crawler in a Docker environment.

@danielbichuetti
Contributor Author

danielbichuetti commented Aug 2, 2022

Also what happens to Selenium when we disable /dev/shm?

Selenium itself won't change any behavior; the behavior changes only in Chrome (--disable-dev-shm-usage).

Do you mean users would need to run chmod +x /dev/shm before using the Crawler? Or is there also an issue of permissions where that chmod command could be rejected?

Chrome will try to write to /dev/shm if no data and temp directories are provided. Without proper permissions, it will crash. Furthermore, this directory needs to have enough space available (it is backed by RAM).

At this link there is a discussion about similar issues:

Or change to /tmp or another directory instead of /dev/shm even when /dev/shm is available, but too small (e.g. Docker).

It seems like there are some cons to having --disable-dev-shm-usage on by default, making me think allowing users to choose what options to pass is the way to go.

When you disable it, you are basically telling Chrome to write temporary data (e.g., profile, cache) to disk (/tmp, or a directory passed as a parameter) instead of to shared memory. In most container environments, you can't write to shared memory by default.
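For example, the profile and cache locations can be pointed at disk explicitly (the directory paths here are just examples):

```python
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--disable-dev-shm-usage")               # don't use shared memory
options.add_argument("--user-data-dir=/tmp/chrome-profile")   # keep the profile on disk
options.add_argument("--disk-cache-dir=/tmp/chrome-cache")    # keep the cache on disk
```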

Maybe, since Haystack is already a RAM-heavy framework when doing NLP, it would be better to keep the Crawler from using shared RAM and the GPU, so other NLP tasks on the same machine won't be affected. Chrome would use the disk instead.

I think the cons of disabling /dev/shm writing apply to testing scenarios (see the quote below), like Selenium Grid and others where there are many processes and tabs (a dedicated testing environment), so disk I/O may increase. In the Haystack scenario, using shared RAM or the GPU is, from my perspective, not the preferred option.

Hey Aslushnikov, I'll see when I have time to run a clean benchmark.

My setup consists of puppeteer pods running on Kubernetes. These pods autoscale to an average of 10 concurrent puppeteer requests. With an absolute max of 20, and everything higher is queued, but this only happens on very high spikes while the autoscaler needs to scale. Each puppeteer request consists of an incoming HTTP request, that starts a new chromium browser, executes the script and returns either a rendered html or a screenshot.

If these pods are allocated enough CPU, the next bottleneck will be the disk IO, especially in a docker container. Running using --disable-dev-shm-usage will only impact this bottleneck further. To be honest to mitigate this bottleneck I even went further and moved the chromium profiles and chromium executable to /dev/shm completely. Since there were multiple browsers opening/closing every second, causing repetitive disk reads and writes.

This approach allowed me to scale for peaks up to 400 concurrent sessions, and if needed probably a lot higher.

Now while --disable-dev-shm-usage is a perfectly valid choice when you're just running small tasks, and performance doesn't matter that much. I ultimately think this is a choice that should be left to the developer itself. Especially since not everyone is running in docker to begin with. Otherwise we might start adding flags like --disable-gpu or --no-sandbox by default as well, since these also cause issues in docker.

My proposal would be to leave these flags out of any default configuration, but add a clear FAQ with docker caveats instead.

About the GPU option, you can see the puppeteer devs' comment in the same referenced issue:

In fact, we do still add --disable-gpu because enough users were having problems across platforms.

Yet in the same referenced issue, many other issues referencing it are shown, and most projects have chosen to disable shared memory writing. But there are also problems when writing to disk in heavy testing scenarios. That makes me think about a default option (the "most" safe one), plus the possibility to set user-defined options. 🤔

What about allowing the user to send a list of parameters to be passed to Chrome and, if none are sent, using a predefined set of options that would be the safest for a scenario where we can't:

  • control the container image used
  • control the disk permissions applied (maybe too intrusive?)

And we definitely know that someone using Haystack will be doing NLP, so the chances of RAM/GPU contention are higher.

And one last point: GPU instances are expensive in the cloud; there's no need to let a simple Crawler utility use them.
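A minimal sketch of that fallback idea (the webdriver_options name and the default list are assumptions, not the actual Haystack API):

```python
from typing import List, Optional
from selenium.webdriver.chrome.options import Options

# Hypothetical "safe" defaults for container environments we don't control.
DEFAULT_SAFE_OPTIONS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu"]

def build_chrome_options(webdriver_options: Optional[List[str]] = None) -> Options:
    """Use the user-provided flags if given, otherwise fall back to the safe defaults."""
    options = Options()
    for arg in webdriver_options or DEFAULT_SAFE_OPTIONS:
        options.add_argument(arg)
    return options
```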

@TuanaCelik
Contributor

Hey @sjrl and @danielbichuetti - as far as I can tell, this issue is related to PR #2921, which has been closed. I will close this issue since the PR was merged. Let me know if that's wrong and I will re-open it.

@danielbichuetti
Contributor Author

danielbichuetti commented Aug 4, 2022

@TuanaCelik Hi

This issue is not related to #2921. I found it after that, when running tests on Azure ML notebooks, where it won't work by default. Furthermore, it won't work in other Docker container environments because shm access is not disabled.

This issue is related to how the Chrome driver is set up in the Crawler's init (the options being sent to the Chrome process's command line).

TuanaCelik reopened this Aug 4, 2022
@TuanaCelik
Contributor

@danielbichuetti - Got it, thanks for the context. I've re-opened it now :)

@sjrl
Contributor

sjrl commented Aug 4, 2022

What about allowing the user to send a list of parameters to be passed to Chrome and, if none are sent, using a predefined set of options that would be the safest for a scenario where we can't:

@danielbichuetti I agree this sounds like the best idea. Would you be willing to go ahead and open a PR for this? And for the predefined set of options, I think making it clear in the documentation which ones are being used would be very helpful.

@danielbichuetti
Contributor Author

@sjrl Sure! I will open the PR. Thank you!
