Crawler is crashing in some container environments #2946

Closed
1 task done
danielbichuetti opened this issue Aug 2, 2022 · 9 comments · Fixed by #3043
Labels
topic:crawler type:bug Something isn't working

Comments

@danielbichuetti
Contributor

danielbichuetti commented Aug 2, 2022

Describe the bug
The Crawler crashes when running in some containers, e.g., Azure ML and OpenShift.

Error message
A WebDriverException is raised: the Chrome driver detects that Chrome stopped responding and reports that it probably crashed.

Expected behavior
The Crawler should run without issues in containers.

Investigation
After some investigation, this appears to be a similar error to the one that occurred in the Google Colab environment. Without the proper flags, Chrome will try to use /dev/shm, and most container images don't grant permission for that (unless you set it up explicitly).
The Selenium repository where they store their Dockerfiles shows their "recommended" setup for a container image. Some important files:

The base image is the most important factor regarding crashes in general. Disabling shm, as in the Colab code, makes it unnecessary to set up permissions on shm in some environments.
Furthermore, I should note that Chrome sometimes crashes if you don't disable audio and pulseaudio is not installed (this has been tested on an AWS Lambda container). So maybe the Haystack Dockerfile should include it, or the Crawler should disable audio.
When using NVIDIA CUDA images, disabling Chrome's GPU usage also reduced random crashes.
As a side note, using the Chrome option to spawn a single process saves some memory.
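A minimal sketch of passing the flags discussed above to the Chrome WebDriver via Selenium (illustrative only; this exact set is not Haystack's current default, and other audio-related switches exist besides --mute-audio):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")               # no display available in containers
options.add_argument("--disable-dev-shm-usage")  # don't write temporary data to /dev/shm
options.add_argument("--disable-gpu")            # fewer random crashes on CUDA images
options.add_argument("--mute-audio")             # avoid audio crashes when pulseaudio is missing
options.add_argument("--single-process")         # saves some memory

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```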

I've made a test branch; running it on Azure ML, OpenShift, EKS, and AKS showed increased stability, with no Crawler crashes. Note: this is just a quick test branch, not the one intended for a PR (if there is one).

Perhaps the Azure ML settings should be made the default, since Haystack doesn't allow the user to configure Selenium WebDriver options. Setting the safest options would possibly be the recommended approach.

To Reproduce
Create a Dockerfile with Haystack (Crawler enabled) and use it to build an environment in Azure ML. Try to run the Crawler using Jobs.

FAQ Check

System:

  • OS: Linux
  • GPU/CPU: NVIDIA V100
  • Haystack version (commit or version number): 1.6.1rc
@sjrl
Contributor

sjrl commented Aug 2, 2022

Perhaps the Azure ML settings should be made the default, since Haystack doesn't allow the user to configure Selenium WebDriver options. Setting the safest options would possibly be the recommended approach.

Perhaps another solution would be to add options to the initialization of the Crawler so the user could configure Selenium WebDriver options. I agree we should find a reasonable set of default options, but it is unlikely we can find the "safest" options for all scenarios.
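For illustration, the Crawler's init could accept something like this (the webdriver_options parameter name is hypothetical, just to show the idea):

```python
from haystack.nodes import Crawler

# Hypothetical parameter: extra Chrome flags supplied at construction time.
crawler = Crawler(
    output_dir="crawled_files",
    webdriver_options=["--disable-dev-shm-usage", "--disable-gpu"],  # user-chosen flags
)
```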

Also what happens to Selenium when we disable /dev/shm?

@sjrl
Contributor

sjrl commented Aug 2, 2022

Without the proper flags, Chrome will try to use /dev/shm, and most container images don't grant permission for that (unless you set it up explicitly).

Do you mean users would need to run chmod +x /dev/shm before using the Crawler? Or is there also an issue of permissions where that chmod command could be rejected? We could consider adding a try-except to the Crawler that runs that chmod command and, if it fails, passes the option --disable-dev-shm-usage.
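A rough sketch of that fallback (the mode and path handling here are assumptions, not an existing Haystack mechanism):

```python
import os
from selenium.webdriver.chrome.options import Options

options = Options()
try:
    # Try to make /dev/shm writable; in most containers this requires root
    # and raises PermissionError.
    if not os.access("/dev/shm", os.W_OK):
        os.chmod("/dev/shm", 0o1777)
except OSError:
    # Fall back to letting Chrome write its temporary data to disk instead.
    options.add_argument("--disable-dev-shm-usage")
```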

@sjrl
Contributor

sjrl commented Aug 2, 2022

There is a pretty good discussion of the issues the Chrome webdriver has in Docker environments in general in puppeteer/puppeteer#1834. It seems like there are some cons to having --disable-dev-shm-usage on by default, making me think allowing users to choose what options to pass is the way to go. And to add documentation warning about the common issues that one might face when running the Crawler in a Docker environment.

@danielbichuetti
Contributor Author

danielbichuetti commented Aug 2, 2022

Also what happens to Selenium when we disable /dev/shm?

Selenium itself won't change any behavior; the behavior changes only in Chrome (--disable-dev-shm-usage).

Do you mean users would need to run chmod +x /dev/shm before using the Crawler? Or is there also an issue of permissions where that chmod command could be rejected?

Chrome will try to write to /dev/shm if no data and temp directories are provided. Without proper permissions, it will crash. Furthermore, this directory needs to have enough space available (it is backed by RAM).

At this link there is a discussion about similar issues:

Or change to /tmp or another directory instead of /dev/shm even when /dev/shm is available, but too small (e.g. Docker).

It seems like there are some cons to having --disable-dev-shm-usage on by default, making me think allowing users to choose what options to pass is the way to go.

When you disable it, you are basically telling Chrome to write temporary data (e.g., profile, cache) to disk (/tmp, or a directory passed as a parameter) instead of to shared memory. In most container environments, you can't write to shared memory by default.
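For example, the profile and cache locations can be pointed at disk explicitly (the directory paths here are just examples):

```python
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--disable-dev-shm-usage")               # don't use shared memory
options.add_argument("--user-data-dir=/tmp/chrome-profile")   # keep the profile on disk
options.add_argument("--disk-cache-dir=/tmp/chrome-cache")    # keep the cache on disk
```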

Maybe, since Haystack is already a RAM-heavy framework when doing NLP, it would be better to keep the Crawler from using shared RAM and the GPU, so other NLP tasks on the same machine won't be affected. Chrome would use the disk instead.

I think the cons of disabling /dev/shm writing apply to testing scenarios (see the quote below), like Selenium Grid and others where there are many processes and tabs (a dedicated testing environment), so disk I/O may increase. In the Haystack scenario, using shared RAM or the GPU is, from my perspective, not the preferred option.

Hey Aslushnikov, I'll see when I have time to run a clean benchmark.

My setup consists of puppeteer pods running on Kubernetes. These pods autoscale to an average of 10 concurrent puppeteer requests. With an absolute max of 20, and everything higher is queued, but this only happens on very high spikes while the autoscaler needs to scale. Each puppeteer request consists of an incoming HTTP request, that starts a new chromium browser, executes the script and returns either a rendered html or a screenshot.

If these pods are allocated enough CPU, the next bottleneck will be the disk IO, especially in a docker container. Running using --disable-dev-shm-usage will only impact this bottleneck further. To be honest to mitigate this bottleneck I even went further and moved the chromium profiles and chromium executable to /dev/shm completely. Since there were multiple browsers opening/closing every second, causing repetitive disk reads and writes.

This approach allowed me to scale for peaks up to 400 concurrent sessions, and if needed probably a lot higher.

Now while --disable-dev-shm-usage is a perfectly valid choice when you're just running small tasks, and performance doesn't matter that much. I ultimately think this is a choice that should be left to the developer itself. Especially since not everyone is running in docker to begin with. Otherwise we might start adding flags like --disable-gpu or --no-sandbox by default as well, since these also cause issues in docker.

My proposal would be to leave these flags out of any default configuration, but add a clear FAQ with docker caveats instead.

About the GPU option, you can see the puppeteer devs' comment in the same referenced issue:

In fact, we do still add --disable-gpu because enough users were having problems across platforms.

Yet in the same referenced issue, many other issues referencing it are shown, and most projects have chosen to disable shared memory writing. But there are also problems when writing to disk in heavy testing scenarios. That makes me think about a default option (the "most" safe one), plus the possibility to set user-defined options. 🤔

What about allowing the user to send a list of parameters to be passed to Chrome and, if none are sent, using a predefined set of options that would be the safest for a scenario where we can't:

  • control the container image used
  • control the disk permissions applied (maybe too intrusive?)

And we definitely know that someone using Haystack will be doing NLP, so the chances of RAM/GPU contention are higher.

And one last point: GPU instances are expensive in the cloud; there's no need to let a simple Crawler utility use them.
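A minimal sketch of that fallback idea (the webdriver_options name and the default list are assumptions, not the actual Haystack API):

```python
from typing import List, Optional
from selenium.webdriver.chrome.options import Options

# Hypothetical "safe" defaults for container environments we don't control.
DEFAULT_SAFE_OPTIONS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu"]

def build_chrome_options(webdriver_options: Optional[List[str]] = None) -> Options:
    """Use the user-provided flags if given, otherwise fall back to the safe defaults."""
    options = Options()
    for arg in webdriver_options or DEFAULT_SAFE_OPTIONS:
        options.add_argument(arg)
    return options
```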

@TuanaCelik
Contributor

Hey @sjrl and @danielbichuetti - as far as I can tell, this issue is related to PR #2921, which has been closed. I will close this issue since the PR was merged. Let me know if that's wrong and I will re-open it.

@danielbichuetti
Contributor Author

danielbichuetti commented Aug 4, 2022

@TuanaCelik Hi

This issue is not related to #2921. I found it after that, when running tests on Azure ML notebooks, where it won't work by default. Furthermore, it won't work in other Docker container environments because shm access is not disabled.

This issue is related to how the Chrome driver is set up in the Crawler's init (the options being sent to the Chrome process's command line).

TuanaCelik reopened this Aug 4, 2022
@TuanaCelik
Contributor

@danielbichuetti - Got it, thanks for the context. I've re-opened it now :)

@sjrl
Contributor

sjrl commented Aug 4, 2022

What about allowing the user to send a list of parameters to be passed to Chrome and, if none are sent, using a predefined set of options that would be the safest for a scenario where we can't:

@danielbichuetti I agree this sounds like the best idea. Would you be willing to go ahead and open a PR for this? And for the predefined set of options, I think making it clear in the documentation which ones are being used would be very helpful.

@danielbichuetti
Contributor Author

@sjrl Sure! I will open the PR. Thank you!
