Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter #3043

Merged

Conversation

danielbichuetti
Copy link
Contributor

@danielbichuetti danielbichuetti commented Aug 15, 2022

Related Issues

Proposed Changes:

  • Add a webdriver_options parameter, so the user can set the parameters that his specific scenario demands
  • Automatically set some Webdriver options in Azure ML and Google Colab to avoid crashes
    • --disable-dev-shm-usage
  • Automatically set some Webdriver options in scenario where Chrome will run as root to avoid crashes
  • --no-sandbox
  • Change default parameters:
    • --disable-gpu There is no need to enable GPU acceleration on a text crawler. GPU is expensive in the cloud.
    • --disable-dev-shm-usage Usually in container environments there is no access to shared memory, or it's set with the default size of 64 MB. Disabling its usage, Chrome will write to a temporary directory. Using shared memory may improve performance just in high testing workloads, which is not our use case.
      - --single-process As haystack doesn't support multi-tabs, there is no point to let Chrome spawn multiple processes. It will just increase the memory footprint.

How did you test it?

  • Unit tests
  • Manual tests based on python:3.9-slim-bullseye container image
  • Manual tests inside Azure ML
  • Manual tests inside haystack Dockerfiles

Notes for the reviewer

This changes essentially will make haystack Crawler node able to run in multiple environments without any user extra effort. In case any user has more specific needs, it will be possible using the parameter.

Checklist

@danielbichuetti danielbichuetti requested review from a team as code owners August 15, 2022 17:12
@danielbichuetti danielbichuetti changed the title [[BUG] Update crawler webdriver options [BUG] Update crawler webdriver options Aug 15, 2022
@danielbichuetti danielbichuetti changed the title [BUG] Update crawler webdriver options bug: update crawler webdriver options Aug 15, 2022
Copy link
Contributor

@agnieszka-m agnieszka-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I added some minor language comments to comply with our writing guidelines. Thanks!

haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
Copy link
Contributor

@agnieszka-m agnieszka-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@danielbichuetti danielbichuetti changed the title bug: update crawler webdriver options fix: update crawler webdriver options Aug 16, 2022
@danielbichuetti danielbichuetti requested a review from sjrl August 18, 2022 18:56
@sjrl
Copy link
Contributor

sjrl commented Aug 19, 2022

@danielbichuetti This is looking good to me! Although I think we should have someone from the core-engineering team for the final approval.

@danielbichuetti danielbichuetti changed the title fix: update crawler webdriver options bug: update crawler webdriver options Aug 19, 2022
@danielbichuetti danielbichuetti changed the title bug: update crawler webdriver options bug: fix Crawler on restricted environment containers and add Webdriver options as parameter Aug 19, 2022
@danielbichuetti danielbichuetti changed the title bug: fix Crawler on restricted environment containers and add Webdriver options as parameter bug: fix Crawler on restricted environment containers and add ChromeDriver options as parameter Aug 19, 2022
@danielbichuetti danielbichuetti changed the title bug: fix Crawler on restricted environment containers and add ChromeDriver options as parameter fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter Aug 19, 2022
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

@ZanSara ZanSara merged commit d715d02 into deepset-ai:main Aug 22, 2022
@danielbichuetti danielbichuetti deleted the update_crawler_webdriver_options branch August 22, 2022 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crawler is crashing in some container environments
5 participants