Streamlit Selenium Test

A Streamlit project to test running Selenium in the Streamlit Cloud runtime.

  • Local Windows 10 machine works
  • Local Docker container works
  • Streamlit Cloud runtime works (see the deployed example app)

Issues 🐛

  • The example fails on Streamlit Cloud with a TimeoutException: GeoIP blocking is active on the target website, which answers with a 403 response. A proxy can optionally be enabled to bypass this.
  • However, only free proxies are used here, and they are not very reliable. With proxies enabled, the example is not very stable and can fail sometimes. Sometimes no proxies are available at all.
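
Because free proxies fail often, one way to make the example more robust is to try several candidates in turn. The following is only a minimal sketch, not the code used in this repository: `first_working_proxy` and the `check` callable are hypothetical names. In a real app, `check` would start a driver through the given proxy and try to load the target page.

```python
from typing import Callable, Iterable, Optional


def first_working_proxy(
    proxies: Iterable[str],
    check: Callable[[str], bool],
) -> Optional[str]:
    """Return the first proxy for which check() succeeds, or None."""
    for proxy in proxies:
        try:
            if check(proxy):
                return proxy
        except Exception:
            # A dead proxy usually shows up as a timeout or connection error;
            # any failure simply means "try the next candidate".
            continue
    # Covers both "all proxies failed" and "no proxies available".
    return None
```

If the function returns `None`, the app can report that no usable proxy was found instead of failing with a raw exception.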

ToDo ☑️

  • improve example
  • fix proxy issues
  • try also undetected_chromedriver package
  • try also seleniumbase package

Problem 🤔

The suggestion for this repo came from a post on the Streamlit Community Forum.

https://discuss.streamlit.io/t/issue-with-selenium-on-a-streamlit-app/11563

It is not that easy to install and use a Selenium-based web scraper in container-based environments. On a local computer this usually works much more smoothly, because a browser is already installed and can be controlled by the associated webdriver. In container-based environments, however, headless operation is mandatory because no UI is available there.

Therefore, in this repository a small example is given to get Selenium working on:

  • Local Windows 10 machine
  • Local Docker container that mimics the Streamlit Cloud runtime
  • Streamlit Community Cloud runtime

Proxy 😎

Because some websites block requests from certain countries (GeoIP blocking) or from certain IP ranges, a proxy can be used to bypass this. The example app has a checkbox to enable a proxy, and you can choose between socks4 and socks5 proxies. However, socks4 does not work at all. The socks5 proxy is taken from a public list of free proxies and is not very reliable, so the example is not very stable with proxies enabled and can fail quite often.
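
Chromium accepts a proxy through its `--proxy-server` command-line switch. The sketch below shows how such an argument could be assembled. `proxy_argument` is a hypothetical helper, not a function from this repository, and the host and port in the usage note are placeholders.

```python
def proxy_argument(host: str, port: int, scheme: str = "socks5") -> str:
    """Build a Chromium --proxy-server argument for a SOCKS proxy."""
    if scheme not in {"socks4", "socks5"}:
        raise ValueError(f"unsupported proxy scheme: {scheme!r}")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    # Chromium expects the form --proxy-server=<scheme>://<host>:<port>
    return f"--proxy-server={scheme}://{host}:{port}"
```

With Selenium, the result would be passed to the browser options, for example `options.add_argument(proxy_argument("127.0.0.1", 1080))`.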

Pitfalls 🚩

  • To use Selenium (even headless in a container), you always need two components installed on your machine:
    • A web browser and its associated webdriver.
  • The version of the headless web browser and the version of its associated webdriver must always match.
  • If you are using Selenium in a Docker container or on Streamlit Cloud, the --headless option is mandatory, because there is no graphical user interface available.
  • There are three options of webbrowser/webdriver combinations for Selenium:
    1. chrome & chromedriver
    2. chromium & chromedriver
    3. firefox & geckodriver
  • Unfortunately, not all of these packages are available in the default Debian Bullseye apt package repositories. If we want an installation from the default repositories, only chromium & chromedriver is left.
  • The chromedriver has many options that can be set. It may be necessary to tweak these options on different platforms to make headless operation work.
  • The chromedriver, selenium and their options change quite a lot over time. A lot of the information on Stack Overflow regarding chromedriver/selenium is outdated.
  • Deployment to Streamlit Cloud has unfortunately failed sometimes in the past, and neither a concrete cause nor an informative error message could be identified. Currently it seems to be stable on Streamlit Cloud.
  • To run this Streamlit app on Windows, the Windows chromedriver.exe must be stored in the root folder of this repository or added to the Windows PATH. Be aware that the version of this chromedriver must match the version of your installed Chrome browser.
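
The pitfalls above can be sketched as a small driver setup for the chromium & chromedriver combination. This is a hedged example for Selenium 4, not the exact code of the example app. The flag list is one commonly used set for containers and may need tweaking. The default path `/usr/bin/chromedriver` is an assumption about where the Debian chromium-driver package installs the binary. `headless_chromium_arguments` and `build_driver` are hypothetical names.

```python
from typing import List


def headless_chromium_arguments() -> List[str]:
    """Command-line switches often needed for Chromium inside a container."""
    return [
        "--headless",               # mandatory: no graphical UI in the container
        "--no-sandbox",             # the sandbox often cannot start in containers
        "--disable-dev-shm-usage",  # avoid crashes caused by a small /dev/shm
        "--disable-gpu",            # no GPU available in the container
    ]


def build_driver(driver_path: str = "/usr/bin/chromedriver"):
    """Create a headless Chrome/Chromium driver (driver_path is an assumption)."""
    # Imported lazily so the helper above also works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for argument in headless_chromium_arguments():
        options.add_argument(argument)
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)
```

On Windows, `driver_path` would instead point to the chromedriver.exe mentioned above.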

Development Setup 🛠️

In the Streamlit Cloud runtime, neither chrome with its chromedriver nor geckodriver is available in the default apt package sources. Only chromium and chromium-driver can be installed from them.

The Streamlit Cloud runtime seems to be very similar to the official docker image python:3.XX-slim-bullseye on Docker Hub, which is based on Debian Bullseye.

In this repository a Dockerfile is provided that mimics the Streamlit Cloud runtime. It can be used for local testing.

A packages.txt is provided with the following minimal content:

chromium
chromium-driver

A requirements.txt is provided with the following minimal content:

streamlit
selenium

Docker 🐋

Docker Container local

The provided Dockerfile tries to mimic the Streamlit Cloud runtime.

Build local custom Docker Image from Dockerfile

docker build --progress=plain --tag selenium:latest .

Run custom Docker Container

docker run -ti -p 8501:8501 --rm selenium:latest
docker run -ti -p 8501:8501 --rm selenium:latest /bin/bash
docker run -ti -p 8501:8501 -v $(pwd):/app --rm selenium:latest  # linux
docker run -ti -p 8501:8501 -v ${pwd}:/app --rm selenium:latest  # powershell
docker run -ti -p 8501:8501 -v %cd%:/app --rm selenium:latest    # cmd.exe

Selenium 👁️

https://selenium-python.readthedocs.io/getting-started.html

pip install selenium

Chromium 🕸️

Required packages to install

apt install chromium
apt install chromium-driver
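
After installing both packages, it can help to verify that the browser and driver versions match, as required above. The following sketch compares only the major versions. It assumes that the binaries are called `chromium` and `chromedriver`, that both are on the PATH, and that `--version` prints a dotted version number such as `120.0.6099.224`. All function names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional


def major_version(version_output: str) -> Optional[int]:
    """Extract the major version from text like 'Chromium 120.0.6099.224 ...'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def installed_major_version(binary: str) -> Optional[int]:
    """Run '<binary> --version' and return its major version, or None."""
    path = shutil.which(binary)
    if path is None:
        return None  # binary is not installed or not on the PATH
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return major_version(result.stdout)


def versions_match(browser: str = "chromium", driver: str = "chromedriver") -> bool:
    """True only if both binaries were found and share a major version."""
    browser_version = installed_major_version(browser)
    driver_version = installed_major_version(driver)
    return browser_version is not None and browser_version == driver_version
```

Calling `versions_match()` at app startup would turn a version mismatch into a clear message instead of an obscure webdriver error.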

Chromium Options

https://peter.sh/experiments/chromium-command-line-switches/

undetected_chromedriver 🤷‍♂️

Another option to try, not yet done...

Status ✔️

Last changed: 2024-06-13