
Dark Scrubber

Features

  • Scanning of multiple websites
  • Updates sent by email
  • Anti-detection:
    • Variable IP through use of proxy addresses
    • Variable check intervals
    • Use of the pyppeteer_stealth package
  • Packaging into a standalone application using PyInstaller
  • Automatic update of the web browser on start and during runtime

Installation

Tested working on Ubuntu 20.04 and Windows 10.

Python environment

.. code-block:: console

   $ conda create --name site_scraper python=3.10
   $ conda activate site_scraper
   $ python -m pip install -r requirements.txt

Dependencies

See requirements.txt


Configuration

Main configuration is specified in config_base.py, which contains the base Config class.

Template application

The /applications/template/ folder contains template files and instructions for creating your own scraper.
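
For example, a new scraper could start from a copy of the template (the target folder name my_scraper is illustrative):

.. code-block:: console

   $ cp -r applications/template applications/my_scraper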

Sending emails

The scraper sends emails from the supplied email address to itself. To set up the email functionality, ensure that the mail_info attribute of the Config class contains the following fields (see the sketch below):

  • addr: Gmail address with a working application password. See Google's help pages for setting up application passwords.
  • app_pw: the application password.
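
A minimal sketch of an email-enabled configuration, assuming mail_info is a plain dictionary read by the Mailer class (the dict form, the subclassing pattern, and the placeholder values are assumptions, not the project's confirmed API):

.. code-block:: python

   # A minimal sketch; assumes mail_info is a plain dict consumed by Mailer.
   from config_base import Config

   class MyConfig(Config):
       mail_info = {
           "addr": "example.user@gmail.com",  # Gmail address; mails itself
           "app_pw": "abcdefghijklmnop",      # Gmail application password
       }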

File structure

  • The /applications/ directory contains the scraper applications:

    • template: boilerplate for creating your own scraper.
    • PyInstaller: compiling binary executables of the programs.

  • The /lib/ directory contains the core Python modules:

    • scraper.py: the base Scraper class.
    • browser.py: the Browser, Puppeteer, and SeleniumFireFox classes. Puppeteer uses Chromium; SeleniumFireFox uses the Firefox browser of the selenium package. The puppeteer-stealth functionality is only available with the Puppeteer browser.
    • proxy.py: the ProxyRequester class, which performs the proxy-related features.
    • proxy_utils.py: proxy utility classes, also covering proxy-related features.
    • mailer.py: the base Mailer class.
    • utils.py: functions shared across modules.

  • The /htmls/ directory contains example .html files used when sending emails.

Proxy functionality

Enable use of proxy functionality by setting use_proxy=True in the Config class.
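
A minimal sketch, assuming Config options are overridden as class attributes (the attribute name use_proxy is from the text above; the subclassing pattern is an assumption):

.. code-block:: python

   # A minimal sketch; only use_proxy is named in the text above.
   from config_base import Config

   class MyConfig(Config):
       use_proxy = True  # rotate IPs through the configured proxy addresses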

PyInstaller

PyInstaller creates a standalone executable package of the scraper. Packaging is performed by running the make_executable.py script inside the application sub-folders.

  • Notes:
    • The executable is only compatible with the OS it was created on.
    • Run the make_executable.py script with the --onefile argument to create a single executable file (see the example below). Otherwise, the packaged program folder will contain, besides the target executable file, a collection of linked library files.
    • UPX can be used to compress the compiled executable. It must be visible to PyInstaller, either by being on the PATH environment variable or through the --upx-dir option supplied to PyInstaller.
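
For example, from inside an application sub-folder (whether make_executable.py accepts flags beyond --onefile is an assumption):

.. code-block:: console

   $ cd applications/template/
   $ python make_executable.py --onefile   # pack everything into one executable file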

Troubleshooting

If you have trouble connecting on Linux:

  • ensure that /etc/hosts contains the line 127.0.0.1 localhost (see the check below)
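
A quick way to verify:

.. code-block:: console

   $ grep localhost /etc/hosts
   127.0.0.1       localhost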


Start on boot using systemctl daemon

The /systemctl/ folder contains an example systemd service file which can be used to run this script automatically from boot on Unix systems. The following code block shows how to install and enable the service. For more information, see the systemd documentation.

.. code-block:: console

    $ sudo cp systemctl/example_scraper.service /etc/systemd/system/    # install service file
    $ sudo systemctl daemon-reload                                      # reload unit files
    $ sudo systemctl start example_scraper                              # start service
    $ sudo systemctl enable example_scraper                             # automatically start on boot


Further development options

  • Integration of other methods of notification delivery