- Scanning of multiple websites
- Updates sent by email
- Anti-detection
- Variable IP through use of proxy addresses
- Variable check intervals
- Usage of the pyppeteer_stealth package
- Packaging into a standalone application using PyInstaller
- Automatic update of the web browser upon start and during runtime
Tested working on Ubuntu 20.04 and Windows 10.
.. code-block:: console

   $ conda create --name site_scraper python=3.10
   $ conda activate site_scraper
   $ <python> -m pip install -r requirements.txt
See requirements.txt
Main configuration is specified in ``config_base.py``, which contains the base ``Config`` class.

The ``/applications/template/`` folder contains template files and instructions for creating your own scraper.
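As a rough sketch of how an application-level configuration might look, a new scraper can subclass the base ``Config`` class; the attribute names and value shapes below are assumptions for illustration, not the actual API:

.. code-block:: python

   # Hypothetical application config subclassing the base Config from
   # config_base.py; all attribute names below are illustrative only.
   from config_base import Config


   class MyScraperConfig(Config):
       urls = ["https://example.com/page-to-watch"]  # sites to scan
       check_interval = (300, 900)  # variable check interval range, in seconds
       use_proxy = True             # enables the proxy features described below
       mail_info = {                # shape assumed; see the email setup section
           "addr": "me@gmail.com",           # Gmail address
           "app_pw": "abcd efgh ijkl mnop",  # Gmail application password
       }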
The scraper sends emails from the supplied email address to itself. To set up the email functionality, ensure that the ``mail_info`` attribute of the ``Config`` class contains the following entries:

- ``addr``: Gmail address with a working application password. See Google's help pages for setting up application passwords.
- ``app_pw``: the application password.
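For reference, sending a self-addressed HTML notification through Gmail with an application password can look like this. This is a minimal standalone sketch using only the standard library, independent of the project's ``Mailer`` class; the file name under ``/htmls/`` is made up:

.. code-block:: python

   # Minimal sketch: send a self-addressed HTML mail via Gmail, logging in
   # with an application password rather than the account password.
   import smtplib
   from email.mime.text import MIMEText

   addr = "me@gmail.com"            # mail_info addr
   app_pw = "abcd efgh ijkl mnop"   # mail_info app_pw

   with open("htmls/update_found.html", encoding="utf-8") as f:  # hypothetical file
       msg = MIMEText(f.read(), "html")
   msg["Subject"] = "Site update detected"
   msg["From"] = addr
   msg["To"] = addr                 # the scraper mails the address to itself

   with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
       server.login(addr, app_pw)
       server.send_message(msg)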
- The ``/applications/`` directory contains the various scraper applications:

  - ``template``: Boilerplate for creating your own scraper.
  - ``PyInstaller``: Compiling binary executables of the programs.
- The ``/lib/`` directory contains Python scripts for the core functionality (a usage sketch follows this list):

  - ``scraper.py`` contains the ``Scraper`` class: the base scraper class.
  - ``browser.py`` contains the ``Browser``, ``Puppeteer``, and ``SeleniumFireFox`` classes. ``Puppeteer`` uses Chromium; ``SeleniumFireFox`` uses the Firefox browser of the ``selenium`` package. The ``puppeteer-stealth`` functionality is only available when using the ``Puppeteer`` browser.
  - ``proxy.py`` contains the ``ProxyRequester`` class, which performs the proxy-related features.
  - ``proxy_utils.py`` contains proxy-utility classes.
  - ``mailer.py`` contains the ``Mailer`` class: the base mailer class.
  - ``utils.py`` contains shared functions.
- The ``/htmls/`` directory contains example ``.html`` files which are used when sending emails.
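To give a feel for how these pieces fit together, here is a rough usage sketch; the constructor and method names are assumptions based on the class descriptions above, not the actual interfaces:

.. code-block:: python

   # Illustrative only: wiring the /lib/ classes together in an application.
   # All signatures and hook names below are assumed, not verbatim.
   from lib.browser import Puppeteer
   from lib.mailer import Mailer
   from lib.scraper import Scraper


   class MyScraper(Scraper):
       def parse(self, html):
           # Application-specific check (hypothetical hook): return True
           # when the watched condition is met.
           return "in stock" in html.lower()


   browser = Puppeteer()          # Chromium-based; supports puppeteer-stealth
   scraper = MyScraper(browser)   # constructor signature assumed
   if scraper.check("https://example.com/page-to-watch"):
       Mailer().send("Site update detected")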
Enable the proxy functionality by setting ``use_proxy=True`` in the ``Config`` class.
- Scours a number of public websites for free proxies.
- Assesses the operating status of proxies before use; see the ``Config`` class for the testing parameters.
- Proxies can be filtered by source country.
- Uses multiprocessing to retrieve and test new proxies, reducing waiting time (see the sketch below).
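The testing step can be pictured as follows. This is a standalone sketch of the general technique, not the project's ``ProxyRequester`` code, and the test URL, timeout, and pool size are arbitrary choices:

.. code-block:: python

   # Sketch: liveness-test candidate proxies in parallel with multiprocessing.
   from multiprocessing import Pool

   import requests

   TEST_URL = "https://httpbin.org/ip"  # any reachable endpoint works


   def proxy_works(proxy: str) -> bool:
       """Return True if a request through the given proxy succeeds."""
       try:
           resp = requests.get(
               TEST_URL,
               proxies={"http": proxy, "https": proxy},
               timeout=5,
           )
           return resp.ok
       except requests.RequestException:
           return False


   def filter_working(proxies: list[str]) -> list[str]:
       """Test all candidates in a worker pool and keep the live ones."""
       with Pool(processes=8) as pool:
           results = pool.map(proxy_works, proxies)
       return [p for p, ok in zip(proxies, results) if ok]


   if __name__ == "__main__":
       candidates = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]  # placeholders
       print(filter_working(candidates))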
Creates a standalone executable package of the scraper. This is done by running the ``make_executable.py`` script inside the application sub-folders (see the sketch below).

- Notes:

  - The executable is only compatible with the OS it was created on.
  - Run the ``make_executable.py`` script with the ``--onefile`` argument to create a single executable file. Otherwise, the packaged program folder will contain, besides the target executable file, a collection of linked library files.
  - UPX can be used for compressing the compiled executable. It must be visible to PyInstaller, either by being on the ``PATH`` environment variable or by supplying ``--upx-dir`` to PyInstaller.
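The core of such a script can be as small as the following. This is a sketch built on PyInstaller's documented Python entry point; the entry-script name and UPX path are placeholders, and whether ``make_executable.py`` works exactly this way is not implied:

.. code-block:: python

   # Sketch of a make_executable.py-style script using PyInstaller's
   # Python entry point; file names and paths are illustrative.
   import PyInstaller.__main__

   PyInstaller.__main__.run([
       "main.py",                 # entry script of the scraper application
       "--onefile",               # bundle into a single executable file
       "--upx-dir", "/opt/upx",   # optional: directory holding the UPX binary
   ])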
When having trouble connecting on Linux:

- Ensure that ``/etc/hosts`` contains ``127.0.0.1 localhost``.
The ``/systemctl/`` folder contains an example systemd service configuration file which can be used to run this script automatically from boot on Unix systems. The following code block shows how to enable the service.
.. code-block:: console

   $ sudo cp systemctl/example_scraper.service /etc/systemd/system/  # service file location
   $ sudo systemctl daemon-reload                                    # pick up the new unit
   $ sudo systemctl start example_scraper                            # start service
   $ sudo systemctl enable example_scraper                           # automatically start on boot
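For orientation, a unit file along these lines would fit the description; the paths and names here are illustrative, not the contents of the bundled example file:

.. code-block:: ini

   [Unit]
   Description=Example site scraper
   After=network-online.target

   [Service]
   Type=simple
   WorkingDirectory=/opt/site_scraper
   ExecStart=/opt/site_scraper/venv/bin/python applications/example/main.py
   Restart=on-failure

   [Install]
   WantedBy=multi-user.target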
- Integration of other notification delivery methods:

  - Gotify / Pushbullet
  - Slack