A Python script that recursively downloads files from a webpage and from the pages it links to, usable from the console or by importing the script. Single-page downloading, a page-component filter, and other configuration options are available.
Source:
Python 3 is required to run the script.
Clone the repository, enter the directory, and run the following command to install the script's dependencies:
pip install -r requirements.txt
Binary:
If a binary has been precompiled for your platform, it will be available in the releases section and no further steps are required (the most recent binaries are also available inside the bin folder).
Binaries are generated using Nuitka.
Command:
Run the relevant file with any additional flags (an example invocation is shown after the flag table below):
./recursivescrape[.py/.exe/Linux64] [flags]
python ./recursivescrape.py [flags]
The available flags are:
Flag | Description | Default |
---|---|---|
-h, --help | Show the help page of the program and all available flags | |
-u, --url | URL to start from. REQUIRED | |
-p, --download-path | Directory to download files to. | Current directory |
-c, --cookies | Cookie values (if needed) in JSON format. Example: {"session":"12kmjyu72yberuykd57"} | {} |
--id | Component id that contains the files and the links to follow. By default the whole page is checked. | |
-o, --overwrite | Download and overwrite existing files. If not added, files that already exist will be skipped. | False |
-r, --resume | Resume previous progress from PROGRESS_FILE; ignores the url and no-recursion arguments if the file is found. | False |
-bi, --backup-interval | Save the current progress every BACKUP_INTERVAL pages; 0 disables automatic backups. | 0 |
-f, --progress-file | The file to save and load progress with, relative to the download path. | progress.dat |
-l, --dont-prevent-loops | Save memory by not remembering visited pages, at the cost of possibly checking pages multiple times. Do not use if there are any loops in the directory. Changing this flag between resumed runs results in undefined behaviour. | False |
-nr, --no-recursion | Only download files from the given url; do not follow links recursively. | False |
--concurrent | Maximum number of pages and files to download concurrently. | 10 |
-v, --verbose | Increase output detail. Use -vv for even more detail. | |
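For example (the URL and session cookie below are placeholders), the following invocation downloads files recursively into ./downloads, saves progress every 50 pages, and prints verbose output:
python ./recursivescrape.py -u https://example.com/files/ -p ./downloads -c '{"session":"<your session cookie>"}' -bi 50 -v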
Code:
Place the script in the same folder as your file (or in your Python import path) and import it:
import recursivescrape
Call the scrape function with the same options that are available as command-line flags; only root_url is strictly required (a usage example follows the signature below):
recursivescrape.scrape(
    root_url: str,
    download_path: str = None,
    cookies: dict = {},
    id: str = "",
    overwrite: bool = False,
    resume: bool = False,
    progress_file: str = "progress.dat",
    dont_prevent_loops: bool = True,
    no_recursion: bool = False,
    backup_interval: int = 0,
    verbosity: int = 0,
    concurrent: int = 10)
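For example, a minimal call mirroring the command-line example above (the URL and cookie value are placeholders) might look like this:
import recursivescrape

recursivescrape.scrape(
    "https://example.com/files/",          # placeholder starting URL
    download_path="./downloads",
    cookies={"session": "<your session cookie>"},
    backup_interval=50,                    # save progress every 50 pages
    verbosity=1)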
To generate a binary yourself, run the relevant script from the bin folder:
./generateLinuxBin.sh
.\generateWindowsBin.bat
The script will create a venv, install all the needed packages into it, run the compile command and save the binary in the current folder. The compilation will include a few small downloads depending on the platform.
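For reference, the Linux generate script is roughly equivalent to the following sketch (the exact paths, package list, and Nuitka options used by the actual script are assumptions here):
python3 -m venv venv
. venv/bin/activate                              # create and activate a throwaway build environment
pip install -r requirements.txt nuitka           # install the script's dependencies plus the compiler
python -m nuitka --onefile recursivescrape.py    # compile to a single self-contained binary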
After compilation, run the relevant clean script to remove the unneeded files:
./cleanLinuxBuildFiles.sh
.\cleanWindowsBuildFiles.bat
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to test your changes before opening a pull request.