This is a Python module that parses an input file mapping each filename to its download URL and downloads from those URLs in parallel. It is particularly useful when downloading from many URLs; for a single URL you would benefit more from a sequential downloader. The downloader also resumes from the last breakpoint if the program halts or is interrupted.
This project was intended to download multiple (>1000) large files (>10GB) reliably.
- Prepare your input file list. The first line (index 0) is ignored during parsing, so make sure there is at least one header line before your filenames and URLs. (Refer to the example below.)
Note: The output folder is named after the input file; you do not need to create it manually.
- Open `main.py` and edit the `filepath` variable under the `__main__` block to point to the input file you created in step 1.
- Save `main.py` and run `python main.py` from your command line.
- Observe that the output folder has been created and begins to fill with multiple files at a time.
Note: Mock inputs of varying sizes are provided inside the `./input` folder (`/input/tmp` has 7 files of 10GB each).
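
A minimal sketch of an input file, assuming a comma-separated `filename,url` layout (the actual delimiter depends on how you implement the parser); the first line is only a header and is skipped:

```
filename,url
dataset_part_01.tar.gz,https://example.com/files/dataset_part_01.tar.gz
dataset_part_02.tar.gz,https://example.com/files/dataset_part_02.tar.gz
```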
The parser implementation is trivial and is not covered in this section, as you are free to change how the URLs are parsed.
The built-in Python module `concurrent.futures` is used, and its `ThreadPoolExecutor` automatically determines the number of worker threads to spawn for the workload. The default is system dependent: older Python versions use five times the number of CPU cores, while Python 3.8+ caps it at `min(32, os.cpu_count() + 4)`. Each worker thread handles the simultaneous downloading and unpacking of one URL into its target file.
Note: Refer to the Python documentation on `ThreadPoolExecutor` to learn how to use it.
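
A minimal sketch of how the thread pool can drive the downloads, assuming the parser yields a `{filename: url}` dict; `fetch` and `download_all` are hypothetical names, and `fetch` is deliberately simplified (the real routine also resumes partial files and uses `shutil.copyfileobj`, as covered below):

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(item, out_dir="output"):
    # Simplified stand-in for the real per-file download routine.
    name, url = item
    os.makedirs(out_dir, exist_ok=True)
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(os.path.join(out_dir, name), "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return name

def download_all(file_map, out_dir="output"):
    # Leaving max_workers unset lets ThreadPoolExecutor pick its
    # system-dependent default.
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(fetch, item, out_dir): item[0]
                   for item in file_map.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                print(f"done: {name}")
            except Exception as exc:
                print(f"failed: {name} ({exc})")
```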
You will notice a function that checks whether the current file size in your respective `/output/<dir_name>` matches the `Content-Length` of the HTTP download request. If the file has been partially downloaded, the download resumes from a byte range starting at the current size.
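
That check can be sketched roughly as follows; `resume_state` is a hypothetical helper name, and the actual function in `main.py` may fold this directly into its download loop:

```python
import os

import requests

def resume_state(url, dest_path):
    # Bytes already on disk from an earlier, interrupted download.
    existing = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0

    # Server-reported total size of the file.
    head = requests.head(url, allow_redirects=True, timeout=30)
    head.raise_for_status()
    total = int(head.headers.get("Content-Length", 0))

    if total and existing >= total:
        return None, None                       # file is already complete
    if existing:
        # Resume: ask the server to start from the current size
        # (requires Range support on the server side).
        return {"Range": f"bytes={existing}-"}, "ab"
    return {}, "wb"                             # start from scratch
```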
`raise_for_status()` is called so that HTTP errors are caught early instead of silently saving a corrupt file, while streaming each file over a single persistent (keep-alive) connection prevents an excessive TCP "3-way handshake" for every chunk.
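
Under that reading, the request setup might look like the sketch below: one streaming GET keeps a single connection open for the whole file while `raise_for_status()` rejects error responses up front (the helper name is hypothetical):

```python
import requests

def open_stream(url, headers=None):
    # A single streaming GET per file: chunks are read from one keep-alive
    # connection instead of issuing (and handshaking) a new request per chunk.
    response = requests.get(url, headers=headers or {}, stream=True, timeout=30)
    response.raise_for_status()  # fail fast on 4xx/5xx instead of saving garbage
    return response
```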
`shutil.copyfileobj` had to be used instead of the conventional `f.write()`, which often resulted in incomplete file extraction/saving because the buffer was not cleared. This was specifically the case when the file size was large (1GB) and multiple concurrent downloads and extractions were happening. `shutil.copyfileobj` copies in fixed-size chunks, so the buffer is actively cleared, preventing memory from running out and eventually crashing the entire program when the buffer is unable to flush.