Multiprocessing HTTP Download (Python)

This is a Python module that parses an input file mapping each filename to its download URL (`filename : url`) and downloads from the listed URLs in parallel. It is most useful when downloading from many URLs at once; for a single URL a sequential downloader is the better choice. If the program halts or is interrupted, the downloader resumes each file from the point where it left off.

This project is intended to reliably download a large number (>1000) of large files (>10 GB each).

1 How to use

  1. Prepare your input file list. The first line (index 0) is ignored during parsing, so make sure there is at least one header line before your filenames and URLs (see the example input list after these steps).

Note: The output folder is named after the input file; you do not need to create it manually.

  2. Open main.py and edit the filepath variable under the `__main__` block to point to the input file you created in step 1.

  3. Save main.py and run `python main.py` from the command line.

  4. Observe that the output folder has been created and begins to fill with several files at a time.

Note: Mock inputs of varying sizes are provided inside the ./input folder (/input/tmp has 7 files of 10 GB each).
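For illustration, an input list might look like the following. The filenames and URLs here are placeholders, and the exact `filename : url` separator should match whatever the parser in main.py expects:

```
this first line is a header and is ignored by the parser
part_001.tar.gz : https://example.com/data/part_001.tar.gz
part_002.tar.gz : https://example.com/data/part_002.tar.gz
part_003.tar.gz : https://example.com/data/part_003.tar.gz
```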

2 How it works

The parser implementation is trivial and will not be covered in detail in this section, as you are free to change how the list of URLs is parsed.
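For reference, one minimal way to parse such a list could look like this; the function name and the `" : "` separator are assumptions for illustration, not the exact implementation in main.py:

```python
from pathlib import Path

def parse_input_list(filepath):
    """Map each filename to its URL from a `filename : url` list file."""
    pairs = {}
    lines = Path(filepath).read_text().splitlines()
    for line in lines[1:]:  # the first line (index 0) is a header and is skipped
        if " : " not in line:
            continue  # skip blank or malformed lines
        filename, url = line.split(" : ", 1)
        pairs[filename.strip()] = url.strip()
    return pairs
```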

Multiprocessing

The built-in Python module concurrent.futures is used, which automatically determines the number of worker threads to spawn for the workload. The default is system and version dependent: Python 3.7 and earlier use five times the number of CPU cores, while Python 3.8+ caps the pool at min(32, cores + 4). Each worker thread handles the simultaneous downloading and unpacking of one URL into its target file.

Note: Refer to the Python documentation for concurrent.futures.ThreadPoolExecutor to learn how to use it.
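As a rough illustration of that pattern, a worker pool can be created and one download task submitted per file. This is a minimal sketch, not the exact code in main.py; `download_one` stands for the per-file download function (a sketch of it appears under "Downloading and writing to file" below), and `pairs` is the filename-to-URL mapping produced by the parser.

```python
import concurrent.futures

def download_all(pairs, output_dir):
    """Submit one download task per (filename, url) pair and wait for them all."""
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # download_one is the per-file worker (see the sketch further below);
        # output_dir is a pathlib.Path pointing at the output folder.
        futures = {
            executor.submit(download_one, url, output_dir / name): name
            for name, url in pairs.items()
        }
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
                print(f"finished {name}")
            except Exception as exc:
                print(f"{name} failed: {exc}")
```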

Redownloader

The redownloader checks whether the current size of the file in the corresponding /output/<dir_name> matches the Content-Length of the HTTP download request. If the file has only been partially downloaded, the download resumes from a byte range starting at the current file size.
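A minimal sketch of that check might look like the following. The function name and the use of a HEAD request are assumptions made for brevity here; the repository's own code inspects the content length of the download request itself.

```python
import os
import requests

def resume_offset(local_path, url):
    """Return the byte offset to resume from, or None if the file is complete."""
    if not os.path.exists(local_path):
        return 0  # nothing downloaded yet, start from the beginning
    local_size = os.path.getsize(local_path)
    response = requests.head(url, allow_redirects=True, timeout=30)
    remote_size = int(response.headers.get("Content-Length", 0))
    if remote_size and local_size >= remote_size:
        return None  # already fully downloaded, skip this file
    return local_size  # partially downloaded, resume from the current size
```

The returned offset can then be sent as an HTTP Range header (`Range: bytes=<offset>-`) so the server resumes the transfer from that byte onward.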

Downloading and writing to file

The HTTP response is streamed over a single persistent connection, which avoids repeating the TCP "3-way handshake" for every chunk, and raise_for_status() is called on the response so that HTTP errors are surfaced immediately.

shutil.copyfileobj had to be used instead of the conventional f.write(), which often resulted in incomplete file extraction/saving because the buffer was not cleared. This happened specifically when the file size was large (around 1 GB) and multiple concurrent downloads and extractions were running. copyfileobj copies the response in bounded chunks, so the buffer is actively flushed and the program does not run out of memory and crash when the buffer cannot be cleared.
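Putting the pieces together, a hedged sketch of the download-and-write step could look like this. The chunk size, timeout, and Range-based resume are illustrative choices, not necessarily the exact values used in main.py:

```python
import shutil
import requests

CHUNK_SIZE = 1024 * 1024  # 1 MiB copy buffer; an assumed value, tune as needed

def download_one(url, dest_path, offset=0):
    """Stream a URL to disk, appending from `offset` bytes if resuming."""
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as response:
        response.raise_for_status()  # fail fast on 4xx/5xx instead of saving an error page
        mode = "ab" if offset else "wb"
        with open(dest_path, mode) as f:
            # copy the raw response body in bounded chunks so the whole
            # file is never held in memory at once
            shutil.copyfileobj(response.raw, f, length=CHUNK_SIZE)
```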
