`dl` is a program that lets you download big files fast(er) by parallelizing downloads across many threads and cores.
It was written in response to a take-home challenge for a software company, but lives on as an interesting playground for experimenting with network programming in Rust!
Assuming you don't have the repo yet, get it with:

```
git clone https://0xacab.org/aguestuser/dl.git
cd dl
```
You will need `rust` and its build tool, `cargo`, installed to run `dl`. The fastest/simplest way to install them is by running this in a shell:

```
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
```
If `curl`-ing bash scripts from the internet into your shell makes you nervous, you can try one of the alternative installation methods listed on the Rust website.
Whatever method you chose, you can check to make sure everything worked with:

```
rustc --version
cargo --version
```
If you would like to update your toolchain (likely unnecessary), you can run:

```
rustup update
```
Yay! Now that you have `rust`, you can build `dl` with:

```
cd <path/to/this/repo>
cargo build --release
```
To make it easier to run, let's put the build artifact on our `PATH`:

```
sudo ln -s ${PWD}/target/release/dl /usr/bin/dl
```

(If you don't want to do that, that's okay! We can use a slightly longer command later!)
Okay! Time for the main event! Now you can run `dl` with:

```
dl <url_to_download_from> <path_to_save_file_to>
```

(Filling in `<url_to_download_from>` and `<path_to_save_file_to>` with your own values.)
For example, to download this ~20 MB copy of Designing Data-Intensive Applications (disguised as my resume on RC's S3 bucket), you could run:

```
dl https://recurse-uploads-production.s3.amazonaws.com/dc12e4d0-3c82-45b8-9cb7-6c64a8f50cfb/austin_guest_resume.pdf data/book.pdf
```
If you didn't symlink the build artifact above, you could run:

```
./target/release/dl <url> <path>
```
In the above we used release builds because they are faster, and this is a challenge! However, if we wanted to hack on the project, we'd want a faster build/run cycle than the release flag and a separate binary invocation allow.
Thankfully, we can build and run in one step with a simple:

```
cargo run
```
We can build and run the tests (frequently, right?) with:

```
cargo test
```
If you want to check out the (nifty, autogenerated!) docs, you can always run:

```
cargo doc --open
```

And if you want to just re-compile, you can run:

```
cargo build
```
If you want to find your way around the code, starting in `main`, then looking at `lib` and the modules it calls, is a good way in.
The main action happens in `lib::run`, which is structured as a pipeline of futures, each of which consumes and produces a struct that encapsulates the current state of the program.
More specifically, program flow is modeled as a transition between the following structs:

```
Config -> MetadataDownloader -> FileDownloader -> HashChecker
```

This encoding of application state as transitions between structs is a Rust convention which, in addition to clarifying what phase of execution the app is in, does some memory management work for us under the hood.
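To give a feel for that shape, here is a minimal, self-contained sketch using futures 0.1 combinators. The struct bodies and intermediate method names are hypothetical stand-ins, not the repo's actual API:

```rust
use futures::future::{self, Future};

// Hypothetical, fieldless stand-ins for the four states of the program.
struct Config;
struct MetadataDownloader;
struct FileDownloader;
struct HashChecker;

// Each stage consumes the previous struct and yields the next, so the chain
// of combinators mirrors Config -> MetadataDownloader -> FileDownloader -> HashChecker.
fn run(config: Config) -> impl Future<Item = bool, Error = ()> {
    future::ok::<Config, ()>(config)
        .map(|_config| MetadataDownloader)          // parse / validate config
        .and_then(|_md| future::ok(FileDownloader)) // HEAD request: discover size + etag
        .and_then(|_fd| future::ok(HashChecker))    // download the file in parallel chunks
        .and_then(|_hc| future::ok(true))           // hash the file, compare to the etag
}

fn main() {
    let hash_matched = run(Config).wait().unwrap();
    println!("download verified: {}", hash_matched);
}
```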
As should be evident from the above, the flow of the program is to:

- accept configs from the user
- make a `HEAD` request to the user-supplied url to see if it supports range requests, and if so, retrieve the length of the file and (if it exists) its etag
- download the file in several parallel chunks
- take the hash of the downloaded file to see if it matches the advertised etag
Most of the heavy lifting happens in `dl::file::FileDownloader::fetch`. This function:

- creates a blank placeholder file into which chunks will be written as soon as they come off the wire
- calculates optimally-sized chunks in which to download the file (where "optimal" is `file_size / P`; `P` = degree of parallelism in the app) and generates a stream of chunk-sized offsets to use in range requests
- issues parallel range requests for the chunk at each offset (where each request is represented as a future and the set of all requests is represented as a stream of futures, derived from our original stream of offsets)
- reads the bytes of each response into a buffer, whose contents are written to the placeholder file after seeking to the correct offset (note: all writes are performed in parallel via stream composition)
- collects the stream of parallel futures described above into a single future (via chained calls to `buffer_unordered` and `collect`), which resolves successfully if all requests resolve successfully and fails if any request fails (yes: we could be less brittle than that in future iterations!)
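To make the stream composition concrete, here is a self-contained sketch of that shape (again with futures 0.1 combinators), in which trivially-succeeding futures stand in for the real "range request + seek + write" step; the chunk arithmetic mirrors the description above, but none of this is the repo's actual code:

```rust
use futures::{future, stream, Future, Stream};

fn fetch_all(file_size: u64, parallelism: usize) -> impl Future<Item = Vec<u64>, Error = ()> {
    // "optimal" chunk size: file_size / P, rounded up so the last chunk covers the remainder
    let chunk_size = (file_size + parallelism as u64 - 1) / parallelism as u64;
    // one offset per chunk: 0, chunk_size, 2 * chunk_size, ...
    let offsets = (0..parallelism as u64).map(move |i| i * chunk_size);

    stream::iter_ok::<_, ()>(offsets)
        // each offset becomes a future standing in for "range request, seek, write"
        .map(|offset| future::ok::<u64, ()>(offset))
        // run up to `parallelism` of those futures at once, in any order
        .buffer_unordered(parallelism)
        // resolve once every chunk has succeeded; fail fast if any chunk fails
        .collect()
}

fn main() {
    let offsets = fetch_all(500 * 1024 * 1024, 12).wait().unwrap();
    println!("downloaded {} chunks", offsets.len());
}
```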
A brief note on types:
Each call to `download_piece` returns a value that we represent as a `Future<u64, DlError>`. The `u64` represents the offset of the downloaded chunk -- in case we wanted to track the status of each request for retries, which we don't currently do.
The `DlError` is a custom error type that serves as an enumerable set of all possible errors we might see in the program's execution. Constructing this type involves a lot of boilerplate (see `dl::errors`). BUT... having this type around turns out to do a lot of work. For starters, it's necessary to make the "types line up" so that all of our variously-typed futures return an error component of the same type, which allows us to use (monadic) `map`, `map_err`, and `and_then` combinators to manage control flow.
It also makes error handling in the main loop a breeze, since we have a strong guarantee that our main loop will only ever produce (future) errors of a certain type and that execution will short-circuit as soon as one is encountered. Thus, we can just call our main function and `map_err` over its results to print the correct error at the correct time without worrying too much about it.
If you see a `Box<Future>` and wonder what's going on: its main purpose is to make sure the compiler has enough information about what kind of future we're returning when we compose futures of slightly varying types.
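For a feel for what such a type looks like, here is a heavily simplified sketch; the real `dl::errors` module has more variants and more boilerplate, and the variant names here are illustrative guesses:

```rust
use std::{fmt, io};

// One enum covering every failure the pipeline can produce, so all futures
// can share a single error type.
#[derive(Debug)]
enum DlError {
    Io(io::Error),
    Http(String), // stand-in for the HTTP client's error type
    ChecksumMismatch,
}

impl fmt::Display for DlError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            DlError::Io(e) => write!(f, "io error: {}", e),
            DlError::Http(e) => write!(f, "http error: {}", e),
            DlError::ChecksumMismatch => write!(f, "downloaded file did not match the advertised etag"),
        }
    }
}

// `From` impls are the boilerplate that lets `map_err(DlError::from)` (or `?`)
// convert each stage's native error into the shared type.
impl From<io::Error> for DlError {
    fn from(e: io::Error) -> Self {
        DlError::Io(e)
    }
}

fn main() {
    let io_err: DlError = io::Error::new(io::ErrorKind::Other, "disk full").into();
    println!("{}", io_err);
    println!("{}", DlError::Http("503 Service Unavailable".to_string()));
    println!("{}", DlError::ChecksumMismatch);
}
```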
A note on parallelism:
There are several strategies I could have used to leverage parallelism in this application. I chose the simplest one I could think of:
- Discover the number of logical cores on my machine (in my case 12). Call this number `P`, and assume it is the optimal level of parallelism for the application.[1]
- Divide the file into `P` pieces and download them in a stream of `P` parallel tasks (i.e., "futures") using an HTTPS client configured with a thread pool of size `P`.
- When all tasks have completed, collect their results using the built-in stream combinators provided by the tokio framework.
I could alternatively have spawned `P` threads or `P` tasks and had them send their results over a channel to an aggregator thread or task (which I may try out in further experimentation), but for current purposes, this seemed the most straightforward approach.
[1] See below for benchmarking experiments to test this assumption.
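A minimal sketch of that arithmetic, assuming the `num_cpus` crate is used to discover the core count (the function name is illustrative, not the repo's API):

```rust
// Derive the parallelism level P and the chunk size described above.
fn chunk_plan(file_size: u64) -> (usize, u64) {
    // number of logical cores on the machine
    let p = num_cpus::get();
    // file_size / P, rounded up so the final chunk covers any remainder
    let chunk_size = (file_size + p as u64 - 1) / p as u64;
    (p, chunk_size)
}

fn main() {
    let (p, chunk_size) = chunk_plan(500 * 1024 * 1024);
    println!("{} chunks of ~{} bytes each", p, chunk_size);
}
```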
To make sure the program was successfully leveraging parallelism to increase performance (and validate my assumption of the optimal level of parallelization), I performed some elementary profiling with various levels of parallelism.
The benchmarks from these profiling experiments live with the repo. You can view the reports with (for example):

```
cd <this repo>
firefox target/criterion/report/index.html
```
I performed the benchmarks on a Thinkpad with 12 logical (6 physical) cores with internet speeds of ~850Mbps up / 930Mbps down. For each trial, I downloaded files of varying sizes (~50 KB, ~25 MB, and ~500 MB) with varying levels of parallelism (1, 6, 12, 24, 48) -- running 20 trials per permutation.
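The report path above comes from criterion; a benchmark over those parallelism levels might be structured roughly like the sketch below (with a stub in place of the real download call; this is not the repo's actual benchmark code):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_parallelism_levels(c: &mut Criterion) {
    // one benchmark per level of parallelism tried in the experiments
    for &p in &[1usize, 6, 12, 24, 48] {
        c.bench_function(&format!("download ~25 MB, parallelism = {}", p), move |b| {
            b.iter(|| {
                // stand-in for: run the downloader with parallelism level `p`
                black_box(p)
            })
        });
    }
}

criterion_group!(benches, bench_parallelism_levels);
criterion_main!(benches);
```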
The benchmarks demonstrated that:
- Increasing parallelism did not yield measurable performance gains for files on the order of KB (presumably b/c the overhead of context switching dominates any gains from parallelism given the small amount of data to be downloaded and the speed of the network on which I was testing).
- Gains from parallelism began accruing for files on the order of MB and increased with file size thereafter -- producing speedups of 2x and 3x for files of size ~25 MB and ~500 MB, respectively.
- At sizes large enough for parallelism to increase performance, setting the parallelism level `P` to the number of cores on the machine (in my case 12) was a more or less optimal choice.[1]
[1] More specifically: for files of size ~25 MB, download times decreased as `P` increased from 1 to 12, then increased again for values of `P` higher than 12. For files of size ~500 MB, download times decreased hyperbolically as `P` increased, with the knee of the curve appearing somewhere before `P = 12`. Since in the first case 12 was the optimal choice, and in the latter case download times only decreased asymptotically after reaching 12, setting `P = 12` seems an optimal choice in a situation in which we must choose a single value of `P` for all file sizes and network speeds.
As noted above, my solution has two major flaws:

- It does not retry failed chunk requests, and suffers halting errors when they happen

Additionally:

- It relies on discovering the size of the file in advance (to calculate chunk sizes) but has no fallback measures in place for discovering that information in the case that the target server does not supply it in the response to a `HEAD` request
If I had more time to work on this project, I'd want to solve the above problems in roughly that order. Here are some preliminary thoughts on how I'd do so:
To handle failed requests, we'd first want to track the status of all requests. We'd need to store this data in some sort of thread-safe data structure that a bunch of parallel threads could read from and write to without producing contention or draining resources acquiring locks, etc. My data structure of choice would be some sort of concurrent hash map. It looks like Rust has a decent implementation in `chashmap`.
When `FileDownloader::fetch` was called, I'd pass it an empty `chashmap`. As requests or writes fail and/or succeed, I'd record that fact in an entry in the `chashmap`. As a first approximation, let's say the keys of the map would be the chunk offsets, and the values would be an enum with possible values `Success`, `FailedRequest`, and `FailedWrite`. If either a request or a write failed, one of the futures in the stream being collected into one future would fail, meaning that the entire future would fail. I'd `map_err` over this future and recursively call `fetch` again, feeding it the (now non-empty) version of the hash map.
On subsequent calls, `fetch` should omit offsets whose keys map to `Success` values from the stream of offsets it generates for range requests. (Which means that on the initial call, it should also consult the empty map to help it construct a full set of offsets.)
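Here is a rough sketch of that bookkeeping, assuming the `chashmap` crate; the enum and helper function are hypothetical, not code that exists in the repo:

```rust
use chashmap::CHashMap;

#[derive(Clone, Copy, PartialEq)]
enum ChunkStatus {
    Success,
    FailedRequest,
    FailedWrite,
}

/// Return the offsets that still need to be fetched, i.e. every offset
/// that is not yet marked `Success` in the shared map.
fn pending_offsets(all_offsets: &[u64], statuses: &CHashMap<u64, ChunkStatus>) -> Vec<u64> {
    all_offsets
        .iter()
        .cloned()
        .filter(|offset| {
            statuses
                .get(offset)
                .map(|status| *status != ChunkStatus::Success)
                .unwrap_or(true) // never attempted: still pending
        })
        .collect()
}

fn main() {
    let statuses = CHashMap::new();
    statuses.insert(0, ChunkStatus::Success);
    statuses.insert(4096, ChunkStatus::FailedRequest);
    statuses.insert(8192, ChunkStatus::FailedWrite);
    // only the failed (or never-attempted) offsets remain for the retry pass
    println!("{:?}", pending_offsets(&[0, 4096, 8192, 12288], &statuses));
}
```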
To add further resilience, we could consider the case that execution might terminate in the middle of downloading or writing. To recover from this, we could serialize our `chashmap` on some sane interval and write it to disk, and always make an attempt to build the initial `chashmap` that we pass into `fetch` by deserializing whatever we do (or don't) have stored on disk.
I left a seam for introducing a functional take on a "strategy pattern" solution to this problem in `MetadataDownloader::fetch`. The idea would be that we call a pipeline of composed functions to try to discover the file size. If any of them finds the answer, we return early. If all of them fail, we return an error, causing the caller's execution to short-circuit. (A rough sketch of that pipeline follows the list below.)
I can think of 3 strategies to try (in increasing order of cleverness):
1. Ask for the length in a `HEAD` request (what we currently do). If that fails...
2. Issue a range request for bytes 0-1 of the file. Inspect the `Content-Length` and/or `Content-Range` headers and extract the size of the file from the header values, if present. If that fails...
3. Perform a binary-search-like search for the last byte; when you find it, return its position as the file size
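A rough sketch of that early-returning pipeline, with all three strategy functions as hypothetical stubs (shown synchronously for simplicity rather than as futures):

```rust
fn head_content_length(_url: &str) -> Option<u64> {
    None // strategy 1: parse Content-Length from a HEAD response
}

fn range_probe_content_length(_url: &str) -> Option<u64> {
    None // strategy 2: range request for bytes 0-1, parse Content-Range / Content-Length
}

fn binary_search_for_size(_url: &str) -> Option<u64> {
    Some(20_000_000) // strategy 3: probe for the last byte (sketched further below)
}

// Try each strategy in order, returning early on the first success.
fn discover_file_size(url: &str) -> Option<u64> {
    head_content_length(url)
        .or_else(|| range_probe_content_length(url))
        .or_else(|| binary_search_for_size(url))
}

fn main() {
    match discover_file_size("https://example.com/file") {
        Some(size) => println!("file size: {} bytes", size),
        None => eprintln!("could not discover file size"),
    }
}
```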
As (3) is the only non-trivial solution, its implementation bears some discussion. I'd imagine something like the following:
- write a recursive function that requests a one-byte chunk of the file at index `2^N`, with `N` starting at 0; if you get data in response, increase `N` by 1, and repeat until you don't see a response
- remember the first `N` at which you didn't get a response and the last `N` at which you did; call the former `upper` and the latter `lower`
- write a recursive function that implements binary search between `upper` and `lower`:
  - query the server for data at an index halfway between `upper` and `lower`
  - if you find data at that index, set `lower` to that index and recurse
  - if you don't find data at that index, set `upper` to that index and recurse
  - terminate the recursion when `lower` is one less than `upper`, and return `lower`
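Here is a self-contained sketch of that search, with a hard-coded `byte_exists` stub standing in for the real one-byte range requests so the example runs without a network:

```rust
/// Stub: "does the server have a byte at this index?" In the real program this
/// would issue a one-byte range request and inspect the response status.
fn byte_exists(index: u64) -> bool {
    const FAKE_FILE_SIZE: u64 = 20_000_000;
    index < FAKE_FILE_SIZE
}

/// Phase 1 (recursive): double the probe index (1, 2, 4, 8, ...) until it falls
/// past the end of the file, returning (last index seen, first index missed).
fn find_bounds(lower: u64, upper: u64) -> (u64, u64) {
    if byte_exists(upper) {
        find_bounds(upper, upper * 2)
    } else {
        (lower, upper)
    }
}

/// Phase 2 (recursive): binary search between the bounds; terminates when
/// `lower` is one less than `upper` and returns `lower`.
fn search(lower: u64, upper: u64) -> u64 {
    if upper - lower <= 1 {
        return lower;
    }
    let mid = lower + (upper - lower) / 2;
    if byte_exists(mid) {
        search(mid, upper)
    } else {
        search(lower, mid)
    }
}

fn main() {
    let (lower, upper) = find_bounds(0, 1);
    let last_byte_index = search(lower, upper);
    println!("discovered file size: {} bytes", last_byte_index + 1);
}
```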
This should find the size of the file in roughly `O(log N)` requests, where `N` is the size of the file in bytes.