Low performance #1304
As your workflow involves opening a file, reading a block, and then closing it, followed by another set of open-read-close, streaming is not going to help you here. With streaming, once you close the file the buffered data is purged, and the next open will re-download the file contents. File cache is the best option you have here.

With file cache, make sure the timeout value is set based on how long your application takes to open the file again. If a file times out before then, it will be deleted and the next open will trigger a download again.

If your workflow is to read all the file headers first and then open the files again to process them, then you can benefit from file cache only if all the files are in the local cache, which technically means downloading all the files by opening them first and having enough disk space to accommodate all of them. This is why you see better performance with AzCopy. If for any reason blobfuse runs into re-downloading a file, performance will drop. To take advantage of file cache, your application should open a file, read the header, and close it, then reopen it immediately and start processing, so that the file is not downloaded again when processing starts.
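For reference, a minimal file-cache configuration along these lines might look like the sketch below. The paths and values are placeholders to tune for your workload, and the option names should be verified against the blobfuse2 version you run:

```yaml
# Sketch of a blobfuse2 config using file_cache, with a timeout long
# enough that a file reopened for processing is still cached locally.
components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

file_cache:
  path: /mnt/blobfuse-cache   # placeholder: fast local disk for cached files
  timeout-sec: 600            # longer than the gap between close and reopen
  max-size-mb: 81920          # placeholder: large enough for the working set

azstorage:
  type: block
  account-name: <storage-account>   # placeholder
  container: <container>            # placeholder
  mode: key
  account-key: <account-key>        # placeholder
```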
---
I am testing blobfuse for use in a data pipeline. The datasets consist of 2400 TIFF files, 32 MB each, for a total of 77 GB. There is a job in the pipeline that I am trying to optimize and that I have a very hard time making performant with blobfuse. For each file, in parallel, it first reads file metadata from a header inside the file and then does a big continuous read of a number of rows of the image, corresponding to a chunk 0.1-3 MB in size. Concretely:
1. Reads the header metadata of each file
2. Then, for each file in parallel:
   2.1. Reads other metadata with 1 read request
   2.2. Reads a 0.1-3 MB chunk corresponding to a number of rows in the image with 1 read request
   2.3. Closes the file
   2.4. In the background, does some calculations on the data and creates new small files. This actually tries to hide the latency of the file read, but it is optimized for low latencies.
I'm testing this on a Standard NC64as T4 v3 (64 vCores, 32000 Mbps network bandwidth, 440 GB RAM) but will most likely later run it on a Standard NC16as T4 v3.
The legacy code starts with a Step 0, which downloads the whole dataset with AzCopy, and then gets the following performance:
Step 0: 60 s
Step 1: 2 s
Step 2: 15 s
Now I am trying to minimize the download time, as a large part of the files isn't used (I have a hard time reorganizing the input data, as the files are created one at a time). So I am trying the same with blobfuse, and I am not getting acceptable performance.
I am first trying the stream config, and I get:
Step 1: 21 s
Step 2: 53 s
If I run it again (with the cache populated) I get:
Step 1: 9 s
Step 2: 28 s
It might be that I can optimize the parallelism of Step 1, but what surprises me is that the second run, when all the data is cached in RAM, is significantly slower than the legacy code reading the data from disk.
I then tried blobfuse with the file cache to debug, and the first time I get:
Step 1: 54 s
Step 2: 27 s
If I run it again (with the cache populated) I get:
Step 1: 2 s
Step 2: 27 s
I've tried different combinations of block sizes and max-concurrency but haven't found anything that gives significantly better performance.
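(For reference, the streaming-related knobs in question look roughly like the sketch below; the values are placeholders rather than the exact settings tried, and the option names should be checked against the blobfuse2 docs for the version in use.)

```yaml
# Sketch of streaming tuning knobs (placeholder values).
stream:
  block-size-mb: 8     # size of each block fetched from storage
  buffer-size-mb: 8    # size of each in-memory buffer
  max-buffers: 64      # caps the RAM used for streaming buffers

azstorage:
  max-concurrency: 32  # parallel connections to storage
```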
So with streaming I get good performance for Step 1 the second time, but Step 2 (which reads a lot of data compared to Step 1) still takes a long time compared to reading directly from disk.
So my first question is: why does it take longer to read from the cache (RAM or disk) than directly from disk? Can anything be done about it?
**My second question is: how should I configure the streaming option to be performant when the cache is not yet populated, especially for step 2.2, which reads the big chunk from each file? (I can later optimize away step 1 and step 2.1, which read the metadata.) Right now it is actually not faster than downloading the whole dataset.**
Your decision matrix recommends file cache instead of streaming, but that does not match my use case, as I read only a small part of each file and I want to cache a lot of datasets, which I couldn't do if I were caching all the data. I would also like to try the block cache later on, once you add write support; I see you have a PR regarding that.
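(For reference, the block-cache component exposes knobs along these lines, which would fit a partial-read workload like this one; a sketch with placeholder values, to be verified against the blobfuse2 release in use.)

```yaml
# Sketch of a block_cache setup: only the blocks actually read are
# downloaded and cached (placeholder values).
components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage

block_cache:
  block-size-mb: 4    # cache granularity; partial reads fetch only these blocks
  mem-size-mb: 4096   # RAM devoted to cached blocks
  prefetch: 12        # number of blocks to read ahead
  parallelism: 32     # parallel download threads
```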