Low performance #1304
As your workflow involves opening a file, reading a block, and then closing it, followed by another set of open-read-close, streaming is not going to help you here. With streaming, once you close the file the buffered data is purged, and the next open will re-download the file contents. File cache is the best option you have here.

With file cache, make sure the timeout value is set based on how long your application takes to open the file again. If a file times out before then, it will be deleted and the next open will trigger a download again.

If your workflow is to read all the file headers first and then open the files again to process them, then you can benefit from file cache only if all the files are in the local cache, which technically means downloading all the files by opening them first and having enough disk space to accommodate all of them. This is why you see better performance with AzCopy. If for any reason blobfuse runs into re-downloading a file, performance will drop. To take advantage of file cache, your application should open a file, read the header, and close it, then reopen it immediately and start processing, so that the file is not downloaded again when processing starts.
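For reference, a minimal file-cache configuration along these lines might look like the sketch below. The paths and values are placeholders to tune for your workload, and the option names should be verified against the blobfuse2 version you run:

```yaml
# Sketch of a blobfuse2 config using file_cache, with a timeout long
# enough that a file reopened for processing is still cached locally.
components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

file_cache:
  path: /mnt/blobfuse-cache   # placeholder: fast local disk for cached files
  timeout-sec: 600            # longer than the gap between close and reopen
  max-size-mb: 81920          # placeholder: large enough for the working set

azstorage:
  type: block
  account-name: <storage-account>   # placeholder
  container: <container>            # placeholder
  mode: key
  account-key: <account-key>        # placeholder
```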
---
I am testing blobfuse for use in a data pipeline. The datasets consist of 2400 TIFF files, 32 MB each, for a total of 77 GB. There is a job in the pipeline that I am trying to optimize and that I have a very hard time making performant with blobfuse. For each file, in parallel, it first reads file metadata from a header inside the file and then does a big continuous read of a number of rows of the image, corresponding to a chunk 0.1-3 MB in size. Concretely:
1. Reads the header metadata of each file
2. Then, for each file in parallel:
   2.1. Reads other metadata with 1 read request
   2.2. Reads a 0.1-3 MB chunk corresponding to a number of rows in the image with 1 read request
   2.3. Closes the file
   2.4. In the background, does some calculations on the data and creates new small files. This actually tries to hide the latency of the file read, but it is optimized for low latencies.
I'm testing this on a Standard NC64as T4 v3 (64 vCores, 32000 Mbps network bandwidth, 440 GB RAM) but will most likely later run it on a Standard NC16as T4 v3.
The legacy code starts with a Step 0, which downloads the whole dataset with AzCopy, and then gets the following performance:
Step 0: 60 s
Step 1: 2 s
Step 2: 15 s
Now I am trying to minimize the download time, as a large part of the files isn't used (I have a hard time reorganizing the input data, as the files are created one at a time). So I am trying the same with blobfuse, and I am not getting acceptable performance.
I am first trying the stream config, and I get:
Step 1: 21 s
Step 2: 53 s
If I run it again (with the cache populated) I get:
Step 1: 9 s
Step 2: 28 s
It might be that I can optimize the parallelism of Step 1, but what surprises me is that the second run, when all the data is cached in RAM, is significantly slower than the legacy code reading the data from disk.
I then tried blobfuse with the file cache to debug, and the first time I get:
Step 1: 54 s
Step 2: 27 s
If I run it again (with the cache populated) I get:
Step 1: 2 s
Step 2: 27 s
I've tried different combinations of block sizes and max-concurrency but haven't found anything that gives significantly better performance.
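(For reference, the streaming-related knobs in question look roughly like the sketch below; the values are placeholders rather than the exact settings tried, and the option names should be checked against the blobfuse2 docs for the version in use.)

```yaml
# Sketch of streaming tuning knobs (placeholder values).
stream:
  block-size-mb: 8     # size of each block fetched from storage
  buffer-size-mb: 8    # size of each in-memory buffer
  max-buffers: 64      # caps the RAM used for streaming buffers

azstorage:
  max-concurrency: 32  # parallel connections to storage
```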
So with streaming I get good performance for Step 1 the second time, but Step 2 (which reads a lot of data compared to Step 1) still takes a long time compared to reading directly from disk.
So my first question is: why does it take longer to read from the cache (RAM or disk) than directly from disk? Can anything be done about it?
**My second question is: how should I configure the streaming option to be performant when the cache is not yet populated, especially for step 2.2, which reads the big chunk from each file? (I can later optimize away step 1 and step 2.1, which read the metadata.) Right now it is actually not faster than downloading the whole dataset.**
Your decision matrix recommends file cache instead of streaming, but that does not match my use case, as I read only a small part of each file and I want to cache a lot of datasets, which I couldn't do if I were caching all the data. I would also like to try the block cache later on, once you add write support; I see you have a PR regarding that.
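(For reference, the block-cache component exposes knobs along these lines, which would fit a partial-read workload like this one; a sketch with placeholder values, to be verified against the blobfuse2 release in use.)

```yaml
# Sketch of a block_cache setup: only the blocks actually read are
# downloaded and cached (placeholder values).
components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage

block_cache:
  block-size-mb: 4    # cache granularity; partial reads fetch only these blocks
  mem-size-mb: 4096   # RAM devoted to cached blocks
  prefetch: 12        # number of blocks to read ahead
  parallelism: 32     # parallel download threads
```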