Problems with multithreaded reading/writing via block_cache #1426

Closed · olegshv opened this issue Jun 6, 2024 · 34 comments · Fixed by #1503

@olegshv commented Jun 6, 2024

Which version of blobfuse was used?

blobfuse2 version 2.3.0

Which OS distribution and version are you using?

Debian GNU/Linux 11 (bullseye)

If relevant, please share your mount command.

blobfuse2 mount /opt/blobfuse/fuse_container/ --config-file=/opt/blobfuse/fuse_connection.yaml

My fuse_connection.yaml file

[screenshot]

What was the issue encountered?

A brief description of what we do with blobfuse2:

  • We use it to make DB backups directly to the Storage Account and restore from it. Usually we run backups in 10-15 parallel threads (we would like more in the future, but we need to understand how to scale). Database sizes range from 200 MB to 4 TB.
  • We currently use an Azure instance of size Standard L8as v3 (8 vCPUs, 64 GiB memory).
  • Previously we used file_cache, but we hit a critical problem with disk space for the cache, as the total volume of backups running at one time can reach 12-15 TB, and it becomes very unreliable.

We are currently testing the transition to stream or block_cache. After a few days of testing, stream was abandoned altogether, as we never managed a successful backup with it. block_cache gives us hope :)

For example, here is a 3.6 TB database backup running in 5 parallel threads. It's OK.
[screenshots]

But here is an example of a 15-thread job with several small databases and one large one; the load is not high, and it failed.
[screenshot]
There is no load, but we get an error.

  1. From Windows:
    [screenshot]
  2. An error from SQL Server Management Studio: 1117 (The request could not be performed because of an I/O device error.)

Unfortunately, after reading a lot of information and documentation, I did not find an answer on scaling and configuration for heavy loads. We ran tests even with large VMs (32 vCPUs, 256 GB RAM), but it didn't work.
I would like to get some recommendations on how to handle this case in the most reliable way. Which block_cache settings are best for such tasks?

Have you found a mitigation/solution?

No

Please share logs if available.

  1. In log_warning we can see:
    [screenshot]
  2. In log_debug:

RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound

Response contained no body


ERROR CODE: BlockListTooLong
The block list may not contain more than 50,000 blocks.
RequestId:3a61884d-201e-0005-6461-b863c1000000
Time:2024-06-06T22:31:44.8071145Z

@vibhansa-msft vibhansa-msft self-assigned this Jun 7, 2024
@vibhansa-msft vibhansa-msft added this to the V2-2.3.1 milestone Jun 7, 2024
@vibhansa-msft (Member)

How big is the file that you are trying to upload (the .bak file which failed)?
The Azure storage backend supports at most 50K blocks per blob, while each individual block can range from a few bytes to 4 GB in size. Per your config, you have set block-size to 64 MB. You can do the math here: if the file size divided by 64 MB goes beyond 50K, the upload is bound to fail, and the only workaround is to increase the block size until your file fits in 50K blocks.
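For illustration, here is that math as a minimal Go sketch (the helper names and the 3.6 TB example are illustrative, not blobfuse internals):

package main

import "fmt"

const maxBlocks = 50_000 // Azure block blobs allow at most 50,000 committed blocks

// maxFileSize returns the largest file (in bytes) that fits into 50K blocks
// for a given block size in MiB.
func maxFileSize(blockSizeMB int64) int64 {
    return blockSizeMB * 1024 * 1024 * maxBlocks
}

// minBlockSizeMB returns the smallest whole-MiB block size that lets a file
// of the given size fit within the 50K-block limit.
func minBlockSizeMB(fileSize int64) int64 {
    perBlock := (fileSize + maxBlocks - 1) / maxBlocks // ceiling division
    return (perBlock + 1024*1024 - 1) / (1024 * 1024)  // round up to whole MiB
}

func main() {
    const tib = int64(1) << 40
    fmt.Printf("64 MiB blocks cap a file at %.2f TiB\n", float64(maxFileSize(64))/float64(tib))   // ~3.05 TiB
    fmt.Printf("128 MiB blocks cap a file at %.2f TiB\n", float64(maxFileSize(128))/float64(tib)) // ~6.10 TiB
    fmt.Printf("a 3.6 TiB file needs at least %d MiB blocks\n", minBlockSizeMB(36*tib/10))        // 76 MiB
}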

@vibhansa-msft (Member)

Also, you have enabled disk-based caching of blocks. Do you really need this? It is going to consume space on your local disk. If your workflow is to upload the data and then read it back, keep it; but if the job is only to upload the data, you can get rid of disk-based caching completely.

@olegshv (Author) commented Jun 7, 2024

How big is the file that you are trying to upload (the .bak file which failed)? The Azure storage backend supports at most 50K blocks per blob, while each individual block can range from a few bytes to 4 GB in size. Per your config, you have set block-size to 64 MB. You can do the math here: if the file size divided by 64 MB goes beyond 50K, the upload is bound to fail, and the only workaround is to increase the block size until your file fits in 50K blocks.

Hello! Thank you for your reply. This .bak file is about 3.6 TB, but I have already transferred 5 files of 3.6 TB before, and it was successful. Did I understand correctly that I need to change the block-size to more than 64 MB?

Regarding the second question, our process is to make database backups directly to the storage via blobfuse and restore them from there.
We are still testing the best and fastest way to do this, so I would be very grateful for your advice on how best to proceed. Let me remind you that files can be larger than 4 TB.

@vibhansa-msft (Member)

With a 64 MB block size you can upload at most a ~3.05 TiB file (64 MiB × 50,000 blocks). If your file sizes go beyond this, you need to bump up the block-size. You can try 128 MB, which allows you to go beyond 4 TB.
If backup and restore are done on the same box, then the disk cache will help. But considering the size of your files, I believe it's best to disable the disk cache completely (assuming you do not have 4 TB of free disk space to store all of the contents).

@olegshv (Author) commented Jun 7, 2024

With a 64 MB block size you can upload at most a ~3.05 TiB file (64 MiB × 50,000 blocks). If your file sizes go beyond this, you need to bump up the block-size. You can try 128 MB, which allows you to go beyond 4 TB. If backup and restore are done on the same box, then the disk cache will help. But considering the size of your files, I believe it's best to disable the disk cache completely (assuming you do not have 4 TB of free disk space to store all of the contents).

Understood, thank you! Unfortunately, the tests went badly with 128 MB, but I found the article "About block blobs" and now I understand that there is still room for improvement. I will test with a 256+ block size and report back with the results; maybe someone from the community will be interested.

https://learn.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs#about-block-blobs
[screenshot]

Could you please clarify the following questions?

  1. Do I need to add another parameter "block-size-mb:" to the "azstorage:" section in my case?
  2. If I understood and calculated correctly, then in our case, with a block size of 256 MB and simultaneous reading/writing of 15 files, the prefetch parameter should be set to a smaller value, for example 12: 256 × 15 × 12 = 46,080 MB (which is less than our allocated RAM). How exactly does this parameter affect the speed and reliability of data transfer?
  3. And would it be possible to leave the parallelism parameter at the default of 128? If my logic is correct, increasing this parameter simply increases the load on the CPU and potentially increases the data transfer rate over the network interface.

@vibhansa-msft (Member)

  • If you are using block-cache then you don't need to worry about block-size in "azstorage".
  • block-size defines the size of one buffer, and prefetch defines the number of such blocks to be prefetched for each file handle that you open. If your application opens or works on 12 files in parallel, you basically need memory for (block size * prefetch * number of files); see the sketch below. If you do not have that much memory, you need to lower the prefetch count. However, there is a constraint here: our algorithm will not allow you to take it lower than 10 or 12. If block sizes are very big, there will be some impact on performance as well.
  • parallelism shall depend on the number of vCPUs you have. If you have around 40 vCPUs available, then keeping a count of 128 is just fine. (Generally I go with vCPU * 3 as the optimal config.)
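To make the sizing concrete, here is that formula as a tiny Go sketch (numbers taken from this thread; the helper is illustrative, not blobfuse code):

package main

import "fmt"

// blockCacheMemoryMB estimates the RAM block_cache needs with several file
// handles open at once: block-size-mb * prefetch * number of open files.
func blockCacheMemoryMB(blockSizeMB, prefetch, openFiles int) int {
    return blockSizeMB * prefetch * openFiles
}

func main() {
    // 256 MiB blocks, prefetch of 12, 15 files written in parallel.
    need := blockCacheMemoryMB(256, 12, 15)
    fmt.Printf("estimated need: %d MB vs mem-size-mb: %d\n", need, 56320) // 46080 MB, fits
}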

@vibhansa-msft (Member)

Did the above config changes help in uploading the file?

@olegshv (Author) commented Jun 12, 2024

Hello, Vikas!
Over the last few days I conducted many experiments and found a stable configuration for our workloads.

Here is my configuration (used for 15-18 parallel write/read operations with files from 250 MB to 2.7 TB):

block_cache:
  block-size-mb: 256
  mem-size-mb: 56320
  prefetch: 20
  parallelism: 128 

Here is a graph visualising the test with these settings:
[screenshot]

I also ran into a strange problem: my jobs failed with the error "The block list may not contain more than 50,000 blocks."
This is even though, according to the formula block-size-mb × 50,000, I didn't come anywhere near the limit.
I would like to ask: after restarting the service and starting all the backup jobs again, could that block list continue to fill and reach the 50,000-block limit? (Or maybe the lists remained in RAM and simply kept being written to while new ones were created; or, if the file name is the same, does the existing list continue to fill?) Please help me clarify the reason for this.

Without going off topic, I would like to ask whether I understand the architecture correctly.

  • A block blob is a file: the same file that we can see in the container.
  • Each file (block blob), in our case (backing up/restoring), consists of blocks, up to 50,000 of them.
  • Each block (a part of the block blob), when it is moved (uploaded/downloaded), is staged via Put Block, and the blocks are then committed to the block blob via Put Block List.

If the structure I described above is correct, then the question arises again: how do we get the error "The block list may not contain more than 50,000 blocks." when transferring a file that is not even larger than 64 MB × 50,000 = 3.2 TB?
[screenshot]

There is also a separate, very important question. When scaling horizontally (adding more VMs with Blobfuse2), can we hit any limits on data transfer to the same Storage Account? From what I found in the "Storage account limits" article, there are default maximum ingress/egress limits (60 Gbps / 120 Gbps) for a Storage Account. At first glance, I don't see any other critical restrictions. I would be truly grateful for any additional information.

I apologize for the possibly obvious questions, but it's critical for me to understand this topic for proper use.
Thank you in advance!

@vibhansa-msft (Member)

  • Your understanding of blobs, files, and the 50K blocks is correct. Each block blob is a file that you see in the container, and it can have at most 50K blocks.
  • Blocks are staged (uploaded) using Put Block and committed using Put Block List. The Put Block List call consists of the list of block IDs uploaded earlier and must not exceed 50K entries (see the sketch below).
  • There are two cases where you can hit the 50K limit:
    - The Put Block List contains more than 50K blocks.
    - An earlier Put Block List failed for the same or some other reason, and many uncommitted blocks were left in the backend. These blocks are wiped out after a timeout (typically 1-2 days). If during this period you try to upload a file with the same name, the backend adds the new incoming blocks to the existing block list, and if the uncommitted block list (the earlier failed blocks plus the new blocks) grows beyond the 50K mark, write operations will start to fail.
  • When you scale the system horizontally, each instance of blobfuse will hit the same account. The backend storage imposes throughput and IOPS limits on the account, and if you generate too much ingress or egress traffic, these limits are likely to be hit.
  • Also, at such scale you must be mindful that two instances of blobfuse (or even the same instance, for that matter) must not have multiple writers for the same file. Multiple instances trying to upload the same file will lead to data corruption.
  • Your understanding of the storage account bandwidth limits is correct.
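For reference, here is the stage/commit flow sketched with the Azure Blob Storage Go SDK (github.com/Azure/azure-sdk-for-go/sdk/storage/azblob). The account, container, blob name, and chunks are placeholders, and the final GetBlockList call is one way to check for leftover uncommitted blocks after a failed upload:

package main

import (
    "bytes"
    "context"
    "encoding/base64"
    "fmt"
    "log"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore/streaming"
    "github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
    "github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blockblob"
)

func main() {
    ctx := context.Background()
    cred, err := azblob.NewSharedKeyCredential("myaccount", "mykey")
    if err != nil {
        log.Fatal(err)
    }
    svc, err := azblob.NewClientWithSharedKeyCredential("https://myaccount.blob.core.windows.net/", cred, nil)
    if err != nil {
        log.Fatal(err)
    }
    bb := svc.ServiceClient().NewContainerClient("mycontainer").NewBlockBlobClient("big.bak")

    // Stage each block with a unique, equal-length, base64-encoded ID (Put Block).
    var ids []string
    for i, chunk := range [][]byte{[]byte("part-1"), []byte("part-2")} {
        id := base64.StdEncoding.EncodeToString([]byte(fmt.Sprintf("%08d", i)))
        if _, err := bb.StageBlock(ctx, id, streaming.NopCloser(bytes.NewReader(chunk)), nil); err != nil {
            log.Fatal(err)
        }
        ids = append(ids, id)
    }

    // Commit the staged blocks in order (Put Block List); the list must not exceed 50,000 IDs.
    if _, err := bb.CommitBlockList(ctx, ids, nil); err != nil {
        log.Fatal(err)
    }

    // Diagnostic: list committed vs. uncommitted blocks left behind by earlier failures.
    resp, err := bb.GetBlockList(ctx, blockblob.BlockListTypeAll, nil)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("committed=%d uncommitted=%d\n", len(resp.CommittedBlocks), len(resp.UncommittedBlocks))
}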

@olegshv (Author) commented Jun 12, 2024

Thank you for your answers!
Today I set up blobfuse2 on one of the production servers.
This time the behavior of blobfuse2 was interesting: errors occurred, yet the network share continued to work. After a while it dropped. Please tell me what could be the reason and how to fix it?
Blobfuse logs:

Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [azstorage.go (458)]: AzStorage::ReadInBuffer : Failed to read storage/backup/site1-Dump.backup.backup [numerical result out of range]
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (847)]: BlockCache::download : 3 attempts to download a block have failed 378=>storage/backup/site1-Dump.backup.backup (index 0, offset 0)
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (580)]: BlockCache::getBlock : Failed to download block 378=>storage/backup/site1-Dump.backup.backup (offset 0, index 0)
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (493)]: BlockCache::ReadInBuffer : Failed to get Block 378=>storage/backup/site1-Dump.backup.backup offset 0 [failed to download block]
Wed Jun 12 22:54:51 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (764)]: Libfuse::libfuse_read : error reading file storage/backup/site1-Dump.backup.backup, handle: 378 [failed to download block]
Wed Jun 12 22:56:36 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1462)]: BlockBlob::CommitBlocks : Failed to commit block list to blob storage/backup/mssql-site-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidInput</Code><Message>One of the request inputs is not valid.
RequestId:e9f99714-001e-0050-3502-bd194d000000
Time:2024-06-12T19:56:36.1653073Z</Message></Error>
--------------------------------------------------------------------------------
Wed Jun 12 22:56:55 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1462)]: BlockBlob::CommitBlocks : Failed to commit block list to blob storage/backup/mssql-site-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidInput</Code><Message>One of the request inputs is not valid.
RequestId:c077b8a3-001e-0032-5902-bddb6a000000
Time:2024-06-12T19:56:55.9598214Z</Message></Error>
--------------------------------------------------------------------------------

Wed Jun 12 22:57:19 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (439)]: BlockCache::CloseFile : failed to flush file storage/backup/mssql-site-diff.bak
Wed Jun 12 22:57:19 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (867)]: Libfuse::libfuse_release : error closing file storage/backup/mssql-site-diff.bak, handle: 2036 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidInput</Code><Message>One of the request inputs is not valid.
RequestId:bf187da8-301e-0074-4b02-bdefed000000
Time:2024-06-12T19:57:19.7806094Z</Message></Error>
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1462)]: BlockBlob::CommitBlocks : Failed to commit block list to blob Cstorage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (409)]: BlockCache::FlushFile : Failed to commit blocks for storage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (1386)]: BlockCache::SyncFile : failed to flush file storage/backup/mssql-site2-diff.bak
Wed Jun 12 22:57:31 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (1062)]: Libfuse::libfuse_fsync : error syncing file storage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
]
Wed Jun 12 22:57:54 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (815)]: Libfuse::libfuse_flush : error flushing file /storage/backup/mssql-site2-diff.bak, handle: 2066 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------
]
Wed Jun 12 23:00:12 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (1229)]: BlockCache::commitBlocks : Failed to commit blocks for storage/backup/mssql-site2-diff.bak [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------

TeamCity logs from when it happened:

MSSQL error: The operating system returned the error '1117(The request could not be performed because of an I/O device error.)' while attempting 'FlushFileBuffers' on '\\my-blobfuse\backup\mssql-site2-diff.bak'.
22:56:55               BACKUP DATABASE is terminating abnormally.

23:06:27           pg_dump: error: could not write to output file: Input/output error
23:06:31   Error message: pg_dump: error: could not write to output file: Input/output error

My config (VM: 16 vCPUs, 128 GB RAM):

allow-other: true

logging:
    type: base
    level: log_warning
    file-path: '/var/log/blobfuse2/blobfuse2.log'

components:
    - libfuse
    - block_cache
    - attr_cache
    - azstorage

libfuse:
   default-permission: 0777
   attribute-expiration-sec: 240
   entry-expiration-sec: 240
   negative-entry-expiration-sec: 240

attr_cache:
   timeout-sec: 3600

azstorage:
    type: adls
    account-name: *****
    account-key: *****
    mode: key
    container: ******
    tier: cool

block_cache:
  block-size-mb: 256
  mem-size-mb: 112640
  prefetch: 20
  parallelism: 128
More logs with errors:
[screenshot]

I'm not sure, but it might be useful to see the load graphs. After the failed backup, I interrupted the process and repeated it with different configurations.
[screenshot]

UPD: I have added a new debug log; maybe it can help.
blobfuse2-new.zip

@olegshv (Author) commented Jun 17, 2024

Hello!
I've spent all these days testing the problem. Unfortunately, the issues do not disappear with different configurations. To structure everything better, I captured debug logs with a "minimal" configuration, in the hope that this will help you solve the problem.

In this round, three cases were recorded, with block-size-mb of 64, 128 and 256. We ran 8 database backups at the same time.
With block-size-mb 64 and 128, the network share crashed. With block-size-mb 256, the network share kept working, but the MSSQL database backups failed.

The operating system returned the error '1117(The request could not be performed because of an I/O device error.)' while attempting 'FlushFileBuffers' on '\\mystorage\mystorage\Client\BackupFolder\site2\site2-db-20240617-175314-812-full.bak'.
BACKUP DATABASE is terminating abnormally.
Sizes of database backups:
site1 2.3 TB
site2 166 GB
site3 1.3 GB
site4 24 GB
site5 642 MB
site6 246 GB
site7 374 MB
site8 236 MB
site9 253 GB

My config (Standard E8d v5 (8 vcpus, 64 GiB memory)):

logging:
    type: base
    level: log_debug
    file-path: '/var/log/blobfuse2/blobfuse2.log'

allow-other: true

azstorage:
    type: adls
    account-name: ******
    account-key: ******
    mode: key
    container: ******
    tier: cool
    max-retries: 10

components:
    - libfuse
    - block_cache
    - attr_cache
    - azstorage

health_monitor:
    enable-monitoring: true
    stats-poll-interval-sec: 30
    output-path: /var/log/blobfuse2

monitor-disable-list:
    - memory_profiler
    - cpu_profiler
    - file_cache_monitor

libfuse:
  attribute-expiration-sec: 240
  entry-expiration-sec: 240
  negative-entry-expiration-sec: 240

block_cache:
  block-size-mb: 64 / 128 / 256 (one value per test run)
  mem-size-mb: 56320
  prefetch: 12

attr_cache:
  timeout-sec: 7200

I am attaching the logs of the three scenarios.
If you need other scenarios tested, we are ready to help in any way we can. Thank you!

  1. blobfuse2-test (64-12).zip
  2. blobfuse2-test (128-12).zip
  3. blobfuse2-test (256-12)2.zip

@vibhansa-msft (Member)

Wed Jun 12 22:57:54 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (815)]: Libfuse::libfuse_flush : error flushing file /storage/backup/mssql-site2-diff.bak, handle: 2066 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------

This is not due to a large block size or 50K-block exhaustion. It sounds like something different; the backend may be failing for some other reason. Are there any storage account metrics pointing to a specific error?

@vibhansa-msft (Member)

Is this the only error you are hitting, or is there something else? Due to the long messages I might have overlooked some part, so I just want to understand the real issue you are facing here.

@vibhansa-msft (Member)

Any updates on this issue?

@olegshv (Author) commented Jun 26, 2024

Hello! I apologise for the long silence; I was on holiday. Unfortunately, we didn't have Storage Account logging set up.

Today I created a Log Analytics workspace and ran backups for a new test with the configuration (block-size-mb: 128, mem-size-mb: 56320, prefetch: 12).
We see the following picture:
[screenshots]

Please tell me what kind of logs you need for analysis.

As for your question about the types of errors in the logs:

Wed Jun 12 22:57:54 EEST 2024 : blobfuse2[1215] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (815)]: Libfuse::libfuse_flush : error flushing file /storage/backup/mssql-site2-diff.bak, handle: 2066 [PUT https://backupStorage.blob.core.windows.net/storage/backup/mssql-site2-diff.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 One of the request inputs is not valid.
ERROR CODE: InvalidInput
--------------------------------------------------------------------------------

This is not due to a large block size or 50K-block exhaustion. It sounds like something different; the backend may be failing for some other reason. Are there any storage account metrics pointing to a specific error?

Here are more examples of errors that we see in the logs:

Mon Jun 17 13:46:43 EEST 2024 : blobfuse2[5173] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1331)]: BlockBlob::StageAndCommit : Failed to stage to blob Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak with ID OTVhYTgxOGItZjAxMS00YjQ0LTcyYWYtMTcwYzBiYWM3MGE3 at block 654311424 [PUT https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 The specified blob or block content is invalid.
ERROR CODE: InvalidBlobOrBlock
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidBlobOrBlock</Code><Message>The specified blob or block content is invalid.
RequestId:c4597294-801e-003c-43a3-c0f2da000000
Time:2024-06-17T10:46:43.8223452Z</Message></Error>
--------------------------------------------------------------------------------
]
Mon Jun 17 13:46:43 EEST 2024 : blobfuse2[5173] : [/opt/blobfuse/fuse_container] LOG_ERR [block_blob.go (1160)]: BlockBlob::TruncateFile : Failed to stage and commit file Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak%!(EXTRA string=PUT https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site5/site5-db-20240617-104510-809-full.bak
--------------------------------------------------------------------------------
RESPONSE 400: 400 The specified blob or block content is invalid.
ERROR CODE: InvalidBlobOrBlock
--------------------------------------------------------------------------------
Mon Jun 17 13:45:18 EEST 2024 : blobfuse2[5173] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(ResponseError) : HEAD https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site7
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
Mon Jun 17 16:35:39 EEST 2024 : blobfuse2[17024] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(ResponseError) : GET https://mystorage.blob.core.windows.net/mystorage/Client/BackupFolder/site9/site9-db-20240617-104615-7185-v16-Dump.backup
--------------------------------------------------------------------------------
RESPONSE 416: 416 The range specified is invalid for the current size of the resource.
ERROR CODE: InvalidRange
--------------------------------------------------------------------------------

@olegshv (Author) commented Jun 27, 2024

Hi!
Finally, we got to the bottom of the "FlushFileBuffers" error: it is caused by MSSQL backup COMPRESSION.
As far as I understand, it is impossible to combine block_cache and backup compression. We will continue testing. Thank you!

[screenshots]

@vibhansa-msft (Member)

Can you share more details on why backup compression and block-cache cannot be combined?
If the data is compressed and then stored as a file to storage using block-cache, it should work just fine. If the data is written to a file first and compression is then applied to that data (which now resides in the cloud), there is a huge cost to downloading the data again, compressing it, and re-uploading the compressed data. What is the exact use case here?

@olegshv (Author) commented Jun 27, 2024

@vibhansa-msft (Member)

Truncation of a file to a larger or smaller offset is something we support, provided the file was previously closed and uploaded. If an application tries to truncate a file in transit, it may fail.

@olegshv (Author) commented Jun 27, 2024

Unfortunately, the tests strongly suggest that it does not work with MSSQL compression.

I turned off compression and ran the tests, and again there are problems; MSSQL is OK so far, but not PostgreSQL :(
pg_dump: error: could not open file "/mnt/DailyClientBackup/site1-20240627-132357/site1/site1-dump.backup": Input/output error

Thu Jun 27 16:25:57 EEST 2024 : blobfuse2[798] : [/opt/blobfuse/fuse_container] LOG_ERR [block_cache.go (357)]: BlockCache::OpenFile : Block size mismatch for Client/SQLDailyBackup/site1/site1-db-20240627-132357-813-v16-Dump.backup [block: JeHoTH2cS8dr294cbrYiIQ==, size: 24395776]
Thu Jun 27 16:25:57 EEST 2024 : blobfuse2[798] : [/opt/blobfuse/fuse_container] LOG_ERR [libfuse_handler.go (710)]: Libfuse::libfuse_open : Failed to open Client/SQLDailyBackup/site1/site1-db-20240627-132357-813-v16-Dump.backup [block size mismatch for Client/SQLDailyBackup/site1/site1-db-20240627-132357-813-v16-Dump.backup]

Debug logs:
blobfuse2log.txt
And now we have some logs from Azure:
[screenshot]

@olegshv (Author) commented Jun 28, 2024

Update: yesterday's issue was resolved after changing the ACL rules on a directory in the storage via Azure Storage Explorer. This is very strange, since the rules already looked correct; we will continue to monitor.

Another critical issue is the network share crash. This problem has been haunting us since the very beginning of testing, and so far I have not been able to understand what is causing it.

We were running 6 backups in parallel. After about 5 minutes, the network share crashed.

[screenshot]

These are the logs we see in the last seconds before the network share became unavailable.

[screenshots]

Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : response 201
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : exit due to non-retriable status code
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:11 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1095)]: BlockCache::waitAndFreeUploadedBlocks : Block cleanup for block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 69, offset 9261023232)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1049)]: BlockCache::lineupUpload : Upload block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 86, offset 11542724608, data 134217728)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [block_blob.go (1422)]: BlockBlob::StageBlock : name Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : =====> Try=1
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Request) : ==> OUTGOING REQUEST (Try=1)
   PUT https://mystore.blob.core.windows.net/mycontainer/Client%2FBackupFolder%2Fsite1%2Fsite1-db-diff.bak?blockid=iQxbQx0HQPtPoJxTDstH%2BQ%3D%3D&comp=block
   Accept: application/xml
   Authorization: REDACTED
   Content-Length: 134217728
   Content-Type: application/octet-stream
   User-Agent: Azure-Storage-Fuse/2.3.0 (Debian GNU/Linux 11 (bullseye)) azsdk-go-azblob/v1.3.2 (go1.22.1; linux)
   X-Ms-Date: Thu, 27 Jun 2024 22:56:12 GMT
   x-ms-version: 2023-11-03

Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1095)]: BlockCache::waitAndFreeUploadedBlocks : Block cleanup for block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 70, offset 9395240960)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [block_cache.go (1049)]: BlockCache::lineupUpload : Upload block 731=>Client/BackupFolder/site1/site1-db-diff.bak (index 87, offset 11676942336, data 134217728)
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [block_blob.go (1422)]: BlockBlob::StageBlock : name Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Retry) : =====> Try=1
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Request) : ==> OUTGOING REQUEST (Try=1)
   PUT https://mystore.blob.core.windows.net/mycontainer/Client%2FBackupFolder%2Fsite1%2Fsite1-db-diff.bak?blockid=T%2Ff10O7YSj1hwuzlWbIZFQ%3D%3D&comp=block
   Accept: application/xml
   Authorization: REDACTED
   Content-Length: 134217728
   Content-Type: application/octet-stream
   User-Agent: Azure-Storage-Fuse/2.3.0 (Debian GNU/Linux 11 (bullseye)) azsdk-go-azblob/v1.3.2 (go1.22.1; linux)
   X-Ms-Date: Thu, 27 Jun 2024 22:56:12 GMT
   x-ms-version: 2023-11-03

Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [libfuse_handler.go (1142)]: Libfuse::libfuse_utimens : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : Client/BackupFolder/site1/site1-db-diff.bak served from cache
Fri Jun 28 01:56:12 EEST 2024 : blobfuse2[792] : [/opt/blobfuse/fuse_container] LOG_DEBUG [utils.go (151)]: SDK(Response) : ==> REQUEST/RESPONSE (Try=1/1.989390259s, OpTime=1.989417219s) -- RESPONSE RECEIVED
   PUT https://mystore.blob.core.windows.net/mycontainer/Client%2FBackupFolder%2Fsite1%2Fsite1-db-diff.bak?blockid=M4kx5pwnSG1%2BhhGDzHZ7JQ%3D%3D&comp=block
   Accept: application/xml
   Authorization: REDACTED
   Content-Length: 134217728
   Content-Type: application/octet-stream
   User-Agent: Azure-Storage-Fuse/2.3.0 (Debian GNU/Linux 11 (bullseye)) azsdk-go-azblob/v1.3.2 (go1.22.1; linux)
   X-Ms-Date: Thu, 27 Jun 2024 22:56:10 GMT
   x-ms-version: 2023-11-03
   --------------------------------------------------------------------------------
   RESPONSE Status: 201 Created
   Content-Length: 0
   Date: Thu, 27 Jun 2024 22:56:11 GMT
   Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
   X-Ms-Content-Crc64: iS+krsn+pYw=
   X-Ms-Request-Id: 13793667-001e-0040-24e5-c8dc25000000
   X-Ms-Request-Server-Encrypted: REDACTED
   X-Ms-Version: 2023-11-03

@vibhansa-msft (Member)

By "network share crash" do you mean the blobfuse mount path becomes unstable after some time? When you say the backups run in parallel, are they running against the same path, or do you have different mounts? Also, if the crash is consistent, you can use the "--foreground" CLI option; when it crashes it will dump some logs to the console, kindly share those. In foreground mode the mount blocks the console, so you have to run the backups or anything else from a different console.

@olegshv (Author) commented Jun 28, 2024

Yes, I mean that for some reason the mount point disappears, while on the graph we see the RAM being freed.
[screenshot]

By "run in parallel" I mean that all backups are written to a single mount point.

I added the --foreground option, and this is what we have after the crash:
[screenshot]

@olegshv (Author) commented Jul 1, 2024

Hi, just wondering whether these logs helped you in any way? I'm worried whether this can be fixed quickly at all. If there's anything else, I'm ready for shamanic dances and prayers :)
Thank you!

@souravgupta-msft (Member)

Hi @olegshv. Can you also please share the debug logs for the above crash?
We are trying to replicate this error. Is this error happening frequently for your use case?

@olegshv (Author) commented Jul 3, 2024

@souravgupta-msft hi!
Yes, of course. Here are the logs:
blobfuse2_updated.log

I would say that with this configuration every launch will crash. Depending on the configuration parameters, the crash can occur at different times. I'll record some more logs for you.

UPD: it crashed again, with the following configuration:
[screenshots]

@vibhansa-msft (Member)

We have fixed this in our feature branch. The next release of Blobfuse will include the fix; you can expect the release early next week.

@olegshv (Author) commented Sep 5, 2024

Hi @vibhansa-msft!
Unfortunately, I cannot back up one large database (3.5 TB). It crashes under various settings.

With block-size-mb: 128 we get this error (which is strange, because 128 MB × 50,000 = 6,400,000 MB ≈ 6.1 TB, and our file is only a little over half that):

--------------------------------------------------------------------------------
RESPONSE 400: 400 The block list may not contain more than 50,000 blocks.
ERROR CODE: BlockListTooLong
--------------------------------------------------------------------------------

With block-size-mb: 256 or more, we get a different error.

Here is an example of the configuration:
[screenshot]

Here are the logs:
blobfuse22.log
[screenshot]

Here are more logs from my experiments. I have the impression that I simply overlooked something, but at first glance everything looks fine.
blobfuse21.log
blobfuse2.log
[screenshot]

@vibhansa-msft (Member)

@syeleti-msft kindly take a look into this.

@olegshv (Author) commented Sep 10, 2024

@syeleti-msft, @vibhansa-msft I'm sorry to bother you, but do you have any updates? Is there anything I can do to help?

@syeleti-msft syeleti-msft reopened this Sep 11, 2024
@syeleti-msft (Member)

Thanks for sharing the logs. We were able to reproduce this issue and are working on fixes. We will update you once the fixes are ready.

@vibhansa-msft vibhansa-msft removed their assignment Oct 12, 2024
@vibhansa-msft vibhansa-msft modified the milestones: v2-2.3.2, v2-2.4.0 Oct 12, 2024
@olegshv (Author) commented Oct 28, 2024

Hello, I would like to ask about the approximate timeframe for when the v2.4.0 update might be available, as this issue and #1547 are very critical for us.
Thank you in advance.

@vibhansa-msft (Member)

We are planning to release by the end of November 2024.

@vibhansa-msft (Member)

The changeset has been merged to the main branch. The next release of Blobfuse will include the fix.
