Problems with multithreaded reading/writing via block_cache #1426
Comments
How big is the file that you are trying to upload (the .bak file which failed)?
Also, you have enabled disk-based caching of blocks. Do you really need this? It is going to consume space on your local disk. If your workflow is to upload the data and then read it back, then you should keep it, but if the job is only to upload the data, then you can get rid of disk-based caching completely.
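For context, here is a minimal sketch of how that trade-off shows up in the block_cache section of a blobfuse2 config. This is not the configuration used in this issue; the option names (block-size-mb, mem-size-mb, path, disk-size-mb) follow the blobfuse2 sample configuration and should be verified against your installed version, and the values are placeholders rather than recommendations.

```bash
# Sketch only: write an illustrative config with disk-based block caching disabled.
cat > ./fuse_connection.yaml <<'EOF'
components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage

block_cache:
  block-size-mb: 64      # size of each block staged for upload/download
  mem-size-mb: 8192      # RAM budget for staged blocks
  # Uncomment the two lines below only if you also read the data back after
  # uploading; for upload-only workflows the disk cache just consumes local disk.
  # path: /mnt/blockcache
  # disk-size-mb: 102400

# azstorage settings (account, container, auth) are omitted from this sketch.
EOF
```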
Hello! Thank you for your reply. This .bak file is about 3.6 TB, but I have already transferred five 3.6 TB files before and it was successful. Did I understand correctly that here I need to change the block-size to more than 64 MB? Regarding the second question, our process is that we make database backups directly to the storage via blobfuse and restore them from there.
With a 64 MB block size you can upload at most a 3.05 TB file. If your file sizes go beyond this then you need to bump up the block-size. You can try 128 MB, which allows you to go beyond 4 TB.
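As a quick sanity check of those numbers (the 50,000-block limit is Azure's hard cap per block blob, so the maximum file size scales linearly with the configured block size):

```bash
# Approximate maximum blob size for a given block-cache block size,
# given the 50,000 blocks-per-blob limit.
for bs in 64 128 256 512; do
  awk -v bs="$bs" 'BEGIN { printf "block-size-mb=%-3d -> max file ~ %.2f TiB\n", bs, bs * 50000 / 1024 / 1024 }'
done
# block-size-mb=64  -> max file ~ 3.05 TiB
# block-size-mb=128 -> max file ~ 6.10 TiB
# block-size-mb=256 -> max file ~ 12.21 TiB
# block-size-mb=512 -> max file ~ 24.41 TiB
```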
Understood, thank you! Unfortunately, the tests are bad with 128 MB, but I found the article 'About block blobs' and now I understand that there is still room for improvement. I will test it with a 256+ MB block size and report back with the results; maybe someone from our community will be interested. https://learn.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs#about-block-blobs Could you please clarify the following questions?
Did the above config changes help in uploading the file?
Hello, Vikas! Here is my configuration (it is used for 15-18 parallel write/read operations with files from 250 MB to 2.7 TB):
Here is a graph to visualise the test with these settings. I also ran into a strange problem: my jobs failed with the error "The block list may not contain more than 50,000 blocks." Without going off topic, I would like to ask whether I understand the architecture correctly.
If the structure I described above is correct, then the question arises again: how do we get the error "The block list may not contain more than 50,000 blocks." when transferring a file that is not even more than 64 MB * 50,000 = 3.2 TB?

There is also a separate, very important question. When scaling horizontally (adding more VMs with Blobfuse2), can we hit any limits on data transfer to the same Storage Account? From what I found in the article on storage account limits, there is a default maximum for ingress/egress traffic (60 Gbps / 120 Gbps) per Storage Account. At first glance, I don't see any more critical restrictions. I would be truly grateful for any additional information. I apologize for the possibly obvious questions, but it's critical for me to understand this topic for proper use.
Thank you for your answers!
TeamCity logs when it happened:
My config (VM: 16 CPU, 128 GB RAM):
I'm not sure, but it might be useful to see the load graphs. After a failed backup, I interrupted the process and repeated it with different configurations. UPD:
Hello! In this case, three scenarios with 64, 128 and 256 block-size-mb were recorded. We ran 8 database backups at the same time.
My config (Standard E8d v5 (8 vcpus, 64 GiB memory)):
I am attaching the logs of the three scenarios.
This is not due to a large block-size or exhaustion of the 50K block limit. It sounds like something different; it may be failing on the backend for some other reason. Are there any storage account metrics pointing to a specific error?
Is this the only error you are hitting, or is there something else? Due to the long messages I might have overlooked some part, so I just wanted to understand the real issue you are facing here.
Any updates on this issue?
Can you share more details on why backup compression and block-cache cannot be combined?
I'm not ruling out that I could be wrong, but it looks like this...
Truncating a file to a larger or smaller offset is something that we support, provided that the file was closed and uploaded previously. If the application tries to truncate a file in transition then it might lead to a failure.
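To illustrate the distinction with a hypothetical sequence on an existing mount (the paths below are made up for the example):

```bash
# Supported case: the file was fully written, closed and uploaded before the truncate.
cp /backups/db.bak /mnt/blobfuse/backups/db.bak   # writer finishes and closes the file
truncate -s 100G /mnt/blobfuse/backups/db.bak     # resize after the upload completed

# Risky case: truncating while another writer still has the file open
# ("in transition"), e.g. a backup tool resizing its own in-progress output.
```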
Unfortunately, tests strongly suggest that it does not work with MSSQL compression. I turned off the compression and ran the tests, and there were problems again; MSSQL is OK so far, but not PostgreSQL :(
Debug logs:
Update. Yesterday's issue was resolved after changing the ACL rules on a directory in the storage via Azure Storage Explorer. This is very strange, since the rules were already set correctly; we will continue to monitor. Another critical issue is the network share crash. This problem has been haunting us since the very beginning of testing, and so far I have not been able to understand what is causing it. We were running 6 backups in parallel. After about 5 minutes, the network share crashed. These are the logs we see in the last seconds before the network share became unavailable.
By network share crash do you mean the blobfuse mount path becomes unstable after some time? When you say 5 backups running in parallel, are they running against the same path or do you have 5 different mounts? Also, if the crash is consistent you can use the "--foreground" CLI option; when it crashes it will dump some logs to the console, kindly share those. In foreground mode the mount will block the console, so you have to run the backup or other things from a different console.
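For example, reusing the mount command shared earlier in this issue (adjust paths to your setup):

```bash
# Foreground mount so that any crash output is printed to this console.
blobfuse2 mount /opt/blobfuse/fuse_container/ \
    --config-file=/opt/blobfuse/fuse_connection.yaml \
    --foreground
# This console stays blocked while the mount is alive; start the parallel backups
# from a different terminal and capture whatever is printed here when it crashes.
```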
Hi, just wondering if these logs helped you in any way? I'm worried about whether this can be fixed quickly at all. If there's anything else, I'm ready for shamanic dances and prayers :)
Hi @olegshv. Can you also please share the debug logs for the above crash?
@souravgupta-msft hi! I would say that in this configuration, every launch will crash. Depending on the configuration parameters, the crash can occur at different times. I'll record some more logs for you. |
We have fixed this in our feature branch. The next release of Blobfuse will have the fix. You can expect the release early next week.
Hi @vibhansa-msft! If block-size-mb: 128, we get an error (but this is strange, because 128 MB * 50,000 = 6,400,000 MB, and our file is almost half that size).
If block-size-mb: 256 or more, we get another error. Here is an example of the configuration: Here are the logs: Here are more logs from the experiments. I have the impression that I just didn't take something into account, but at first glance everything is fine.
@syeleti-msft kindly take a look into this.
@syeleti-msft, @vibhansa-msft I'm sorry to bother you, but do you have any updates? Is there anything I can do to help?
Thanks for sharing the logs. We are able to reproduce this issue and are working on fixes. We will update you once the fixes are ready.
We are planning to release by the end of Nov-24.
The changeset has been merged to the main branch. The next release of Blobfuse will have the fix.
Which version of blobfuse was used?
blobfuse2 version 2.3.0
Which OS distribution and version are you using?
Debian GNU/Linux 11 (bullseye)
If relevant, please share your mount command.
blobfuse2 mount /opt/blobfuse/fuse_container/ --config-file=/opt/blobfuse/fuse_connection.yaml
My fuse_connection.yaml file
What was the issue encountered?
A brief description of what we do with blobfuse2:
We are currently testing the transition to stream or block_cache. After a few days of testing, stream was abandoned altogether, as we never managed to complete a backup successfully with it. block_cache gives us hope in our hearts :)
For example, here is a 3.6TB database backup running in 5 threads in parallel. It's ok.
But here's an example of a 15-thread job, where there are small databases and one large one; the load is not high, and it failed.
There is no load, but we get an error.
Unfortunately, after reading a lot of information and documentation, I did not find an answer on scaling and configuration for large loads. We ran tests even with large VMs (32 CPU, 256 GB RAM), but it didn't work.
I would like to get some recommendations on how to solve this case in the most reliable way. Which block_cache settings are better to use for such tasks?
Have you found a mitigation/solution?
No
Please share logs if available.
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
Response contained no body
BlockListTooLong
The block list may not contain more than 50,000 blocks.RequestId:3a61884d-201e-0005-6461-b863c1000000
Time:2024-06-06T22:31:44.8071145Z