Oracle RMAN backup fails #1230
I tried changing the timeouts/expiration configuration options:
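For context, the kinds of timeout/expiration options involved look like this in a blobfuse2 YAML config (a sketch with placeholder values, assuming the standard libfuse/file_cache/attr_cache components - not the actual config used here):

```yaml
# Sketch only - placeholder values, not the poster's actual settings.
libfuse:
  attribute-expiration-sec: 120   # kernel-side attribute cache lifetime
  entry-expiration-sec: 120       # kernel-side directory-entry cache lifetime
file_cache:
  path: /mnt/blobfusetmp          # placeholder cache path
  timeout-sec: 120                # evict unused cached files after this many seconds
attr_cache:
  timeout-sec: 3600               # blobfuse-side metadata cache lifetime
```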
But I still get the error, and it seems to be related to "connection reset by peer" on the storage account.
Between these two timestamps - 15:01:30 -> 15:03:15 - I get the connection errors in the blobfuse2.log:
This article - https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-database-backup-strategies#blobfuse - is fairly new (or recently updated), and I noticed this paragraph: "Blobfuse is ubiquitous across Azure regions and works with all storage account types, including general-purpose v1/v2 and Azure Data Lake Storage Gen2. But it doesn't perform as well as alternative protocols. For suitability as the database backup medium, we recommend using the SMB or NFS protocol to mount Azure Blob Storage." Do you have recommendations or experience with this? Is Blobfuse2 a bad fit for doing Oracle RMAN backups/restores? We have two main objectives:
Not sure objective 2 is possible ...
If your backend storage is Azure Files, then Blobfuse is not the tool to use, as we do not support Files. You have removed the "azstorage" section from your config.
In the error logs you have posted, "context cancelled" points to a situation where either the operation was cancelled from the application end or there was a timeout. And there are some "connection reset by peer" errors, which might point to backend errors - most likely a transient issue from storage. Can you share what kind of authentication mode you are using, as I do not see your "azstorage" section?
Hi @vibhansa-msft. I am using key authentication. This is the "azstorage" section; I had left it out.
Can I please ask you to explain in detail what attribute caching means? I am not sure I get that part, and also why it reduces cost - due to fewer network round trips?
This is the debug log when starting/mounting:
I notice this part:
The retry count, max timeout, etc. In the previous log the request says:
and
Attr-Cache: Here Blobfuse caches the metadata of each blob (size, LMT, etc.). This is cached in process memory (no disk space used) and returned to the kernel when required. Getting the attributes of a file or directory is a very frequent operation in the kernel (there are such calls before every open/write), and if caching is disabled, each of these calls results in a ListBlob API call to storage (a charged API). Having this cache in Blobfuse thus reduces both your cost and network delays.

Streaming: Streaming as a solution shall be used only when you do not have enough disk space. For example, say you have a 1 TB file that you want to read/process but do not have that much free disk space. In such cases streaming is recommended; however, as data is downloaded on demand, this solution will be quite slow. It is useful in workflows where you either deal with fairly large files or wish to read just a portion of a file without downloading the entire contents.

Retry: The default retry configuration shall be left as-is unless there is a very specific use case that requires tuning. In your case it is most likely network flakiness causing the connection terminations, so you should look at why the connections are not stable. There might be some sort of throttling at your account level causing the backend server to close the connections.
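For reference, a minimal sketch of how the attr-cache and streaming knobs appear in a blobfuse2 config (values are placeholders; key names follow the project's sample configs, so verify them against your version):

```yaml
# Sketch with placeholder values.
attr_cache:
  timeout-sec: 7200    # how long cached blob metadata (size, LMT, ...) stays valid

# Streaming replaces file_cache in the components list when local disk is scarce.
stream:
  block-size-mb: 8     # size of each on-demand download
  buffer-size-mb: 8    # size of each in-memory buffer
  max-buffers: 30      # total buffer count, i.e. the memory budget for streaming
```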
I have found different blog posts with people doing it - e.g. https://www.bluecrystal.com.au/news/the-options-for-oracle-backup-in-azure-blobfuse/ - so I am wondering why we have issues. Most examples I find, though, seem to be using Blobfuse v1.4/1.3. You refer to an unstable connection. Perhaps. Our VM and blob container are both in the same region (West Europe). (BTW, what does "LMT" mean? :))
LMT: last modified time.
@vibhansa-msft , do you have some good references internally or externally, or examples of others doing Oracle backups using Blobfuse? |
Looking at libfuse_handler.go, it seems there is no retry mechanism implemented. Why?
There are layers in Blobfuse; libfuse is not the one responsible for retries. Retries are done in the azstorage component.
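The retry knobs accordingly sit under the azstorage section. A sketch, where the key names are assumptions based on the retry count / max timeout values visible in the mount log above - verify them against your version's sample config:

```yaml
# Sketch - key names are assumptions; check the sample config for your version.
azstorage:
  max-retries: 5        # attempts per failed request
  max-timeout: 900      # per-request timeout in seconds
  retry-backoff: 4      # base backoff in seconds between attempts
  max-retry-delay: 60   # cap on the exponential backoff, in seconds
```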
So @vibhansa-msft, when seeing this:
We can't tell if it's actually retrying. Is it?
We need to understand why the connection is being reset by the backend server here.
I have opened a support case with Azure Support. Hoping to get some assistance. |
In your config, can you enable "sdk_trace: true" under the "azstorage" section? This will start dumping the REST calls we make, and each will include a request-id field. For any failed request we can use this field's value to trace logs in the backend, so the support team can help once you provide it.
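For illustration, the suggestion above would look roughly like this (account details are placeholders):

```yaml
# Sketch - account details are placeholders.
azstorage:
  type: block
  account-name: <storage-account>
  container: <container>
  mode: key                # key authentication, as used in this thread
  sdk_trace: true          # dump REST calls, including the request-id field
```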
Failed again (timeout 10 secs): Oracle RMAN error:
So the truncate is not the cause of this? But truncate fails due to "connection reset by peer"?
Not sure if this is comparable, but S3 Fuse seems to be using this: S3_RETRYABLE_ERRORS = ( |
So, Azure Support is telling me that I am exhausting the disk (P30 - 200 MBps) and that this might be causing the problem!?
Blobfuse, being a client-side tool, has no idea about your disk quota or IOPS limits in the backend. Unless the backend returns specific information, Blobfuse has no way to show or throttle anything. From the Blobfuse end, when an operation is throttled it will just retry after taking a backoff, but it has no way of telling why the backend decided to throttle. In some scenarios the backend returns specific errors, and those can be dug out of our sdk-trace logs.
Hi @vibhansa-msft.
I am trying to run the exact same backup using the Azure File Share - just to see the load / performance. |
I have found several articles/blog posts mentioning the use of Blobfuse for Oracle backups - https://bluexp.netapp.com/blog/azure-cvo-blg-oracle-rman-backup-on-azure-blob - but perhaps it worked better with Blobfuse v1?
@vibhansa-msft , I did not see your comment on the sdk-trace yesterday. Apologies. |
I now have the error again with sdk-tracing enabled:
sdk-trace:
Using Blobfuse2 is also documented as one possible solution here: |
Enabling the SDK trace will help if the backend returns a specific error code and message that shows signs of throttling. If the connection just gets aborted, we will not get much info from the logs; in the errors above I see "connection reset by peer", which means we receive no error code from storage - it is just disconnected abruptly. This might be due to various reasons, and at the Blobfuse end we cannot assume throttling is the reason. If the backend team is saying it is throttled, then they have the logs to prove it; there is nothing we can debug from the Blobfuse end here. When you say things work fine with Files, that may be because it is a different backend storage altogether: Files and Blobs cannot be compared, as they are handled differently in the backend. If we are observing that Blobfuse is getting throttled, you can try other tools like AzCopy to download a good volume of data from the storage account and check whether that gets throttled as well (just to validate this throttling theory). If you increase some of the caching timeouts, it will result in fewer calls to storage, and that might save you some of these throttling errors.
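As an illustration of that last suggestion, raising the cache timeouts could look like this (placeholder values, not recommendations):

```yaml
# Sketch - longer cache lifetimes mean fewer REST calls to storage.
libfuse:
  attribute-expiration-sec: 240   # kernel attribute cache, raised
  entry-expiration-sec: 240       # kernel directory-entry cache, raised
attr_cache:
  timeout-sec: 7200               # keep blob metadata cached for 2 hours
file_cache:
  timeout-sec: 600                # keep downloaded files around longer
```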
@vibhansa-msft. What I do see with the defaults today vs. Friday is that we are not getting the same throughput: Friday ~1420 MBps, today ~1250 MBps.
@vibhansa-msft, it's really strange and honestly very confusing. |
When you run with all these different configs, do we check the graphs to see where we land in terms of bandwidth utilization?
When we say it failed, it shall be correlated back to the overall bandwidth usage or backend throttling kicking in. Sometimes even the same config may not hit exactly the same issue: if responses from the backend are slow, we will not be able to post the next requests at the same pace.
We started checking the graphs later, but according to Azure Support (when we asked them to check yesterday), we are NOT hitting the VM bandwidth limit. The VM we are using is a D8ds_v5 with 8 vCPUs, not 32.
So in that case we shall keep the max-concurrency value lower, maybe 8 or even less. One thing you can try is keeping this value at '6' and increasing the block-size to 16 MB to reduce the number of independent requests going to storage.
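In config terms, that suggestion is roughly the following (key names assume the standard azstorage options; verify for your version):

```yaml
# Sketch of the suggested tuning: fewer parallel workers, bigger blocks.
azstorage:
  max-concurrency: 6   # parallel transfer workers, kept below the 8 vCPUs
  block-size-mb: 16    # larger blocks -> fewer independent requests to storage
```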
@vibhansa-msft, what's the background for suggesting this change when everything seems to be running fine (except for when it doesn't :)) ? |
Yes, that's fine. I was suggesting it based on the VM config, which says you have only 8 vCPUs, so even having 8 threads may create resource contention at times.
It's also our own "feeling" that this makes sense. |
Agreed, but this is beyond Blobfuse's control as well. The network team needs to provide input on what has changed and why we are not able to hit those numbers again, as from the Blobfuse end it is the same config. One way to boost performance from our side is to disable sdk-trace and reset the log level to LOG_ERR. This will reduce the logging overhead and put more stress on the network, and you might hit the same network limits again. Though this is just an experiment and does not prove much.
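The performance-run logging setup described above would look something like this (a sketch):

```yaml
# Sketch - minimal logging for a performance experiment.
logging:
  level: log_err    # errors only, to cut logging overhead
azstorage:
  sdk_trace: false  # stop dumping every REST call
```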
@vibhansa-msft, just installed the preview and mounted with the --ignore-sync option. |
Any interesting find with the new binary and ignore-sync option? |
Hi @vibhansa-msft.
But it seems that files are still downloaded again for the "restore validation" part. There is so much logging that it is difficult to tell whether some of the files are served directly from cache rather than downloaded again. Thinking about it: when doing the restore validate, it will start downloading whatever is not in cache, but this will also fill the cache and then evict the oldest files (LRU?). Files with timestamp 19:19 were the last files being backed up, but the re-downloaded files (18:35-18:57) accumulate quickly, and then the oldest from the backup (19:19) get evicted:
After eviction:
But I am pretty sure '--ignore-sync' works as expected :) |
Hi @vibhansa-msft.
blobfuse2_using-cache_20230907.zip |
In these logs I do not see any hits for "connection reset" or any sort of failure where Blobfuse needed to retry. Looks good to me with the current set of changes. Anything unnatural that you observed during this run?
I also see logs like the below, which indicate we did serve the files from the local cache instead of re-downloading them:
This is the exact sequence we expect for a file that was synced and not yet removed from cache, so that it can be reused later:
I think it works as expected. I did another run with timeout-sec set to 60 seconds to force eviction; I have attached the files. P.S. I added you to the support case mail.
pacacust02testord52941c5-db-1.backend.mdm.stibosystems.com_step_incr0_2023-09-07_1548_2412257.log
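For reference, the eviction-forcing setting used in this run would look like this (placeholder path):

```yaml
# Sketch - short cache retention so eviction kicks in quickly.
file_cache:
  path: /mnt/blobfusecache   # placeholder cache path
  timeout-sec: 60            # evict unused cached files after 60 seconds
```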
At least I believe we can conclude that ignore-sync works and you can release that. |
Yes, I have already put it up in a PR, and this feature will be made public in the next release.
For the 60-second timeout part, I can see cache clearance kicking in every minute. If you wish to validate that part, you can search for the log line below in the log file and check the time mentioned on it:
At times this line is followed by another line saying there are one or two files that can be evicted on this pass, which means the logic is working as expected.
@vibhansa-msft , when should I expect 2.1.1 to be released? |
As of now no exact date is planned. We have just released 2.1.0, so you can expect at least two months before the next release.
@vibhansa-msft, we have the 2.1.1-preview.1. I guess we can use that, no?
One last question, on the health-monitor. |
@souravgupta-msft, can you provide the required info on the health monitor here?
@mortenjoenby, you can refer to the health monitor config here. I would like to mention that the network profiler is not currently implemented, and that the health monitor is a preview feature.
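A sketch of what a health-monitor section can look like; this is a preview feature, and the key names below are assumptions based on the linked docs, so verify them for your version:

```yaml
# Sketch - key names are assumptions; health monitor is a preview feature.
health_monitor:
  enable-monitoring: true
  stats-poll-interval-sec: 10    # how often blobfuse stats are polled
  monitor-disable-list:
    - network_profiler           # not currently implemented, per the comment above
```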
Which version of blobfuse was used?
v2.0.5
Which OS distribution and version are you using?
Oracle Linux Server release 8.7 - 4.18.0-425.13.1.el8_7.x86_64
If relevant, please share your mount command.
What was the issue encountered?
We are running Oracle RMAN backups today, backing up to Azure Files (Premium), but want to move to Blob storage.
I am testing this out, but currently when running a backup, it fails pretty quickly with the following error:
This seems to happen after some time, with the following connection error from Blobfuse:
Have you found a mitigation/solution?
No, but I am wondering whether it is related to some of the timeout/expiration values used - e.g., if file_cache has to renew in the middle of the operation?
Blobfuse2 configuration
Please share logs if available.