Oracle RMAN backup fails #1230

Closed
mortenjoenby opened this issue Aug 19, 2023 · 170 comments · Fixed by #1232 or #1238

@mortenjoenby

Which version of blobfuse was used?

v2.0.5

Which OS distribution and version are you using?

Oracle Linux Server release 8.7 - 4.18.0-425.13.1.el8_7.x86_64

If relevant, please share your mount command.

What was the issue encountered?

We are running Oracle RMAN backups today, backing up to Azure Files (Premium), but want to move to Blob storage.
I am testing this out, but currently when running a backup, it fails pretty quickly with the following error:

channel C1: starting incremental level 0 datafile backup set
channel C1: specifying datafile(s) in backup set
input datafile file number=00013 name=/database/db1/oradata/step/stepsysblob_step_5.dbf
channel C1: starting piece 1 at 19-AUG-2023 13:56:27
released channel: C1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 08/19/2023 13:58:02
ORA-19510: failed to set size of 1007750 blocks for file "/fuse-backup/step/0/2023-08-18_1256/STEP_102493_1_2t2477kb_20230819.incr0" (block size=8192)
ORA-27059: could not reduce file size
Linux-x86_64 Error: 5: Input/output error
Additional information: 2

This seems to happen after some time, with the following connection errors from Blobfuse:

Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_TRACE [block_blob.go (836)]: BlockBlob::WriteFromFile : name step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_INFO [block_blob.go (819)]: BlockBlob::calculateBlockSize : step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0 size 8255496192, blockSize 4194304
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=UzPry3AyRIpBa%2BqlTdVWqA%3D%3D&comp=block&timeout=901": use of closed network connection]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=1gpkBl4pQf5DN2TGRL%2FAzQ%3D%3D&comp=block&timeout=901": write tcp 10.100.0.69:18844->20.60.222.65:443: write: connection reset by peer]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=OOu7RGw1RBBPSAH6k%2BmnFg%3D%3D&comp=block&timeout=901": write tcp 10.100.0.69:18962->20.60.222.65:443: write: connection reset by peer]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=FCIZ67fdSm5FpjCGNy3Ebw%3D%3D&comp=block&timeout=901": write tcp 10.100.0.69:19098->20.60.222.65:443: write: connection reset by peer]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=D5gLXzKARY5TkO1lU6wb5Q%3D%3D&comp=block&timeout=901": write tcp 10.100.0.69:18896->20.60.222.65:443: write: connection reset by peer]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=ODnsCTI3QYVBgW13tlE0CA%3D%3D&comp=block&timeout=901": use of closed network connection]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=WGuEGr%2FLS2dHJWX3Pq1TmA%3D%3D&comp=block&timeout=901": use of closed network connection]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=x8FdQSwNQdF9kMHuUO8ncA%3D%3D&comp=block&timeout=901": use of closed network connection]
Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=pVu1OJKoQzJaKf4BXk9WZw%3D%3D&comp=block&timeout=901": use of closed network connection]
Sat Aug 19 13:53:02 UTC 2023 : blobfuse2[6036] : LOG_DEBUG [block_blob.go (906)]: BlockBlob::WriteFromFile : Upload complete of blob step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0
Sat Aug 19 13:53:02 UTC 2023 : blobfuse2[6036] : LOG_CRIT [block_blob.go (914)]: TimeTracker :: [BlockBlob::WriteFromFile] step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0 => 20.095938688s
Sat Aug 19 13:53:02 UTC 2023 : blobfuse2[6036] : LOG_TRACE [libfuse_handler.go (836)]: Libfuse::libfuse_release : step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0, handle: 1972
Sat Aug 19 13:53:02 UTC 2023 : blobfuse2[6036] : LOG_TRACE [file_cache.go (956)]: FileCache::CloseFile : name=step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0, handle=1972

Have you found a mitigation/solution?

No, but I am wondering if it's related to some of the timeout/expiration values used - e.g. whether file_cache has to renew in the middle of the operation?

Blobfuse2 configuration

allow-other: false

logging:
  type: base
  level: log_debug
  max-file-size-mb: 32
  file-count: 10
  track-time: true

components:
  - libfuse
  - file_cache
  - azstorage

libfuse:
  default-permission: 0644
  # Required for Oracle RMAN
  ignore-open-flags: true

file_cache:
  # Pre-create dir, should be done at startup if not already existing
  path: /mnt/blobfusetmp
  allow-non-empty-temp: true
  cleanup-on-start: true

Please share logs if available.

@mortenjoenby
Author

blobfuse2.log

@mortenjoenby
Author

I tried changing the timeouts/expiration configuration options:

allow-other: false

logging:
  type: base
  level: log_debug
  max-file-size-mb: 32
  file-count: 10
  track-time: true

components:
  - libfuse
  - file_cache
  - azstorage

libfuse:
  default-permission: 0644
  attribute-expiration-sec: 300
  entry-expiration-sec: 300
  negative-entry-expiration-sec: 300
  # Required for Oracle RMAN
  ignore-open-flags: true

file_cache:
  # Pre-create dir, should be done at startup if not already existing
  path: /mnt/blobfusetmp
  timeout-sec: 300
  allow-non-empty-temp: true
  cleanup-on-start: true

But I still get the error, and it seems to be related to "connection reset by peer" on the storage account.

piece handle=/fuse-backup/step/0/2023-08-18_1256/STEP_102512_1_3g247b95_20230819.incr0 tag=T2023_08_18_1256INCR0 comment=NONE
channel C1: backup set complete, elapsed time: 00:02:45
channel C1: starting incremental level 0 datafile backup set
channel C1: specifying datafile(s) in backup set
input datafile file number=00029 name=/database/db1/oradata/step/stepsysdata_step_8.dbf
channel C1: starting piece 1 at 19-AUG-2023 15:01:30
released channel: C1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 08/19/2023 15:03:15
ORA-19510: failed to set size of 1047685 blocks for file "/fuse-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0" (block size=8192)
ORA-27059: could not reduce file size
Linux-x86_64 Error: 5: Input/output error
Additional information: 2

Between these two timestamps - 15:01:30 -> 15:03:15 - I get the connection errors in the blobfuse2.log:

Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (526)]: BlockBlob::GetAttr : name step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (472)]: BlockBlob::getAttrUsingList : name step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (540)]: BlockBlob::List : prefix step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0, marker
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (499)]: BlockBlob::getAttrUsingList : Item 0 Blob STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [block_blob.go (756)]: BlockBlob::ReadBuffer : Failed to download blob step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 [read tcp 10.100.0.69:34936->20.60.250.33:443: read: connection reset by peer]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [block_blob.go (1065)]: BlockBlob::TruncateFile : Failed to read small file step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0%!(EXTRA string=read tcp 10.100.0.69:34936->20.60.250.33:443: read: connection reset by peer)
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [file_cache.go (1325)]: FileCache::TruncateFile : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 failed to truncate [read tcp 10.100.0.69:34936->20.60.250.33:443: read: connection reset by peer]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [libfuse_handler.go (816)]: Libfuse::libfuse_truncate : error truncating file step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 [read tcp 10.100.0.69:34936->20.60.250.33:443: read: connection reset by peer]
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [libfuse_handler.go (1120)]: Libfuse::libfuse_utimens : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [file_cache.go (1206)]: FileCache::GetAttr : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (526)]: BlockBlob::GetAttr : name step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (472)]: BlockBlob::getAttrUsingList : name step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (540)]: BlockBlob::List : prefix step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0, marker
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (499)]: BlockBlob::getAttrUsingList : Item 0 Blob STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_DEBUG [file_cache.go (1237)]: FileCache::GetAttr : updating step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 from local cache
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [libfuse_handler.go (779)]: Libfuse::libfuse_flush : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0, handle: 91
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [file_cache.go (1126)]: FileCache::FlushFile : handle=91, path=step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [azstorage.go (489)]: AzStorage::CopyFromFile : Upload file step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (836)]: BlockBlob::WriteFromFile : name step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_INFO [block_blob.go (819)]: BlockBlob::calculateBlockSize : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 size 8582643712, blockSize 4194304
Sat Aug 19 15:02:26 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?blockid=I05Lsyr%2BT6RfTqrx3aLp5Q%3D%3D&comp=block&timeout=901": write tcp 10.100.0.69:34966->20.60.250.33:443: write: connection reset by peer]
Sat Aug 19 15:02:26 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?blockid=r%2BM3oa%2BYT19XGPI1a%2FgmZw%3D%3D&comp=block&timeout=901": write tcp 10.100.0.69:29710->20.60.196.225:443: write: connection reset by peer]
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_DEBUG [block_blob.go (906)]: BlockBlob::WriteFromFile : Upload complete of blob step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_CRIT [block_blob.go (914)]: TimeTracker :: [BlockBlob::WriteFromFile] step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 => 21.162660235s
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [libfuse_handler.go (836)]: Libfuse::libfuse_release : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0, handle: 91
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [file_cache.go (956)]: FileCache::CloseFile : name=step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0, handle=91
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [lru_policy.go (178)]: lruPolicy::CacheInvalidate : /mnt/blobfusetmp/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [libfuse_handler.go (1032)]: Libfuse::libfuse_fsync : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0, handle: 89
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [file_cache.go (1091)]: FileCache::SyncFile : handle=89, path=step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [libfuse_handler.go (779)]: Libfuse::libfuse_flush : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0, handle: 89
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [file_cache.go (1126)]: FileCache::FlushFile : handle=89, path=step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [azstorage.go (489)]: AzStorage::CopyFromFile : Upload file step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_TRACE [block_blob.go (836)]: BlockBlob::WriteFromFile : name step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:02:47 UTC 2023 : blobfuse2[15756] : LOG_INFO [block_blob.go (819)]: BlockBlob::calculateBlockSize : step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 size 8582643712, blockSize 4194304
Sat Aug 19 15:03:08 UTC 2023 : blobfuse2[15756] : LOG_DEBUG [block_blob.go (906)]: BlockBlob::WriteFromFile : Upload complete of blob step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0
Sat Aug 19 15:03:08 UTC 2023 : blobfuse2[15756] : LOG_CRIT [block_blob.go (914)]: TimeTracker :: [BlockBlob::WriteFromFile] step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 => 21.11014969s

@mortenjoenby
Author

This article - https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-database-backup-strategies#blobfuse - is fairly new (or recently updated), and I noticed this paragraph:

"Blobfuse is ubiquitous across Azure regions and works with all storage account types, including general-purpose v1/v2 and Azure Data Lake Storage Gen2. But it doesn't perform as well as alternative protocols. For suitability as the database backup medium, we recommend using the SMB or NFS protocol to mount Azure Blob Storage."

Do you have recommendations / experience with this? Is Blobfuse2 a bad fit for Oracle RMAN backups/restores?

We have two main objectives:

  1. Reduce cost - Blob is 1/10th of Premium Files
  2. Increase performance - Azure Files performs really badly

Not sure 2 is possible ....

@vibhansa-msft
Member

If your backend storage is Azure Files, then Blobfuse is not the tool to use, as we do not support Files.
For Blob storage, performance varies depending on your config and workflow. There are many customers who use Blobfuse for RMAN backups (blobs being their backend storage). Moving from Files to Blobs will definitely be a cost saving, so there is nothing to comment on for the first part.
For the second part, Blobfuse works with local disk caching, which means any file it reads or writes will first be created as a local copy. In backup workflows this implies an extra copy is made locally on your disk first, and the data is then uploaded to backend storage. Once the backup is done this temp copy is deleted, but because of this local caching things will slow down a bit. Any file that you read and compare to figure out whether there is a diff will also be downloaded to local disk first and then compared (unless RMAN rejects the file comparison with just the last-modified-time check). Other than this, I do not see any other issue here.

You have removed attr_cache from your pipeline, which means blobfuse is not caching any metadata and all properties are retrieved directly from storage. You can further reduce the cost by enabling attr_cache:

components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

In the error logs that you have posted, it shows "context canceled", which points to a situation where either the operation was cancelled from the application end or there was a timeout. And there are some "connection reset by peer" errors, which might point to backend errors - most likely a transient issue on the storage side.

Can you share what kind of authentication mode you are using, as I do not see your azstorage config in this one? You can remove the secrets and keys while sharing it.

@mortenjoenby
Author

Hi @vibhansa-msft.
Just to be clear, Azure Files is what we are using today, but we want to get away from it.
For this testing (and for our plan going forward) we use a Standard v2 / hot storage account.

I am using key authentication. This is the "azstorage" section, which I left out:

azstorage:
  type: block
  account-name: mybackup
  account-key: **********
  endpoint: https://mybackup.blob.core.windows.net
  mode: key
  container: wetestorablob-backup

Can I please ask you to explain in detail what the attribute caching means? I am not sure I get that part, and also why it reduces cost - is it due to fewer network round trips?
"You have removed attr_cache from your pipeline which means blobfuse is not caching any metadata and all properties are retrieved directly from storage. You can further reduce the cost by enabling attr-cache."

@mortenjoenby
Author

This is the debug log when starting/mounting:

Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_CRIT [mount.go (402)]: Starting Blobfuse2 Mount : 2.0.5 on [Oracle Linux Server 8.7]
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_CRIT [mount.go (403)]: Logging level set to : LOG_DEBUG
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [libfuse.go (253)]: Libfuse::Configure : libfuse
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [libfuse.go (244)]: Libfuse::Validate : UID 501, GID 501
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [libfuse.go (299)]: Libfuse::Configure : read-only false, allow-other false, allow-root false, default-perm 420, entry-timeout 300, attr-time 300, negative-timeout 300, ignore-open-flags: true, nonempty false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [file_cache.go (205)]: FileCache::Configure : file_cache
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [file_cache.go (282)]: FileCache::Configure : Using default eviction policy
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [file_cache.go (304)]: FileCache::Configure : create-empty false, cache-timeout 300, tmp-path /mnt/blobfusetmp, max-size-mb 0, high-mark 80, low-mark 60, refresh-sec 0
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [azstorage.go (84)]: AzStorage::Configure : azstorage
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [config.go (284)]: ParseAndValidateConfig : Parsing config
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [config.go (388)]: ParseAndValidateConfig : using the following proxy address from the config file:
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [config.go (392)]: ParseAndValidateConfig : sdk logging from the config file: false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [config.go (499)]: ParseAndReadDynamicConfig : Reparsing config
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_DEBUG [config.go (399)]: ParseAndValidateConfig : Getting auth type
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [config.go (485)]: ParseAndValidateConfig : Account: pacacust02weassetse6dfb2, Container: pacacust02wetestorablob-backup, AccountType: BLOCK, Auth: KEY, Prefix: , Endpoint: https://pacacust02weassetse6dfb2.blob.core.windows.net/, ListBlock: 0, MD5 : false false, Virtual Directory: true, Max Results For List 2, Disable Compression: false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [config.go (489)]: ParseAndValidateConfig : Retry Config: Retry count 5, Max Timeout 900, BackOff Time 4, Max Delay 60
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [config.go (492)]: ParseAndValidateConfig : Telemetry :
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [block_blob.go (173)]: BlockBlob::SetupPipeline : Setting up
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [block_blob.go (135)]: BlockBlob::getCredential : Getting credential
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_DEBUG [azauth.go (80)]: azAuth::getAzAuth : Account: pacacust02weassetse6dfb2, AccountType: BLOCK, Protocol: https, Endpoint: https://pacacust02weassetse6dfb2.blob.core.windows.net/
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [block_blob.go (260)]: BlockBlob::SetPrefixPath : path
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_TRACE [block_blob.go (209)]: BlockBlob::TestPipeline : Validating
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [mount.go (412)]: mount: Mounting blobfuse2 on /fuse-backup
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_DEBUG [mount.go (450)]: mount: foreground disabled, child = false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_CRIT [mount.go (402)]: Starting Blobfuse2 Mount : 2.0.5 on [Oracle Linux Server 8.7]
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_CRIT [mount.go (403)]: Logging level set to : LOG_DEBUG
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [libfuse.go (253)]: Libfuse::Configure : libfuse
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse.go (244)]: Libfuse::Validate : UID 501, GID 501
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse.go (299)]: Libfuse::Configure : read-only false, allow-other false, allow-root false, default-perm 420, entry-timeout 300, attr-time 300, negative-timeout 300, ignore-open-flags: true, nonempty false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [file_cache.go (205)]: FileCache::Configure : file_cache
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [file_cache.go (282)]: FileCache::Configure : Using default eviction policy
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [file_cache.go (304)]: FileCache::Configure : create-empty false, cache-timeout 300, tmp-path /mnt/blobfusetmp, max-size-mb 0, high-mark 80, low-mark 60, refresh-sec 0
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [azstorage.go (84)]: AzStorage::Configure : azstorage
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [config.go (284)]: ParseAndValidateConfig : Parsing config
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [config.go (388)]: ParseAndValidateConfig : using the following proxy address from the config file:
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [config.go (392)]: ParseAndValidateConfig : sdk logging from the config file: false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [config.go (499)]: ParseAndReadDynamicConfig : Reparsing config
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [config.go (399)]: ParseAndValidateConfig : Getting auth type
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [config.go (485)]: ParseAndValidateConfig : Account: pacacust02weassetse6dfb2, Container: pacacust02wetestorablob-backup, AccountType: BLOCK, Auth: KEY, Prefix: , Endpoint: https://pacacust02weassetse6dfb2.blob.core.windows.net/, ListBlock: 0, MD5 : false false, Virtual Directory: true, Max Results For List 2, Disable Compression: false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [config.go (489)]: ParseAndValidateConfig : Retry Config: Retry count 5, Max Timeout 900, BackOff Time 4, Max Delay 60
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [config.go (492)]: ParseAndValidateConfig : Telemetry :
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [block_blob.go (173)]: BlockBlob::SetupPipeline : Setting up
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [block_blob.go (135)]: BlockBlob::getCredential : Getting credential
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [azauth.go (80)]: azAuth::getAzAuth : Account: pacacust02weassetse6dfb2, AccountType: BLOCK, Protocol: https, Endpoint: https://pacacust02weassetse6dfb2.blob.core.windows.net/
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [block_blob.go (260)]: BlockBlob::SetPrefixPath : path
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [mount.go (412)]: mount: Mounting blobfuse2 on /fuse-backup
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [mount.go (450)]: mount: foreground disabled, child = true
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [mount.go (541)]: Mount::runPipeline : blobfuse2 pid = 104707, transfer pipe = /tmp/transferPipe_104707, polling pipe = /tmp/pollPipe_104707
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [azstorage.go (165)]: AzStorage::Start : Starting component azstorage
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [file_cache.go (147)]: Starting component : file_cache
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [lru_policy.go (112)]: lruPolicy::StartPolicy
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [cache_policy.go (85)]: cachePolicy::getCacheUsage : /mnt/blobfusetmp
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [cache_policy.go (143)]: cachePolicy::getCacheUsage : current cache usage : 0.003906MB
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [lru_policy.go (138)]: lruPolicy::StartPolicy : Policy set with 300 timeout
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [libfuse.go (150)]: Libfuse::Start : Starting component libfuse
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [libfuse_handler.go (119)]: Libfuse::initFuse : Initializing FUSE3
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [libfuse_handler.go (162)]: Libfuse::initFuse : Registering fuse callbacks
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [libfuse_handler.go (166)]: Libfuse::initFuse : Populating fuse arguments
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [libfuse_handler.go (190)]: Libfuse::populateFuseArgs
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [libfuse_handler.go (231)]: Libfuse::populateFuseArgs : opts : blobfuse2
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [libfuse_handler.go (231)]: Libfuse::populateFuseArgs : opts : /fuse-backup
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [libfuse_handler.go (231)]: Libfuse::populateFuseArgs : opts : -o
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [libfuse_handler.go (231)]: Libfuse::populateFuseArgs : opts : entry_timeout=300,attr_timeout=300,negative_timeout=300
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [lru_policy.go (276)]: lruPolicy::ClearCache
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [libfuse_handler.go (231)]: Libfuse::populateFuseArgs : opts : -f
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [libfuse_handler.go (231)]: Libfuse::populateFuseArgs : opts : -ofsname=blobfuse2
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_DEBUG [libfuse_handler.go (231)]: Libfuse::populateFuseArgs : opts : -okernel_cache
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (178)]: Libfuse::initFuse : Mounting with fuse3 library
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_TRACE [libfuse_handler.go (251)]: Libfuse::libfuse_init : init
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (253)]: Libfuse::NotifyMountToParent : Notifying parent for successful mount
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (260)]: Libfuse::libfuse_init : Kernel Caps : 10485723
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (267)]: Libfuse::libfuse_init : Enable Capability : FUSE_CAP_PARALLEL_DIROPS
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (273)]: Libfuse::libfuse_init : Enable Capability : FUSE_CAP_AUTO_INVAL_DATA
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (280)]: Libfuse::libfuse_init : Enable Capability : FUSE_CAP_READDIRPLUS
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (286)]: Libfuse::libfuse_init : Enable Capability : FUSE_CAP_ASYNC_READ
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (292)]: Libfuse::libfuse_init : Enable Capability : FUSE_CAP_SPLICE_WRITE
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104707] : LOG_INFO [libfuse_handler.go (304)]: Libfuse::libfuse_init : Enable Capability : FUSE_CAP_WRITEBACK_CACHE
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [mount.go (464)]: mount: Child [104707] mounted successfully at /fuse-backup

I notice this part:

Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [config.go (485)]: ParseAndValidateConfig : Account: pacacust02weassetse6dfb2, Container: pacacust02wetestorablob-backup, AccountType: BLOCK, Auth: KEY, Prefix: , Endpoint: https://pacacust02weassetse6dfb2.blob.core.windows.net/, ListBlock: 0, MD5 : false false, Virtual Directory: true, Max Results For List 2, Disable Compression: false
Mon Aug 21 07:24:44 UTC 2023 : blobfuse2[104698] : LOG_INFO [config.go (489)]: ParseAndValidateConfig : Retry Config: Retry count 5, Max Timeout 900, BackOff Time 4, Max Delay 60

The retry count, max timeout etc.

In the previous log the request says:

Sat Aug 19 13:52:42 UTC 2023 : blobfuse2[6036] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102491_1_2r24779n_20230819.incr0?blockid=UzPry3AyRIpBa%2BqlTdVWqA%3D%3D&comp=block&timeout=901": use of closed network connection]

and

Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0?timeout=901": context canceled]

@mortenjoenby
Author

Question:
Would it be better to use streaming mode in this case instead of file_cache ?

@vibhansa-msft
Member

Attr-Cache:

Here Blobfuse will cache the metadata of each blob, like size, LMT, etc. This is cached in process (no disk space used) and is returned to the kernel when it asks for it. Getting the attributes of a file or directory is a quite frequent operation in the kernel (there will be calls to this before every open/write), and if caching is disabled then each of these calls results in a ListBlobs API call to storage (which is a charged API). Thus, having this cache in blobfuse will reduce your cost and network delays.
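
A minimal sketch of what enabling it could look like (component and option name as in the blobfuse2 config; the timeout value is only an example):

attr_cache:
  # how long cached attributes are considered valid, in seconds (example value)
  timeout-sec: 300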

Streaming:

Streaming as a solution shall be used only when you do not have enough disk space. For example, let's say you have a 1 TB file and you want to read/process it, but you do not have that much free disk space. In such cases streaming is recommended; however, since data is downloaded on demand, this solution will be quite slow. It is useful in workflows where you either want to deal with fairly large files, or you wish to read just some portion of a file and do not wish to download the entire contents.
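
For reference, a rough sketch of a streaming pipeline, where stream replaces file_cache (names as in the blobfuse2 sample configs; the sizes are illustrative assumptions, not tuned values):

components:
  - libfuse
  - stream
  - azstorage

stream:
  # size of each block fetched/uploaded on demand (example value)
  block-size-mb: 8
  # per-handle in-memory buffer size and count (example values)
  buffer-size-mb: 8
  max-buffers: 30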

Retry:

The default retry configuration shall be left as-is unless there is a very specific use case which requires tuning it. Here, in your case, it is most likely network flakiness that is causing the connection terminations. So you should be looking at why the connections are not stable. It might be some sort of throttling happening at your account level, which is causing the backend server to close the connections.
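
If tuning does become necessary, these are the knobs under azstorage (names per the blobfuse2 config reference; the values shown are the defaults from your mount log, "Retry count 5, Max Timeout 900, BackOff Time 4, Max Delay 60"):

azstorage:
  max-retries: 5
  max-retry-timeout-sec: 900
  retry-backoff-sec: 4
  max-retry-delay-sec: 60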

@mortenjoenby
Author

I have found various blog posts from people doing this - e.g. https://www.bluecrystal.com.au/news/the-options-for-oracle-backup-in-azure-blobfuse/ - so I am wondering why we have issues. Most of the examples I find, though, seem to be using Blobfuse v1.4/1.3.

You refer to an unstable connection. Perhaps. Our VM and blob container are both in the same region (West Europe).
Do you suggest that I create a support ticket with Azure Support?

(BTW, what does "LMT" mean :)) ?

@vibhansa-msft
Member

LMT: last modified time.
There are failures in the logs because the connection was reset by the server abruptly. This needs investigation from the backend side.

@mortenjoenby
Author

@vibhansa-msft , do you have some good references internally or externally, or examples of others doing Oracle backups using Blobfuse?

@mortenjoenby
Author

Looking at libfuse_handler.go, it seems that there's no retry mechanism implemented. Why?

@vibhansa-msft
Member

There are layers in Blobfuse; libfuse is not the one responsible for retries. Retries are done in the azstorage component.

@mortenjoenby
Author

So @vibhansa-msft, when seeing this:

Sat Aug 19 15:02:25 UTC 2023 : blobfuse2[15756] : LOG_ERR [libfuse_handler.go (816)]: Libfuse::libfuse_truncate : error truncating file step/0/2023-08-18_1256/STEP_102513_1_3h247bea_20230819.incr0 [read tcp 10.100.0.69:34936->20.60.250.33:443: read: connection reset by peer]

We can't tell if it's actually retrying. Is it?

@vibhansa-msft
Member

vibhansa-msft commented Aug 21, 2023

We need to understand why the connection is being reset by the backend server here.
A connection reset is not a retryable error from our end; only 5xx error codes returned by the server are retried.

@mortenjoenby
Author

I have opened a support case with Azure Support, hoping to get some assistance.
But they are already saying that there is no "outage" seen on the storage account ...

@vibhansa-msft
Member

In your config, can you enable "sdk_trace: true" under the "azstorage" section? This will start dumping the REST calls we are making, and in those there will be a request-id field printed. For any failed request we can use this field's value to trace logs in the backend, so the support team can help once you provide it.
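
For reference, a sketch of where that goes, based on the azstorage section shared above (the option name is as quoted here, which I have not verified further; everything else is unchanged):

azstorage:
  type: block
  account-name: mybackup
  account-key: **********
  endpoint: https://mybackup.blob.core.windows.net
  mode: key
  container: wetestorablob-backup
  # dump REST request/response traces (including the request-id) for backend correlation
  sdk_trace: true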

@mortenjoenby
Author

Failed again (timeout 10 secs):

Oracle RMAN error:

channel C1: starting piece 1 at 21-AUG-2023 12:30:48
released channel: C1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 08/21/2023 12:32:23
ORA-19510: failed to set size of 1047685 blocks for file "/fuse-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0" (block size=8192)
ORA-27059: could not reduce file size
Linux-x86_64 Error: 5: Input/output error
Additional information: 2

So the truncate is not the cause of this? But the truncate fails due to "connection reset by peer"?
Is that correct?

Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_TRACE [block_blob.go (540)]: BlockBlob::List : prefix step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0, marker
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_TRACE [block_blob.go (499)]: BlockBlob::getAttrUsingList : Item 0 Blob STEP_102594_1_6224cbbo_20230821.incr0
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0?timeout=901": context canceled]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [block_blob.go (756)]: BlockBlob::ReadBuffer : Failed to download blob step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0 [read tcp 10.100.0.69:57506->20.60.250.33:443: read: connection reset by peer]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [block_blob.go (1065)]: BlockBlob::TruncateFile : Failed to read small file step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0%!(EXTRA string=read tcp 10.100.0.69:57506->20.60.250.33:443: read: connection reset by peer)
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [file_cache.go (1325)]: FileCache::TruncateFile : step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0 failed to truncate [read tcp 10.100.0.69:57506->20.60.250.33:443: read: connection reset by peer]
Mon Aug 21 12:31:31 UTC 2023 : blobfuse2[115403] : LOG_ERR [libfuse_handler.go (816)]: Libfuse::libfuse_truncate : error truncating file step/0/2023-08-21_1125/STEP_102594_1_6224cbbo_20230821.incr0 [read tcp 10.100.0.69:57506->20.60.250.33:443: read: connection reset by peer]

@mortenjoenby
Author

Not sure if this is comparable, but s3fs seems to be using this:

S3_RETRYABLE_ERRORS = (
    socket.timeout,
    HTTPClientError,
    IncompleteRead,
    FSTimeoutError,
    ResponseParserError,
)
https://github.com/fsspec/s3fs/blob/39125f79051c55fc715cf056d8a2d0c7aa9d1c4b/s3fs/core.py#L63

@mortenjoenby
Author

So, Azure Support is telling me that I am exhausting the disk (P30 - 200 MBps) and that this might cause the problem!?
Is that something you can confirm?
I honestly doubt it, but if that's the case, I believe Blobfuse2 should be able to handle it and let me throttle the bandwidth.
I am confused ... BUT, I will try changing to a larger disk to see what happens, just to prove (or disprove) the case.
It's at least not a realistic scenario to increase all database disks on all our databases to solve such an issue.

@vibhansa-msft
Member

Blobfuse, being a client-side tool, has no idea of your disk quota or IOPS limits in the backend. Unless the backend returns specific information, blobfuse has no way to show or throttle anything. From the blobfuse end, when an operation is throttled it will just retry after taking a backoff, but it has no way of telling why the backend decided to throttle. In some scenarios the backend returns specific errors, and those errors can be dug out from our sdk-trace logs.

@mortenjoenby
Author

mortenjoenby commented Aug 22, 2023

Hi @vibhansa-msft.
I appreciate your quick responses! I am not sure I fully understand... I can't figure out whether exhausting the disk can cause Blobfuse to fail or not. In other words, does it help at all to increase disk size/performance? It should not be the solution, at least.
And if Blobfuse cannot be throttled, how can I (try to) avoid the exhaustion, which could potentially hit the performance of the database?
You also mention the sdk-trace logs. Is that something I can enable? Or are you referring to the debug logging I already have enabled?

@mortenjoenby
Author

I am trying to run the exact same backup using the Azure file share - just to see the load/performance.
This is not exhausting the disk (using ~100 MBps), but each backup piece takes half the time compared to using Blobfuse2.

@mortenjoenby
Author

I have found several articles/blog posts mentioning the use of Blobfuse for Oracle backups - https://bluexp.netapp.com/blog/azure-cvo-blg-oracle-rman-backup-on-azure-blob - but perhaps it worked better with Blobfuse v1?
I really need some assistance in figuring out whether this is a Blobfuse2 problem.

@mortenjoenby
Author

@vibhansa-msft, I did not see your comment on sdk-trace yesterday. Apologies.
I am running a backup now with that enabled.

@mortenjoenby
Author

I now have the error again with sdk-tracing enabled:

channel C1: starting piece 1 at 22-AUG-2023 08:45:14
released channel: C1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 08/22/2023 08:46:39
ORA-19510: failed to set size of 1005271 blocks for file "/fuse-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0" (block size=8192)
ORA-27059: could not reduce file size
Linux-x86_64 Error: 5: Input/output error
Additional information: 2

sdk-trace:

Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?timeout=1801": context canceled]
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?timeout=1801": context canceled]
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_DEBUG [utils.go (240)]: SDK : ==> REQUEST/RESPONSE (Try=1/89.786663ms, OpTime=89.866349ms) -- REQUEST ERROR
   GET https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?timeout=1801
   Authorization: REDACTED
   User-Agent: [Azure-Storage-Fuse/2.0.5 (Oracle Linux Server 8.7) Azure-Storage/0.15 (go1.20.5; linux)]
   X-Ms-Client-Request-Id: [512c7bb3-ad61-4d9e-698d-661f983d64d0]
   X-Ms-Range: [bytes=50331648-54525951]
   X-Ms-Version: [2020-10-02]
   x-ms-date: [Tue, 22 Aug 2023 08:45:55 GMT]
   --------------------------------------------------------------------------------
   ERROR:
-> github.com/Azure/azure-pipeline-go/pipeline.NewError, /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/error.go:157
HTTP request failed

Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?timeout=1801": context canceled


Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_DEBUG [utils.go (240)]: SDK : ==> REQUEST/RESPONSE (Try=1/90.071353ms, OpTime=90.122083ms) -- REQUEST ERROR
   GET https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?timeout=1801
   Authorization: REDACTED
   User-Agent: [Azure-Storage-Fuse/2.0.5 (Oracle Linux Server 8.7) Azure-Storage/0.15 (go1.20.5; linux)]
   X-Ms-Client-Request-Id: [9ec1475d-1a88-4d2e-6213-8aed48256b89]
   X-Ms-Range: [bytes=25165824-29360127]
   X-Ms-Version: [2020-10-02]
   x-ms-date: [Tue, 22 Aug 2023 08:45:55 GMT]
   --------------------------------------------------------------------------------
   ERROR:
-> github.com/Azure/azure-pipeline-go/pipeline.NewError, /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/error.go:157
HTTP request failed

Get "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?timeout=1801": context canceled


Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [block_blob.go (756)]: BlockBlob::ReadBuffer : Failed to download blob step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0 [read tcp 10.100.0.69:48084->20.60.250.33:443: read: connection reset by peer]
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [block_blob.go (1065)]: BlockBlob::TruncateFile : Failed to read small file step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0%!(EXTRA string=read tcp 10.100.0.69:48084->20.60.250.33:443: read: connection reset by peer)
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [file_cache.go (1325)]: FileCache::TruncateFile : step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0 failed to truncate [read tcp 10.100.0.69:48084->20.60.250.33:443: read: connection reset by peer]
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [libfuse_handler.go (816)]: Libfuse::libfuse_truncate : error truncating file step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0 [read tcp 10.100.0.69:48084->20.60.250.33:443: read: connection reset by peer]
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_TRACE [libfuse_handler.go (1120)]: Libfuse::libfuse_utimens : step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_TRACE [file_cache.go (1206)]: FileCache::GetAttr : step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0

@mortenjoenby
Author

Using Blobfuse2 is also documented as one possible solution here:
https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-rman-streaming-backup

@mortenjoenby
Author

Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [utils.go (206)]: BlockBlob::newBlobfuse2HTTPClientFactory : HTTP request failed [Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?blockid=flDSiWkkQBJ8C%2BfoxhZiZA%3D%3D&comp=block&timeout=1801": write tcp 10.100.0.69:48046->20.60.250.33:443: write: connection reset by peer]
Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_ERR [utils.go (234)]: SDK : ==> REQUEST/RESPONSE (Try=1/4.199146ms, OpTime=23.098575ms) -- REQUEST ERROR
   PUT https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?blockid=flDSiWkkQBJ8C%2BfoxhZiZA%3D%3D&comp=block&timeout=1801
   Authorization: REDACTED
   Content-Length: [4194304]
   User-Agent: [Azure-Storage-Fuse/2.0.5 (Oracle Linux Server 8.7) Azure-Storage/0.15 (go1.20.5; linux)]
   X-Ms-Client-Request-Id: [96826930-7241-43cc-4cfc-7899b12ab247]
   X-Ms-Version: [2020-10-02]
   x-ms-date: [Tue, 22 Aug 2023 08:45:55 GMT]
   --------------------------------------------------------------------------------
   ERROR:
-> github.com/Azure/azure-pipeline-go/pipeline.NewError, /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/error.go:157
HTTP request failed

Put "https://pacacust02weassetse6dfb2.blob.core.windows.net/pacacust02wetestorablob-backup/step/0/2023-08-21_1125/STEP_102675_1_8j24eigq_20230822.incr0?blockid=flDSiWkkQBJ8C%2BfoxhZiZA%3D%3D&comp=block&timeout=1801": write tcp 10.100.0.69:48046->20.60.250.33:443: write: connection reset by peer

goroutine 1387 [running]:
github.com/Azure/azure-storage-azcopy/v10/ste.stack()
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-azcopy/v10@v10.19.1-0.20230717101935-ab8ff0a85e48/ste/xferLogPolicy.go:237 +0x65
github.com/Azure/azure-storage-azcopy/v10/ste.NewRequestLogPolicyFactory.func1.1({0x100dbf8, 0xc00147d110}, {0xc001339520?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-azcopy/v10@v10.19.1-0.20230717101935-ab8ff0a85e48/ste/xferLogPolicy.go:151 +0xa29
github.com/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0x416fd7?, {0x100dbf8?, 0xc00147d110?}, {0x7f8df8358a68?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/core.go:43 +0x2f
github.com/Azure/azure-storage-blob-go/azblob.responderPolicy.Do({{0x1007680?, 0xc0006180a0?}, 0xc00110db80?}, {0x100dbf8?, 0xc00147d110?}, {0xc000562420?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/zz_generated_responder_policy.go:33 +0x44
github.com/Azure/azure-storage-blob-go/azblob.(*SharedKeyCredential).New.func1({0x100dbf8, 0xc00147d110}, {0x1a3185c5000?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/zc_credential_shared_key.go:66 +0x3c3
github.com/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0x100db88?, {0x100dbf8?, 0xc00147d110?}, {0x7?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/core.go:43 +0x2f
github.com/Azure/azure-storage-azcopy/v10/ste.NewBlobXferRetryPolicyFactory.func1.1({0x100db88, 0xc000ecf130}, {0xc000562160?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-azcopy/v10@v10.19.1-0.20230717101935-ab8ff0a85e48/ste/xferRetrypolicy.go:383 +0xa57
github.com/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xdf5240?, {0x100db88?, 0xc000ecf130?}, {0x16?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/core.go:43 +0x2f
github.com/Azure/azure-storage-blob-go/azblob.NewUniqueRequestIDPolicyFactory.func1.1({0x100db88, 0xc000ecf130}, {0xc000562160?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/zc_policy_unique_request_id.go:22 +0x191
github.com/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xdf5240?, {0x100db88?, 0xc000ecf130?}, {0xa?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/core.go:43 +0x2f
github.com/Azure/azure-storage-blob-go/azblob.NewTelemetryPolicyFactory.func1.1({0x100db88, 0xc000ecf130}, {0xc000a42870?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/zc_policy_telemetry.go:34 +0x11c
github.com/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xc0005da100?, {0x100db88?, 0xc000ecf130?}, {0x0?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/core.go:43 +0x2f
github.com/Azure/azure-pipeline-go/pipeline.(*pipeline).Do(0xc00003c440?, {0x100db88, 0xc000ecf130}, {0x1007740?, 0xc00110db80?}, {0xc00003c448?})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.4-0.20220425205405-09e6f201e1e4/pipeline/core.go:129 +0x54
github.com/Azure/azure-storage-blob-go/azblob.blockBlobClient.StageBlock({{{{0xc00003c440, 0x5}, {0x0, 0x0}, 0x0, {0xc00003c448, 0x2e}, {0xc001167500, 0x5c}, {0x0, ...}, ...}, ...}}, ...)
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/zz_generated_block_blob.go:509 +0x478
github.com/Azure/azure-storage-blob-go/azblob.BlockBlobURL.StageBlock({{{{{...}, {...}}}}, {{{{...}, {...}, 0x0, {...}, {...}, {...}, _, _, ...}, ...}}}, ...)
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/url_block_blob.go:97 +0x1cc
github.com/Azure/azure-storage-blob-go/azblob.uploadReaderAtToBlockBlob.func1(0x2800000, 0x400000, {0x100db88, 0xc000ecf130})
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/highlevel.go:136 +0x3e5
github.com/Azure/azure-storage-blob-go/azblob.DoBatchTransfer.func2()
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/highlevel.go:347 +0x35
github.com/Azure/azure-storage-blob-go/azblob.DoBatchTransfer.func1()
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/highlevel.go:331 +0x44
created by github.com/Azure/azure-storage-blob-go/azblob.DoBatchTransfer
        /home/cloudtest/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.15.0/azblob/highlevel.go:329 +0x250

Tue Aug 22 08:45:55 UTC 2023 : blobfuse2[164407] : LOG_INFO [utils.go (238)]: SDK : ==> REQUEST/RESPONSE (Try=1/94.766006ms, OpTime=103.471821ms) -- RESPONSE SUCCESSFULLY RECEIVED

@mortenjoenby (Author)

blobfuse2_20230822.zip

@vibhansa-msft (Member)

Enabling the SDK trace will help if the backend returns a specific error code and message showing signs of throttling. If the connection just gets aborted, we will not get much from the logs; in the errors above I see "connection reset by peer", which means we receive no error code from storage - the connection is simply dropped abruptly. That can happen for various reasons, and on the blobfuse end we cannot assume throttling is the cause. If the backend team says it is being throttled, they have the logs to prove it; there is nothing we can debug from the blobfuse side.

When you say things work fine with Files, that may be because it is a different backend storage altogether. Files and blobs cannot be compared, as they are handled differently in the backend. If we are observing that blobfuse is getting throttled, you can try another tool such as AzCopy to download a good volume of data from the storage account and check whether that gets throttled as well (just to validate the throttling theory). If you increase some of the caching timeouts, blobfuse will make fewer calls to storage, and that might avoid some of these throttling errors.
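
For illustration, a download test of the kind suggested above could look like the sketch below (the account, container, backup path, SAS token, and local target directory are all placeholders, not values from this setup):

# Hypothetical AzCopy bulk-download test: pull a large backup folder from the
# same storage account and watch for throttling or connection resets.
azcopy copy "https://<account>.blob.core.windows.net/<container>/step/0/<backup-dir>?<sas-token>" /data/azcopy-test --recursive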

@mortenjoenby (Author)

@vibhansa-msft.
To be honest, we are getting confused... Running the above setup with 2 RMAN channels and max-concurrency: 16 succeeded.
Then, going back to the default (1 RMAN channel, max-concurrency: 32), it succeeded too!?
There's no consistency.

What I do see with the default today vs. Friday is that we are not getting the same throughput. Friday ~1420 MBps. Today ~1250 MBps.

@mortenjoenby (Author)

@vibhansa-msft, it's really strange and honestly very confusing.
Yesterday I was 110% sure we had found the problem, but then we ran tests with 64 threads, and also tests with RMAN running two channels (parallel backups).
We were NOT able to get it to fail. Since last night we have run the default setup with 32 threads over and over again. Not a single error.
Friday was the same setup, and we had errors ALL the time.
I don't get it, and I will request (through our CSAM) that the Azure Support team work with you guys to determine what the problem was (is?).

@vibhansa-msft (Member)

When you run with all these different configs, do you check the graphs to see where you land in terms of bandwidth utilization?
For a higher number of threads, like 64 or more, this is expected. You have a 32-core VM and are trying to run 64 threads in parallel doing upload/download work. There will be resource contention, threads have to wait for the resources they need, and in effect that adversely impacts overall performance. For the 32-thread case, where we were hitting the error early, we should recheck the graph to see whether it still hits the bandwidth upper limit or not.

@vibhansa-msft (Member)

When we say it failed, it should be correlated back to overall bandwidth usage or backend throttling kicking in. Sometimes even the same config may not hit exactly the same issue: if responses from the backend are slow, we will not be able to post the next requests at the same pace.

@mortenjoenby (Author)

When you run with all these different configs, do you check the graphs to see where you land in terms of bandwidth utilization? […]

We started checking the graphs later, but according to Azure Support (when we asked them to check yesterday), we are NOT hitting the VM bandwidth limit.

The VM we are using is a D8ds_v5 with 8 vCPUs, not 32.

@vibhansa-msft (Member)

In that case we should keep the max-concurrency value lower, maybe 8 or even less. One thing you can try is setting this value to '6' and increasing the block size to 16MB, to reduce the number of independent requests going to storage.
Also, add that new ignore-sync option that I shared as a private binary.
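
For illustration, the relevant part of a blobfuse2 config with those values might look like the sketch below (key names as in the blobfuse2 sample configs; the account and container are placeholders):

# azstorage section with the suggested tuning: fewer parallel transfer
# threads (max-concurrency) and larger staged blocks (block-size-mb).
azstorage:
  type: block
  account-name: <account>
  container: <container>
  max-concurrency: 6
  block-size-mb: 16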

@mortenjoenby (Author)

@vibhansa-msft, what's the background for suggesting this change when everything seems to be running fine (except for when it doesn't :)) ?
And we already increased the block size to 16MB at your request on Wednesday last week.
I will test the "--ignore-sync" option later today or tomorrow, but we have a meeting later today (also with you), and I don't want to change anything before that.

@vibhansa-msft (Member)

Yes, that's fine. I was suggesting it based on the VM config: you have only 8 vCPUs, so even having 8 threads may create resource contention at times.

@mortenjoenby (Author)

It's also our own "feeling" that this makes sense.
But we are looking for some kind of explanation for the problems :)

@vibhansa-msft (Member)

Agreed, but this is beyond blobfuse's control as well. The network team needs to provide input on what has changed, or why we are not able to hit those numbers again, as from the blobfuse end it's the same config again. One way to boost performance from our side is to disable sdk-trace and reset the log level to LOG_ERR. This will reduce the logging overhead and put more stress on the network, and you might hit the same network limits again. Though this is just an experiment and does not prove much.
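
As a sketch, those two logging changes would look roughly like this in the config (option names per the blobfuse2 sample config):

# Errors only, and no SDK request/response tracing, to cut logging overhead.
logging:
  level: log_err
azstorage:
  sdk-trace: false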

@mortenjoenby (Author)

@vibhansa-msft, I just installed the preview and mounted with the --ignore-sync option.
A backup is running, and I am trying with a timeout-sec of 7200 seconds.
I will let you know if it works - with the files kept in cache, "restore validate" should avoid downloading them again.
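
For context, the mount being tested here would look roughly like the following (the mount point and config path are placeholders; --ignore-sync is the preview flag under discussion):

# Hypothetical mount with the preview binary: --ignore-sync turns fsync into
# a no-op so completed files stay in the local cache instead of being evicted
# on sync; timeout-sec in the config's file_cache section is set to 7200 here.
blobfuse2 mount /fuse-backup --config-file=/etc/blobfuse2/config.yaml --ignore-sync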

@vibhansa-msft (Member)

Any interesting findings with the new binary and the ignore-sync option?

@mortenjoenby (Author)

Hi @vibhansa-msft.
Files are kept in the cache, as shown:

[oracle@pacacust02testord52941c5-db-1 blobfuse]$ ls -ltr /mnt/blobfusetmp/step/0/2023-09-06_1835/
total 119533752
-rw-r--r--. 1 oracle dba         21 Sep  6 18:35 resume-backup-2023-09-06_1835-incr0-4255136787
-rw-r-----. 1 oracle dba  461635584 Sep  6 18:35 STEP_111062_1_em25n6mt_20230906.incr0
-rw-r-----. 1 oracle dba   34668544 Sep  6 18:35 STEP_111063_1_en25n6n4_20230906.incr0
-rw-r-----. 1 oracle dba   34668544 Sep  6 18:35 STEP_111064_1_eo25n6n7_20230906.incr0
-rw-r-----. 1 oracle dba   34668544 Sep  6 18:35 STEP_111065_1_ep25n6na_20230906.incr0
-rw-r-----. 1 oracle dba 4296073216 Sep  6 18:35 STEP_111066_1_eq25n6nh_20230906.incr0
-rw-r-----. 1 oracle dba    1843200 Sep  6 18:35 STEP_111067_1_er25n6oa_20230906.incr0
-rw-r-----. 1 oracle dba 8582643712 Sep  6 18:36 STEP_111068_1_es25n6od_20230906.incr0
-rw-r-----. 1 oracle dba 8254603264 Sep  6 18:37 STEP_111069_1_et25n6pq_20230906.incr0
-rw-r-----. 1 oracle dba 8255496192 Sep  6 18:37 STEP_111070_1_eu25n6r8_20230906.incr0
-rw-r-----. 1 oracle dba 8255496192 Sep  6 18:38 STEP_111071_1_ev25n6sv_20230906.incr0
-rw-r-----. 1 oracle dba 8582643712 Sep  6 18:39 STEP_111072_1_f025n6uc_20230906.incr0
-rw-r-----. 1 oracle dba 8255496192 Sep  6 18:40 STEP_111073_1_f125n6vp_20230906.incr0
-rw-r-----. 1 oracle dba 8582643712 Sep  6 18:40 STEP_111074_1_f225n716_20230906.incr0
-rw-r-----. 1 oracle dba 8255496192 Sep  6 18:41 STEP_111075_1_f325n72j_20230906.incr0
-rw-r-----. 1 oracle dba 8582643712 Sep  6 18:42 STEP_111076_1_f425n740_20230906.incr0
-rw-r-----. 1 oracle dba 8255496192 Sep  6 18:43 STEP_111077_1_f525n75d_20230906.incr0
-rw-r-----. 1 oracle dba 8582643712 Sep  6 18:43 STEP_111078_1_f625n76q_20230906.incr0
-rw-r-----. 1 oracle dba 8255496192 Sep  6 18:44 STEP_111079_1_f725n787_20230906.incr0
-rw-r-----. 1 oracle dba 8582643712 Sep  6 18:45 STEP_111080_1_f825n79k_20230906.incr0
-rw-r-----. 1 oracle dba 8255496192 Sep  6 18:46 STEP_111081_1_f925n7b1_20230906.incr0

But it seems that files are still downloaded again for the "restore validation" part.
However, the backup in this case consists of more than 55 files of 8GB each. The cache disk (/mnt) is 300GB, so the high threshold at 80% is ~240GB. Divided by 8GB, that's ~30 files. So there's no way the current backup will ever fit completely in cache. To test that part properly, I will need a smaller database/backup that I know will fit within the cache size.
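
The same arithmetic can be read off the file_cache settings; a sketch with the values assumed above (the exact thresholds configured in this setup are not confirmed in the thread):

# file_cache section behind the calculation: on a 300GB disk, an 80% high
# threshold starts eviction around 240GB, i.e. roughly 30 of the 8GB pieces.
file_cache:
  path: /mnt/blobfusetmp
  timeout-sec: 7200
  high-threshold: 80
  low-threshold: 60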

There's so much logging that it's difficult to tell whether some of the files are served directly from cache rather than downloaded again.

But thinking about it: when doing the restore validate, it will start to download what's not in cache, which will also fill the cache up and then evict the oldest (LRU?) files.

Files with timestamp 19:19 were the last files backed up, but the re-downloaded files (18:35-18:57) accumulate quickly, and then the oldest from the backup (19:19) get evicted:

[oracle@pacacust02testord52941c5-db-1 blobfuse]$ ls -ltr /mnt/blobfusetmp/step/0/2023-09-06_1835/
total 242773800
-rwxr-xr-x. 1 oracle dba  461635584 Sep  6 18:35 STEP_111062_1_em25n6mt_20230906.incr0
-rwxr-xr-x. 1 oracle dba   34668544 Sep  6 18:35 STEP_111063_1_en25n6n4_20230906.incr0
-rwxr-xr-x. 1 oracle dba   34668544 Sep  6 18:35 STEP_111064_1_eo25n6n7_20230906.incr0
-rwxr-xr-x. 1 oracle dba   34668544 Sep  6 18:35 STEP_111065_1_ep25n6na_20230906.incr0
-rwxr-xr-x. 1 oracle dba 4296073216 Sep  6 18:35 STEP_111066_1_eq25n6nh_20230906.incr0
-rwxr-xr-x. 1 oracle dba    1843200 Sep  6 18:35 STEP_111067_1_er25n6oa_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:36 STEP_111068_1_es25n6od_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8254603264 Sep  6 18:37 STEP_111069_1_et25n6pq_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:38 STEP_111070_1_eu25n6r8_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:39 STEP_111071_1_ev25n6sv_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:39 STEP_111072_1_f025n6uc_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:40 STEP_111073_1_f125n6vp_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:41 STEP_111074_1_f225n716_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:42 STEP_111075_1_f325n72j_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:42 STEP_111076_1_f425n740_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:43 STEP_111077_1_f525n75d_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:44 STEP_111078_1_f625n76q_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:45 STEP_111079_1_f725n787_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:45 STEP_111080_1_f825n79k_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:46 STEP_111081_1_f925n7b1_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:47 STEP_111082_1_fa25n7ce_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:48 STEP_111083_1_fb25n7e5_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:49 STEP_111084_1_fc25n7fi_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:49 STEP_111085_1_fd25n7h9_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:50 STEP_111086_1_fe25n7im_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:51 STEP_111087_1_ff25n7ke_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 18:52 STEP_111088_1_fg25n7lr_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255004672 Sep  6 18:53 STEP_111089_1_fh25n7ni_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:54 STEP_111090_1_fi25n7ov_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:54 STEP_111091_1_fj25n7qc_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:55 STEP_111092_1_fk25n7rp_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:56 STEP_111093_1_fl25n7t6_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:57 STEP_111094_1_fm25n7uj_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 18:57 STEP_111095_1_fn25n800_20230906.incr0
-rw-r-----. 1 oracle dba 1002659840 Sep  6 19:19 STEP_111128_1_go25n9al_20230906.incr0
-rw-r-----. 1 oracle dba    1105920 Sep  6 19:19 STEP_111129_1_gp25n9as_20230906.incr0
-rw-r-----. 1 oracle dba     114688 Sep  6 19:20 STEP_111132_1_gs25n9b6_20230906.spfile
-rw-r-----. 1 oracle dba   68943872 Sep  6 19:20 STEP_111133_1_gt25n9b7_20230906.ctl
-rw-r-----. 1 oracle dba   68861952 Sep  6 19:20 ctlfileback_step_2023-09-06_1835.bin
-rw-r--r--. 1 oracle dba      15480 Sep  6 19:20 ctlfileback_step_2023-09-06_1835.txt
-rw-r--r--. 1 oracle dba       1531 Sep  6 19:20 initstep.ora
-rw-r--r--. 1 oracle dba      12684 Sep  6 19:20 rman_restore_script_step_2023-09-06_1835.rcv
-rw-r--r--. 1 oracle dba    1523265 Sep  6 19:20 alert_step.log
-rw-r--r--. 1 oracle dba      15182 Sep  6 19:20 filelist_step_2023-09-06_1835.lst
-rw-r--r--. 1 oracle dba 8582643712 Sep  6 19:27 STEP_111096_1_fo25n81d_20230906.incr0

After eviction:

[oracle@pacacust02testord52941c5-db-1 blobfuse]$ ls -ltr /mnt/blobfusetmp/step/0/2023-09-06_1835/
total 89066420
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 19:00 STEP_111098_1_fq25n847_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 19:00 STEP_111099_1_fr25n85u_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8582643712 Sep  6 19:01 STEP_111100_1_fs25n87b_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 19:02 STEP_111101_1_ft25n88o_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 19:03 STEP_111102_1_fu25n8a5_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 19:04 STEP_111103_1_fv25n8bt_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8322605056 Sep  6 19:04 STEP_111104_1_g025n8da_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255143936 Sep  6 19:05 STEP_111105_1_g125n8f1_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255217664 Sep  6 19:06 STEP_111106_1_g225n8ge_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255496192 Sep  6 19:07 STEP_111107_1_g325n8i5_20230906.incr0
-rwxr-xr-x. 1 oracle dba 8255381504 Sep  6 19:08 STEP_111108_1_g425n8ji_20230906.incr0

@mortenjoenby (Author)

But I am pretty sure '--ignore-sync' works as expected :)

@mortenjoenby (Author)

Hi @vibhansa-msft.
I believe it works - when using a database backup that fits 100% in cache.
These are the files in cache. The backup started at 14:55:45.
Restore validation started at 15:08:47, and according to the logs it looks to me as if the files are served from cache and not re-downloaded.

[oracle@pacacust02testord52941c5-db-1 blobfuse]$ ls -ltr /mnt/blobfusetmp/step/0/2023-09-07_1455/
total 110807944
-rw-r-----. 1 oracle dba 17172578304 Sep  7 14:57 STEP_60544_1_4025pea4_20230907.incr0
-rw-r-----. 1 oracle dba 17172062208 Sep  7 14:59 STEP_60545_1_4125ped4_20230907.incr0
-rw-r-----. 1 oracle dba 17170587648 Sep  7 15:01 STEP_60546_1_4225pegd_20230907.incr0
-rw-r-----. 1 oracle dba 16620281856 Sep  7 15:02 STEP_60547_1_4325pejc_20230907.incr0
-rw-r-----. 1 oracle dba  7265992704 Sep  7 15:03 STEP_60548_1_4425pemb_20230907.incr0
-rw-r-----. 1 oracle dba  8195833856 Sep  7 15:04 STEP_60549_1_4525peno_20230907.incr0
-rw-r-----. 1 oracle dba   286998528 Sep  7 15:05 STEP_60550_1_4625pep5_20230907.incr0
-rw-r-----. 1 oracle dba  7614529536 Sep  7 15:05 STEP_60551_1_4725pepc_20230907.incr0
-rw-r-----. 1 oracle dba  6507225088 Sep  7 15:06 STEP_60552_1_4825peqp_20230907.incr0
-rw-r-----. 1 oracle dba  6236684288 Sep  7 15:06 STEP_60553_1_4925pert_20230907.incr0
-rw-r-----. 1 oracle dba  4275322880 Sep  7 15:07 STEP_60554_1_4a25pet0_20230907.incr0
-rw-r-----. 1 oracle dba  2143371264 Sep  7 15:07 STEP_60555_1_4b25petp_20230907.incr0
-rw-r-----. 1 oracle dba  1155284992 Sep  7 15:07 STEP_60556_1_4c25peu8_20230907.incr0
-rw-r-----. 1 oracle dba   589234176 Sep  7 15:08 STEP_60557_1_4d25peuf_20230907.incr0
-rw-r-----. 1 oracle dba   982646784 Sep  7 15:08 STEP_60558_1_4e25peum_20230907.incr0
-rw-r-----. 1 oracle dba     1105920 Sep  7 15:08 STEP_60559_1_4f25peut_20230907.incr0
-rw-r-----. 1 oracle dba      114688 Sep  7 15:08 STEP_60575_1_4v25pev9_20230907.spfile
-rw-r-----. 1 oracle dba    37814272 Sep  7 15:08 STEP_60576_1_5025peva_20230907.ctl
-rw-r-----. 1 oracle dba    37732352 Sep  7 15:08 ctlfileback_step_2023-09-07_1455.bin
-rw-r--r--. 1 oracle dba        8956 Sep  7 15:08 ctlfileback_step_2023-09-07_1455.txt
-rw-r--r--. 1 oracle dba        1531 Sep  7 15:08 initstep.ora
-rw-r--r--. 1 oracle dba        8095 Sep  7 15:08 rman_restore_script_step_2023-09-07_1455.rcv
-rw-r--r--. 1 oracle dba     1694228 Sep  7 15:08 alert_step.log
-rw-r--r--. 1 oracle dba        6810 Sep  7 15:08 filelist_step_2023-09-07_1455.lst
-rw-r--r--. 1 oracle dba      145611 Sep  7 15:11 pacacust02testord52941c5-db-1.backend.mdm.stibosystems.com_step_incr0_2023-09-07_1455_2386884.log

blobfuse2_using-cache_20230907.zip
pacacust02testord52941c5-db-1.backend.mdm.stibosystems.com_step_incr0_2023-09-07_1455_2386884.log

@vibhansa-msft (Member)

In these logs I do not see any occurrence of "connection reset" or any sort of failure where blobfuse needed to retry. Looks good to me with the current set of changes. Did you observe anything unusual during this run?

@vibhansa-msft (Member)

I also see log lines like the one below, which indicate we did serve the files from the local cache instead of re-downloading them:

Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_DEBUG [file_cache.go (928)]: FileCache::OpenFile : step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc will be served from cache

@vibhansa-msft (Member)

This is the exact sequence we expect for a file that was synced but not yet removed from cache, so that it can be reused later:

Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [libfuse_handler.go (1040)]: Libfuse::libfuse_fsync : step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, handle: 6
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [file_cache.go (1096)]: FileCache::SyncFile : handle=6, path=step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [libfuse_handler.go (820)]: Libfuse::libfuse_truncate : step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc size 10496908288
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [file_cache.go (1330)]: FileCache::TruncateFile : name=step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, size=10496908288
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [file_cache.go (832)]: FileCache::OpenFile : name=step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, flags=2, mode=-rwxr-xr-x
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [lru_policy.go (201)]: lruPolicy::IsCached : /mnt/blobfusetmp/step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_DEBUG [lru_policy.go (206)]: lruPolicy::IsCached : /mnt/blobfusetmp/step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, deleted:false
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_INFO [file_cache.go (795)]: FileCache::isDownloadRequired : Need to re-download step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, but skipping as handle is already open
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_DEBUG [file_cache.go (928)]: FileCache::OpenFile : step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc will be served from cache
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_INFO [file_cache.go (953)]: FileCache::OpenFile : file=step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, fd=23
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [file_cache.go (961)]: FileCache::CloseFile : name=step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, handle=0
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_INFO [file_cache.go (966)]: FileCache::CloseFile : name=step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc, handle=0 dirty. Flushing the file.
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [file_cache.go (1131)]: FileCache::FlushFile : handle=0, path=step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [attr_cache.go (416)]: AttrCache::CopyFromFile : step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_TRACE [attr_cache.go (467)]: AttrCache::GetAttr : step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc
Thu Sep  7 14:55:47 UTC 2023 : blobfuse2[2386428] : LOG_DEBUG [attr_cache.go (486)]: AttrCache::GetAttr : step/archivelog/2023-09-07_1455_STEP_60542_1_3u25pe7j_20230907.arc served from cache

@mortenjoenby (Author)

In these logs I do not see any occurrence of "connection reset" or any sort of failure where blobfuse needed to retry. […]

I think it works as expected. I did another run with timeout-sec set to 60 seconds to force eviction. I have attached the files.
Can you verify, please?
The difference in time is not very big, but of course it will save some API calls.
As far as I can tell from the RMAN log of the restore validation, a 16GB file took 25 seconds to validate when served from cache and 31 seconds when NOT in cache. So 6 seconds more... That's not a lot. I'm not sure it's 100% correct though, as downloading 16GB in 6 seconds would mean roughly 2.7 GB/s, which sounds unrealistic. But perhaps I can't read it quite like that?
Perhaps you can explain from the logs?

P.S. I added you to the support case mail.

pacacust02testord52941c5-db-1.backend.mdm.stibosystems.com_step_incr0_2023-09-07_1548_2412257.log
blobfuse2_noCache_20230905.zip

@mortenjoenby (Author)

At least I believe we can conclude that ignore-sync works, and you can release it.
Will you do that soon, then?

@vibhansa-msft (Member)

Yes, I have already put it up in a PR, and the feature will be made public in the next release.

@vibhansa-msft (Member)

For the '60'-second timeout part, I can see cache clearance kicking in every minute. If you wish to validate that, you can search for the log line below in the log file and check the time mentioned on each occurrence.

Thu Sep  7 15:48:25 UTC 2023 : blobfuse2[2411843] : LOG_TRACE [lru_policy.go (287)]: lruPolicy::Clear-timeout monitor

At times this line is followed by another line saying there are 1 or 2 files that can be evicted by this sweep, which means the logic is working as expected.
I do not see any disk-high-usage logs, so we are managing well within the available disk space.
For the open calls, most of the logs say either that the file is not there in the local cache, or that it is there and is not going to be re-downloaded.
All of this points to sane behaviour on the blobfuse end. I see no concerns in the logs - no signs of "connection reset" or any sort of retries.
Appears all green from my end.
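
For example, something like the following could be used to pull out those eviction sweeps (the log file path is a placeholder for wherever blobfuse2 is configured to log):

# List every cache-clearance sweep with its timestamp.
grep "lruPolicy::Clear-timeout monitor" /var/log/blobfuse2/blobfuse2.log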

@mortenjoenby (Author)

Yes, I have already put it up in a PR, and the feature will be made public in the next release.

@vibhansa-msft , when should I expect 2.1.1 to be released?

@vibhansa-msft (Member)

As of now, an exact date is not planned. We have just released 2.1.0, so you can expect at least 2 months before the next release.
The only thing we can do is give you a private binary if you wish.

@mortenjoenby (Author)

As of now, an exact date is not planned. We have just released 2.1.0, so you can expect at least 2 months before the next release. […]

@vibhansa-msft, we have the 2.1.1-preview.1. I guess we can use that, no?
Otherwise, as mentioned, the effect was not huge, so we could also just use 2.1.0 for now.

@mortenjoenby (Author)

One last question, on the health monitor.
I can't seem to find a way to control how many JSON files are generated and how big they are. How are they cleaned up?
And do you have some usage examples?

@vibhansa-msft (Member)

@souravgupta-msft, can you provide the required info on the health monitor here?
@mortenjoenby, can you share how you plan to use the health monitor, or which stats you wish to measure with it?

@souravgupta-msft (Member)

@mortenjoenby, you can refer to the health monitor config here. I would like to mention that the network profiler is not currently implemented, and that the health monitor is a preview feature.
Regarding the output JSON files: the max size of each file is 10MB, and the count is capped at 10. The output files are rotated when the latest file grows beyond 10MB, and the oldest file is deleted once there are more than 10 files. The size and count are not configurable as of now.
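
For reference, the health monitor is switched on through a config section along these lines (key names follow the blobfuse2 sample config, but since the feature is in preview, treat them as indicative; the output path is a placeholder):

# Sketch of a health_monitor section; output JSON files land under
# output-path and rotate at 10MB x 10 files, as described above.
health_monitor:
  enable-monitoring: true
  stats-poll-interval-sec: 10
  output-path: /var/log/blobfuse2
  monitor-disable-list:
    - network_profiler   # not implemented yet, per the comment above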
