Errors while uploading #1257
also pinging @AlmightyYakob as there may be both server and cli interactions at play.
could you please share the log file which should have been mentioned at the end of the process? Maybe it would help to see what is not implemented and thus causing the 501.
I can't see any meaningful information in the log file besides the snippets I have pasted above, but I'm attaching it here anyway!
I see two types of errors in the log file:
@gmazzamuto What type of filesystem are you uploading from? Do the errors happen again if you upload again?
The files are not modified during the upload. I am uploading from an NFS share. When I reuploaded, there were no errors. The upload I am doing right now is showing two 400 errors and one 501 error.
NFS can be fun, depending on the enthusiasm (or lack thereof) of the admins ;) At some point, to reduce the "fun", I just made it all plain
Hi Yarik, the files were created several hours before the upload. They are stored on a NAS server with RAID and accessed through an NFS share. Do you think the problems I am seeing are due to the network share? I've never noticed data corruption before. During the last upload I got several errors, including a 501, but with a slightly different stack trace:
@jwodder what about those "not implemented" 501s coming from AWS -- what are those exactly about, and isn't it us who submit those requests?
@yarikoptic As before, it's the same problem as #1033.
In that #1033 it was also NFS, over BeeGFS, so so far there is some common aspect here. Do you know your underlying NAS server's filesystem?
it might not necessarily be "corruption" but some other odd behavior/delayed metadata propagation. E.g. in #1033 (comment) @jwodder hypothesizes that some inability to get the file size reported forces requests to fall back to chunked transfer encoding.
@jwodder believes (and I trust him on that) that it is likely that issue of chunked encoding described in #1033 (comment) and cited above. @jwodder - do you mean this https://github.com/psf/requests/blob/HEAD/requests/models.py#L549 or some other location? How could we instrument our code to learn more about the situation?
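For context, a paraphrased sketch of what that part of `requests` does (my reading, not the exact library source): when the length of a streamed body cannot be determined -- or comes back as 0 -- `requests` does not send `Content-Length` and falls back to chunked transfer encoding, which S3 pre-signed PUTs reject with 501:

```python
# Paraphrased gist of requests' PreparedRequest.prepare_body() around
# requests/models.py#L549 -- an illustration of the suspected fallback,
# not the exact library source.
from requests.utils import super_len

def prepare_body_gist(headers, data):
    try:
        # for an open file this ultimately consults fstat()/seek-tell
        length = super_len(data)
    except (TypeError, AttributeError):
        length = None

    if length:
        # normal case: a definite size is advertised up front
        headers["Content-Length"] = str(length)
    else:
        # if NFS momentarily reports the file as 0 bytes we end up here,
        # and S3 answers the pre-signed PUT with 501 Not Implemented
        headers["Transfer-Encoding"] = "chunked"
    return headers
```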
@yarikoptic That's one of the locations involved, yes. What exactly do you want to discover?
ultimately -- a resolution to this issue so that there is no error during uploading. For that, IMHO, we need to troubleshoot further and instrument to gather
Found another old NFS-related issue: #764
@yarikoptic Could you be more precise about exactly what sort of instrumentation you want? Your first bullet point sounds like you want to detect & report when
correct
in the 2nd point I
@yarikoptic Yes, we are retrying on the 501 errors. For example, grepping for one of the failed upload URLs from the logs in this comment gives:
So, if & when the 501 error occurs (only for the first time for a given uploaded file?), you want dandi-cli to try to get the filesize the same way
NB I had a question: are we "retrying at a proper level" or just sending the "bad" `Content-Length` over and over again? Seems the answer is yes: here is the stack for the point where
so the
if we can get the file size there, then it is one of those 3 cases I listed which could be at play here. Most likely, IMHO, it is some cause other than getting the size. But to entirely rule it out -- we would need to overload `requests.utils.super_len()`
which is what is used by requests, and instrument it to log the "body" (what file it is) and the returned size. This way we would discover for sure whether size is somehow to blame. What if we start from there and add instrumentation of that function, triggered by e.g. a dedicated env var like `DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN`?
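A minimal sketch of such instrumentation, assuming we simply wrap `requests.utils.super_len()` (and the reference `requests.models` imports) whenever `DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN` is set; the wrapper and logging details are illustrative, not necessarily what dandi-cli ended up doing:

```python
import logging
import os

import requests.models
import requests.utils

lgr = logging.getLogger("dandi")

def instrument_super_len() -> None:
    """Wrap requests.utils.super_len() to log which body it sized and the result."""
    if not os.environ.get("DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN"):
        return
    original = requests.utils.super_len

    def spying_super_len(o):
        size = original(o)
        name = getattr(o, "name", None)  # open files carry the path here
        lgr.debug("super_len(%r) [file: %s] = %r", o, name, size)
        if not size:
            # the situation suspected to trigger chunked encoding and thus the 501s
            lgr.warning("super_len() reported %r for %s", size, name or o)
        return size

    requests.utils.super_len = spying_super_len
    # requests.models imports super_len by name, so patch that reference too
    requests.models.super_len = spying_super_len
```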
@gmazzamuto would you be so kind as to try the fresh 0.52.0 release, which logs more information about failing requests -- maybe it would give us more clues on what is going on.
Log results of `requests.utils.super_len()` when `DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN` is set
@gmazzamuto, if you could install from the GitHub master branch, and run with `DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN` set
Sure, I can give it a try. But I have completed my uploads for now; there are only some small files left (photos and JSON sidecar files). I guess it's more likely to happen with larger files, but I can try anyway.
any odd behaviors you were (un)lucky to run into @gmazzamuto ?
Hi Yarik, I haven't had the chance to upload new data yet, I will try in the coming days then I'll let you know how it goes! |
no, sorry, my fault: I saw a new available version of dandi-cli and installed that without checking whether it contained the relevant commit. I'll send the log using the correct version of the client
Hi all, here is the latest log: |
I see the following problems in the log:
Conclusion: NFS is a liar. |
"coolio!" another question before we proceed: @jwodder do you think it is pertinent to only size 0 or might be for other occasions that NFS reports wrong size first -- would some check/code fail if file size changes during upload there from the moment of initial call to |
That depends on how the S3 server is implemented. The HTTP standard doesn't seem to specify a behavior when a request has an inaccurate `Content-Length`.
@gmazzamuto could you share details of kernel version (
S3 already checks that the uploaded files match the MD5 checksums that the client provided before starting the upload. If a
Is this a change to the current loop when spying on `super_len()`?
right, but if initial (before upload commenced) read
for you to decide, but I think the
Yes, there seem to be some problems with NFS... However, only two of the exceptions seem to be related to the
oh -- indeed! And 3 errors point to the same expected and calculated digests:
❯ grep 'The Content-MD5 you specified did' log.txt | sort
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>4415dc11527da709bcb9729ce68c642b</ExpectedDigest><CalculatedDigest>fQafc/jEmCnHsiz5Q1YzJw==</CalculatedDigest><RequestId>J6XF28CCXZM6FBFP</RequestId><HostId>IzKVTFkEhPhHVH+EI5a73d+EnjZbHImw2EcCXgO8mdd8O20IDYdTf90/WdZqiz0AMHIUuCTFzVs=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>4415dc11527da709bcb9729ce68c642b</ExpectedDigest><CalculatedDigest>fQafc/jEmCnHsiz5Q1YzJw==</CalculatedDigest><RequestId>J6XF28CCXZM6FBFP</RequestId><HostId>IzKVTFkEhPhHVH+EI5a73d+EnjZbHImw2EcCXgO8mdd8O20IDYdTf90/WdZqiz0AMHIUuCTFzVs=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>4415dc11527da709bcb9729ce68c642b</ExpectedDigest><CalculatedDigest>fQafc/jEmCnHsiz5Q1YzJw==</CalculatedDigest><RequestId>J6XF28CCXZM6FBFP</RequestId><HostId>IzKVTFkEhPhHVH+EI5a73d+EnjZbHImw2EcCXgO8mdd8O20IDYdTf90/WdZqiz0AMHIUuCTFzVs=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>823a57a8016c1d44d24794c7399e0ca9</ExpectedDigest><CalculatedDigest>gv2hsINH/+jlNy0QpKkeLw==</CalculatedDigest><RequestId>HDJQP9RNYP68WF4K</RequestId><HostId>SBMp22+m3Mh0pGpuZe0UXpZE8KmHWopgSNxgXMCOcFj1DciCu9R/6lx3RK8RdDZbz2EGLvxo/CU=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>8963fa526c5be1e0f57b95ffd2e9517c</ExpectedDigest><CalculatedDigest>z2Pn5UdgjoSjNHmg28/H9Q==</CalculatedDigest><RequestId>6B8D87Z9MP1E9PNX</RequestId><HostId>a8TVjSov3SapE8lhR3O3tSkXIrZ0EB6qJTCaCl/K4u2rKfySDjUey/txKBl1osjBUW85jKWDg3E=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>ab1741f39527631526ea36d12924ae0c</ExpectedDigest><CalculatedDigest>yfoqltT4y8YiKeOtOkiPAg==</CalculatedDigest><RequestId>2CJE832QVMET5K1G</RequestId><HostId>2D6S5wTB4d459i2Nc2FW9kMa0T4Fm7JparIFluz1E+088UtUArc0Kush8QmvUKmjtnlccp6Pxtw=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>fa994fde9c8f7863ecbdaa34cd9b52d4</ExpectedDigest><CalculatedDigest>JRpvUxoiyfX7dKr6pB3t3w==</CalculatedDigest><RequestId>7VE63P018K6BCNSC</RequestId><HostId>F4snlJfV9fA+3fOUE2ISiRIdxWSmGQGtfxV0z2dEzWQFnBFPZYfcl/ouwxPEdXFCQUWdG9MVgY0=</HostId></Error>
not sure why it reports the calculated digest in base64 while the original is in hex, but here is detail on the first instance of such... I wonder if it could be due to that incorrect size being reported at first? I guess it could be tested by incrementally passing longer and longer portions of the file and seeing whether we hit that "ExpectedDigest" (MD5, since it is a single-part upload?)
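If it helps, a rough sketch of that probe (the path and digest below are placeholders taken from the error above, not anything dandi-cli currently does): hash progressively longer prefixes of the local file and see whether any prefix reproduces the ExpectedDigest.

```python
import hashlib

def find_prefix_matching_md5(path, expected_hex, step=1 << 20):
    """Return the prefix length (in bytes) of `path` whose MD5 equals
    `expected_hex`, checking at multiples of `step`, or None if no such
    prefix matches.  A match at some truncated length would point to a
    short read (wrong size) at upload time rather than data corruption.
    """
    md5 = hashlib.md5()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(step)
            if not chunk:
                break
            md5.update(chunk)
            total += len(chunk)
            # hexdigest() does not finalize the hash object, so this is safe
            if md5.hexdigest() == expected_hex:
                return total
    return None

# e.g. with the first ExpectedDigest from the log (path is a placeholder):
# find_prefix_matching_md5("path/to/that/zarr/entry", "4415dc11527da709bcb9729ce68c642b")
```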
I don't think the size reported by
The
Given that we see those other "digest-based" errors too, I personally would not make any bets on the interplay between stat's size and IO reads... adding a check is very cheap, so why not do it? By any chance, do you see anything in the logs we already have suggesting size changes associated with the MD5 digest mismatches?
I think so, and making it non-conditional on that dedicated env var
So what exactly is the client supposed to check after uploading a file? Should it check that the size of the local file is the same as at the start of the upload? I don't really see the value in that.
Other than the digest mismatches themselves, no.
Do you mean that you want
It would have (without our extra checks for size 0) caught that size change from the beginning (e.g. from 0) to some other value. And that is what I want us to do here -- to add such a trivial and basic check, so that if we get an MD5 digest mismatch, then depending on whether we also get this size-change error, we would know whether it was merely due to a size change or something else.
yes, if that is the easiest way to provision a check of the size to get a consistent reading for
So, if & only if there's a digest mismatch (which currently already causes an exception), the upload code should stat the uploaded file and check whether its size is the same as at the beginning of the upload? If it's not the same, should that result in a log message or an alternative exception or what? Side note: When uploading asset blobs (but not Zarr entries), the client computes a digest before uploading, sends the digest to the Archive, does the upload, and then compares the digest reported by S3 against the earlier digest, erroring if they differ. I don't know whether the Archive or S3 would error before the client if they detect a digest mismatch, but if they do, should the client try to respond to those errors as well by checking for changes in file size?
I'm unclear what changes you want me to make around
nope, the size check is to be done always, regardless of whether there was an error with the upload or not.
If there is already some other problem with the upload (MD5 mismatch or another) -- we can just log an error about the size difference. If there is no other error -- we can raise an exception about the size change instead of (or in addition to) logging.
per the above, indeed let's always check for size change.
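Something like this minimal sketch is what I have in mind (the `do_upload` callable is a hypothetical stand-in for whatever actually performs the upload): stat the file before and after, log the discrepancy if the upload already failed for another reason, and raise otherwise.

```python
import os

def upload_with_size_check(path, do_upload, lgr):
    """Stat `path` before and after `do_upload(path)` and complain if it changed.

    `do_upload` is a placeholder for the actual upload code.
    """
    size_before = os.stat(path).st_size
    try:
        do_upload(path)
    except Exception:
        # the upload already failed for another reason (e.g. MD5 mismatch): just log
        size_after = os.stat(path).st_size
        if size_after != size_before:
            lgr.error("Size of %s changed during upload: %d -> %d",
                      path, size_before, size_after)
        raise
    size_after = os.stat(path).st_size
    if size_after != size_before:
        # the upload "succeeded" but the file changed underneath us: make it loud
        raise RuntimeError(
            f"Size of {path} changed during upload: {size_before} -> {size_after}"
        )
```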
I was just replying that if you see spying on `super_len()`
@yarikoptic These asides on what changes should be made have gone on too long and have become too hard to follow. Please create a separate issue describing what changes I should make. Please make the initial comment of the issue entirely & completely self-contained so that it is unnecessary to refer back to this thread to understand what is being asked. If the change regarding looping for 0-sized files is supposed to be a separate thing from the check that file sizes didn't change, please create two issues instead.
@gmazzamuto please try the current master version -- after #1374 there should be even more checks/logs, and it will be slower for anyone having 0-length files ;-) No real need for the dedicated env var, but if you set it -- it should not hurt, and would add to the mystery if we again hit it at that level.
Dear all, a quick update just to let you know that with version 0.61.2 I still get the very same errors. Attached is an excerpt from the log (which would otherwise be huge, btw): 20240402073313Z-1281068.log.gz Maybe, as you were suggesting, the problem is linked to the data being on an NFS share? That's strange, because I do all sorts of processing on this data and I have never encountered any issue of this kind. I have a few more batches of data to upload, so if you want me to try different things just let me know! Cheers
@jwodder please review the logs and see if you can catch the reason for the problems
@yarikoptic Per the given logs:
For the first two types, I stand by my analysis above in #1257 (comment). The last type presumably happened due to something being slow.
For all of those I think we should try to just include those URLs in the next batch of upload URLs (or create a new batch if that was the last one), to reattempt with the filesystem's "current" opinion about them. Let's permit up to 3 such "attempts" (with re-minting of the upload URL).
@yarikoptic Exactly what errors should cause the Zarr entry upload to be re-batched (especially now that the HIFNI thing may be solved)? For the Content-MD5 mismatches, should the client redigest the local file? @jjnesbitt If a Zarr asset upload to S3 fails with "Request has expired," is resubmitting it to
Yes, that would be the correct way to retry it. Internally we track no state about a zarr asset (that is, a file within a zarr), and so re-requesting the pre-signed upload URL for the same file multiple times is not an issue. The worst thing that would happen is an overwrite of the existing file, if it's already been successfully uploaded.
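A rough sketch of the re-batching being proposed above (the helpers `request_upload_urls()` and `upload_entry()` are hypothetical stand-ins for the actual client/API calls):

```python
MAX_MINT_ATTEMPTS = 3  # up to 3 re-mints of the pre-signed URL, per the proposal above

def upload_zarr_entries(entries, request_upload_urls, upload_entry, lgr):
    """Upload Zarr entries, re-requesting pre-signed URLs for the ones that fail."""
    pending = [(entry, 1) for entry in entries]
    while pending:
        # mint a fresh batch of pre-signed URLs for everything still pending
        urls = request_upload_urls([entry for entry, _ in pending])
        still_failing = []
        for (entry, attempt), url in zip(pending, urls):
            try:
                upload_entry(entry, url)  # PUT to S3; may fail (expiry, MD5 mismatch, ...)
            except Exception as exc:
                if attempt >= MAX_MINT_ATTEMPTS:
                    raise
                lgr.warning("Upload of %s failed (%s); re-batching for attempt %d",
                            entry, exc, attempt + 1)
                still_failing.append((entry, attempt + 1))
        pending = still_failing
```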
I do not remember the complete list of problems -- I guess those which we have coded for so far. You correctly reminded me about the Content-MD5 mismatch -- since we would not know the reason for it, it would indeed make sense to redigest the local file as well.
The comment before your last one might be a start.
Aside from the HIFNI stuff, the only such errors I can think of are the 5xx errors that we retry on, but if the repeated retrying fails, I don't think adding even more retrying is the best idea.
Dear all,
I am seeing errors while uploading some Zarr files:
Here are some (slightly redacted) excerpts from the log:
Hope this helps!
Giacomo