Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors while uploading #1257

Open
gmazzamuto opened this issue Mar 27, 2023 · 69 comments
Open

Errors while uploading #1257

gmazzamuto opened this issue Mar 27, 2023 · 69 comments
Assignees
Labels
awaiting-user-response HIFNI Zarr uploads failing with "A header you provided implies functionality that is not implemented" zarr

Comments

@gmazzamuto
Copy link

Dear all,

I am seeing errors while uploading some Zarr files:

PATH                                                                                        SIZE     ERRORS    UPLOAD STATUS      MESSAGE                                                                                                    
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS03_stain-NeuN_SPIM.ome.zarr         28.2 GB    0         100% done                                                                                                                   
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS03_stain-Somatostatin_SPIM.ome.zarr 27.0 GB    0         100% done                                                                                                                   
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS04_stain-Calretinin_SPIM.ome.zarr   29.8 GB    0         100% done                                                                                                                   
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS04_stain-NeuN_SPIM.ome.zarr         30.0 GB    0         100% done                                                                                                                   
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS04_stain-Somatostatin_SPIM.ome.zarr 28.3 GB    0         100% done                                                                                                                   
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS05_stain-Calretinin_SPIM.ome.zarr   26.8 GB    0          49% ERROR       Error 400 while sending PUT request to https://dandiarchive.s3.amazonaws.com/zarr/40e22a9d-3cf4-4245-aa...
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS05_stain-NeuN_SPIM.ome.zarr         28.1 GB    0           6% ERROR       501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba...
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS05_stain-Somatostatin_SPIM.ome.zarr 26.7 GB    0          56% ERROR       501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/fde6ba34-0c45-49b...
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS06_stain-Calretinin_SPIM.ome.zarr   28.0 GB    0         100% done                                                                                                                   
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS06_stain-NeuN_SPIM.ome.zarr         29.0 GB    0          53% ERROR       501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/e001a5a3-3949-43b...
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS06_stain-Somatostatin_SPIM.ome.zarr 27.7 GB    0         100% done                                                                                                                   
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS07_stain-Calretinin_SPIM.ome.zarr   30.3 GB    0          26% ERROR       Error 400 while sending PUT request to https://dandiarchive.s3.amazonaws.com/zarr/4a299473-0876-4f96-b8...
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS07_stain-NeuN_SPIM.ome.zarr         32.0 GB    0          39% ERROR       501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/be69829d-16dd-48c...
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS07_stain-Somatostatin_SPIM.ome.zarr 30.3 GB    0          53% ERROR       501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/ee44994f-7e2c-45a...
sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS08_stain-Calretinin_SPIM.ome.zarr   29.2 GB    0           2% ERROR       501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/2e5005d7-efd2-493...

Here are some (slightly redacted) excerpts from the log:

2023-03-27T15:49:37+0200 [INFO    ] dandi 3822266:140510692464448 dandi v0.51.0, hdmf v3.5.2, pynwb v2.3.1, h5py v3.8.0
2023-03-27T15:49:37+0200 [INFO    ] dandi 3822266:140510692464448 sys.argv = ['/opt/bin/dandi', 'upload', '--allow-any-path', '--validation', 'ignore', 'sub-I45/ses-SPIM']


2023-03-27T17:01:11+0200 [ERROR   ] dandi 3822266:140508050728704 HTTP connection failed
Traceback (most recent call last):
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 210, in request
    for i, attempt in enumerate(
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 240, in request
    result.raise_for_status()
  File "/opt/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42...


2023-03-27T17:01:23+0200 [ERROR   ] dandi 3822266:140509742302976 Error uploading /NIH/DANDI/000026/sub-I45/ses-SPIM/micr/sub-I45_ses-SPIM_sample-BrocaAreaS05_stain-NeuN_SPIM.ome.zarr:
Traceback (most recent call last):
  File "/opt/lib/python3.8/site-packages/dandi/upload.py", line 240, in process_path
    for r in dfile.iter_upload(
  File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 470, in iter_upload
    size = fut.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 544, in _upload_zarr_file
    storage_session.put(
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 311, in put
    return self.request("PUT", path, **kwargs)
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 210, in request
    for i, attempt in enumerate(
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 240, in request
    result.raise_for_status()
  File "/opt/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42...


2023-03-27T17:36:08+0200 [DEBUG   ] dandi 3822266:140509784250112 Error uploading zarr: HTTPError: Error 400 while sending PUT request to https://dandiarchive.s3.amazonaws.com/zarr/40e22a9d-3cf4-4245-aa7e-400b65381cd4/0/1/6/86....: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>4f0daa8170fae2bd0b05e09f61b10d66</ExpectedDigest><CalculatedDigest>Zvtu9pNAoZlNjjFTrpMhKQ==</CalculatedDigest><RequestId>1EEDC8B4XNEJNSJ9</RequestId><HostId>AOiMy3L2IOtiyfrA6I+sFe6Nxb0WzDGVoVYyQaTsWWzsSoV9aSl+Y8vGz0oaU8b3u62p9ccCsK4=</HostId></Error>
2023-03-27T17:36:08+0200 [DEBUG   ] dandi 3822266:140509784250112 Cancelling upload

Hope this helps!
Giacomo

@satra
Copy link
Member

satra commented Mar 27, 2023

also pinging @AlmightyYakob as there may be both server and cli interactions at play.

@yarikoptic
Copy link
Member

could you please share the log file which should have been mentioned at the end of the process? May be would help to see what is not implemented and thus causing 501.

@yarikoptic yarikoptic removed their assignment Mar 27, 2023
@gmazzamuto
Copy link
Author

I can't see any meaningful information in the log file besides the snippets I have pasted above, I'm attaching it here anyways!
20230327134937Z-3822266.log.gz

@jwodder
Copy link
Member

jwodder commented Mar 30, 2023

I see two types of errors in the log file:

@gmazzamuto What type of filesystem are you uploading from? Do the errors happen again if you upload again?

@gmazzamuto
Copy link
Author

The files are not modified during the upload. I am uploading from an NFS share. When I reuploaded, there were no errors. The upload I am doing right now is showing two 400 errors and one 501 error.

@yarikoptic
Copy link
Member

NFS can be fun, depending on the enthusiasm (or lack of such) of admins ;) At some point even, to reduce "fun", I just made it all plain sync (read -- slow) to avoid any kind of surprises which I had of various kinds. How "recent" the data files you are uploading were created/modified in respect to their time point of upload? may be we should start recording mtime of files we upload so if smth goes wrong we could check if mtime did not change but I am neither sure it is the whole situation here.

@gmazzamuto
Copy link
Author

Hi Yarik, the files where created several hours before the upload. They are stored on a NAS server with RAID and accessed through an NFS share. Do you think the problems I am seeing are due to the network share? I've never noticed data corruption before.

During the last upload I got several errors, including a 501 but with a slightly different stack trace:

Traceback (most recent call last):                                                                                                                                                                                                           
  File "/opt/bin/dandi", line 8, in <module>                                                                                                                                                                           
    sys.exit(main())                                                                                                                                                                                                                         
  File "/opt/lib/python3.8/site-packages/click/core.py", line 1130, in __call__                                                                                                                                        
    return self.main(*args, **kwargs)                                                                                                                                                                                                        
  File "/opt/lib/python3.8/site-packages/click/core.py", line 1055, in main                                                                                                                                            
    rv = self.invoke(ctx)                                                                                                                                                                                                                    
  File "/opt/lib/python3.8/site-packages/click/core.py", line 1657, in invoke                                                                                                                                          
    return _process_result(sub_ctx.command.invoke(sub_ctx))                                                                                                                                                                                  
  File "/opt/lib/python3.8/site-packages/click/core.py", line 1404, in invoke                                                                                                                                          
    return ctx.invoke(self.callback, **ctx.params)                                                                                                                                                                                           
  File "/opt/lib/python3.8/site-packages/click/core.py", line 760, in invoke                                                                                                                                           
    return __callback(*args, **kwargs)                                                                                                                                                                                                       
  File "/opt/lib/python3.8/site-packages/click/decorators.py", line 38, in new_func                                                                                                                                    
    return f(get_current_context().obj, *args, **kwargs)                                                                                                                                                                                     
  File "/opt/lib/python3.8/site-packages/dandi/cli/base.py", line 102, in wrapper                                                                                                                                      
    return f(*args, **kwargs)                                                                                                                                                                                                                
  File "/opt/lib/python3.8/site-packages/dandi/cli/cmd_upload.py", line 98, in upload                                                                                                                                  
    upload(                                                                                                                                                                                                                                  
  File "/opt/lib/python3.8/site-packages/dandi/upload.py", line 343, in upload                                                                                                                                         
    raise upload_err                                                                                                                                                                                                                         
  File "/opt/lib/python3.8/site-packages/dandi/upload.py", line 240, in process_path
    for r in dfile.iter_upload(                            
  File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 470, in iter_upload
    size = fut.result()                                    
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result                                          
    return self.__get_result()                             
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result                                    
    raise self._exception                                  
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run                                             
    result = self.fn(*self.args, **self.kwargs)            
  File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 544, in _upload_zarr_file
    storage_session.put(                                   
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 311, in put
    return self.request("PUT", path, **kwargs)             
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 210, in request
    for i, attempt in enumerate(                           
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)                
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()                              
  File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()                       
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result                                          
    return self.__get_result()                             
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result                                    
    raise self._exception                                  
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 240, in request
    result.raise_for_status()                              
  File "/opt/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)                                                                    
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/d5f67f3d-144c-447e-aae0-ca5b979c648d/0/1/22/58?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F
20230330%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230330T134214Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=e65355bb3725ca73f8d313d9e907ddbbeb00c3e7868d392b233b8eaf2eb7b1b5

@yarikoptic
Copy link
Member

@jwodder what about those "not implemented" 501s coming from AWS -- what are those exactly about and isn't it us who submit those requests?

@jwodder
Copy link
Member

jwodder commented Mar 31, 2023

@yarikoptic As before, it's the same problem as #1033.

@yarikoptic
Copy link
Member

Hi Yarik, the files where created several hours before the upload. They are stored on a NAS server with RAID and accessed through an NFS share. Do you think the problems I am seeing are due to the network share?

In that #1033 it also was NFS over beegfs, so so far it is some common aspect here. Do you know your underlying NAS server filesystem?

I've never noticed data corruption before.

it might be not necessarily "corruption" but some other odd behavior/delayed metadata propagation. E.g. in #1033 (comment) @jwodder hypothesizes that may be some inability to get file size reported forces requests to

sends a "Transfer-Encoding" (with value "chunked") when it's unable to determine the size of the request payload .. The only reason I can think of as to why getting the file size should fail would be due to some sort of filesystem hiccup.

During the last upload I got several errors, including a 501 but with a slightly different stack trace:
...
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/d5f67f3d-144c-447e-aae0-ca5b979c648d/0/1/22/58?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F
20230330%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230330T134214Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=e65355bb3725ca73f8d313d9e907ddbbeb00c3e7868d392b233b8eaf2eb7b1b5

@jwodder believes (and I trust him on that) that it is likely that issue of chunked encoding described in #1033 (comment) and cited above.

@jwodder - do you mean this https://github.com/psf/requests/blob/HEAD/requests/models.py#L549 or some other location? how could we instrument in our code to discover more about the situation better?

@jwodder
Copy link
Member

jwodder commented Apr 3, 2023

@yarikoptic That's one of the location involved, yes. What exactly do you want to discover?

@yarikoptic
Copy link
Member

@yarikoptic That's one of the location involved, yes. What exactly do you want to discover?

ultimately -- resolution to this issue so that there is no error during uploading. For that IMHO we need to troubleshoot further and instrument to gather

  • some evidence that your idea on the underlying issue is right on. If it is (refusal to stat a size) it sounds really odd that size would not be reported, and IMHO then should be reported to the admins of those systems
  • workaround: presumably (double check for this particular case in the logs, report back) we are retrying. It sounds really unlikely that the same filesystem fluke happens reproducibly for an extended period of time. If it happens at requests level reproducibly -- we should be able to detect it at dandi-cli level when catching that exception -- instrument there to confirm that it is indeed about inability to request the size

@yarikoptic
Copy link
Member

Found another old NFS related issue: #764

@jwodder
Copy link
Member

jwodder commented Apr 4, 2023

@yarikoptic Could you be more precise about exactly what sort of instrumentation you want? Your first bullet point sounds like you want to detect & report when requests fails to get a file's size, and your second bullet point also sounds like you want to detect & report when requests fails to get a file's size (except that the second bullet says to do this by inspecting some exception, but the only exception dandi-cli sees is raised in response to S3 returning a 501 error). Am I misinterpreting what you're saying, or are you asking for the exact same thing twice with completely different words?

@yarikoptic
Copy link
Member

Your first bullet point sounds like you want to detect & report when requests fails to get a file's size

correct

and your second bullet point also ... Am I misinterpreting what you're saying, or are you asking for the exact same thing twice with completely different words?

in the 2nd point I

  • asked to double check if we are retrying to send a file on that particular 501 error from AWS
  • continued arguing that if we get to that 501 exception handling after initial error we must fail to get the size the way requests fails to get the size if on retry requests also fails to get the size. If we do not fail to get the size, but requests again leads us to 501 then it is either
    • it is not the "size getting" which flips it and we better provision more,
    • we do get size somehow differently
    • some other factors within requests code which cause it to fail there.

@jwodder
Copy link
Member

jwodder commented Apr 4, 2023

@yarikoptic Yes, we are retrying on the 501 errors. For example, grepping for one of the failed upload URLs from the logs in this comment gives:

2023-03-27T17:00:04+0200 [DEBUG   ] dandi 3822266:140508050728704 PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:11+0200 [DEBUG   ] urllib3.connectionpool 3822266:140508050728704 https://dandiarchive.s3.amazonaws.com:443 "PUT /zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228 HTTP/1.1" 500 None
2023-03-27T17:00:13+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:15+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:18+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:21+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:24+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:29+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:34+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:41+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:49+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:59+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:01:11+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:01:12+0200 [DEBUG   ] dandi 3822266:140509742302976 Error uploading zarr: HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228

continued arguing that if we get to that 501 exception handling after initial error we must fail to get the size the way requests fails to get the size if on retry requests also fails to get the size.

So, if & when the 501 error occurs (only for the first time for a given uploaded file?), you want dandi-cli to try to get the filesize the same way requests does it, and then what?

@yarikoptic
Copy link
Member

NB I had a question are we "retrying at a proper level" or just sending the "bad" `Content-Length` over and over again? Seems to be yes:

here is the stack for the point where Content-Length is assigned:

(Pdb)   /usr/lib/python3.11/threading.py(995)_bootstrap()
-> self._bootstrap_inner()
  /usr/lib/python3.11/threading.py(1038)_bootstrap_inner()
-> self.run()
  /usr/lib/python3.11/threading.py(975)run()
-> self._target(*self._args, **self._kwargs)
  /usr/lib/python3.11/concurrent/futures/thread.py(83)_worker()
-> work_item.run()
  /usr/lib/python3.11/concurrent/futures/thread.py(58)run()
-> result = self.fn(*self.args, **self.kwargs)
  /home/yoh/proj/dandi/dandi-cli-master/dandi/files/zarr.py(544)_upload_zarr_file()
-> storage_session.put(
  /home/yoh/proj/dandi/dandi-cli-master/dandi/dandiapi.py(311)put()
-> return self.request("PUT", path, **kwargs)
  /home/yoh/proj/dandi/dandi-cli-master/dandi/dandiapi.py(227)request()
-> result = self.session.request(
  /home/yoh/proj/misc/requests/requests/sessions.py(573)request()
-> prep = self.prepare_request(req)
  /home/yoh/proj/misc/requests/requests/sessions.py(484)prepare_request()
-> p.prepare(
  /home/yoh/proj/misc/requests/requests/models.py(371)prepare()
-> self.prepare_body(data, files, json)
> /home/yoh/proj/misc/requests/requests/models.py(550)prepare_body()
-> if length:

so the sessions.py(573)request() is the request we are retrying with tenacity (in the original traceback) so I guess should be all good,.

So, if & when the 501 error occurs (only for the first time for a given uploaded file?), you want dandi-cli to try to get the filesize the same way requests does it, and then what?

if we can get the file size there then it is one of those 3 cases I listed which could be at play here. Most likely imho is that it is some other cause than size getting. But to entirely rule it out -- we would need to overload

❯ grep 'def super' requests/utils.py
def super_len(o):

which is what is used by requests, and instrument to log the "body" (what file it is) and returned size. This way we would discover for sure if that is somehow size to blame. What if we start from there and add instrumentation of that function to be triggered by e.g. DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN env var?

@yarikoptic
Copy link
Member

@gmazzamuto would you be so kind to try with fresh 0.52.0 release which would log more information about failing requests -- may be it would give us more clue on what is going on.

yarikoptic added a commit that referenced this issue Apr 5, 2023
Log results of `requests.utils.super_len()` when `DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN` is set
@yarikoptic
Copy link
Member

@gmazzamuto , if you could install from GitHub master branch, and run with DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN=1 environment variable set, would be even better and logs could help us to figure out more

@gmazzamuto
Copy link
Author

Sure, I can give it a try. But I have completed my uploads for now, there are only some small files left (photos and json sidecar files). I guess it's more likely to happen with larger files, but I can try anyways

@yarikoptic
Copy link
Member

any odd behaviors you were (un)lucky to run into @gmazzamuto ?

@gmazzamuto
Copy link
Author

Hi Yarik, I haven't had the chance to upload new data yet, I will try in the coming days then I'll let you know how it goes!

@gmazzamuto
Copy link
Author

no sorry, my fault: I saw a new available version of dandi-cli and installed that without checking if it contained the relevant commit. I'll send the log using the correct version of the client

@gmazzamuto
Copy link
Author

Hi all, here is the latest log:

log.txt

@jwodder
Copy link
Member

jwodder commented Dec 4, 2023

I see the following problems in the log:

2023-12-02T02:15:59+0100 [DEBUG   ] dandi 3030432:140064092833536 requests.utils.super_len() reported 0 for <_io.BufferedReader name='/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89'>
2023-12-02T02:15:59+0100 [DEBUG   ] dandi 3030432:140064092833536 - Size of 0 is suspicious; double-checking that NFS isn't lying
2023-12-02T02:15:59+0100 [DEBUG   ] dandi 3030432:140064092833536 - stat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89') = os.stat_result(st_mode=33188, st_ino=248398250076, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=312461, st_atime=1701477007, st_mtime=1700979160, st_ctime=1700979160)
2023-12-02T02:15:59+0100 [DEBUG   ] dandi 3030432:140064092833536 - fstat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89') = os.stat_result(st_mode=33188, st_ino=248398250076, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=312461, st_atime=1701477007, st_mtime=1700979160, st_ctime=1700979160)
RuntimeError: requests.utils.super_len() reported size of 0 for '/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89', but os.stat() reported size 312461 bytes 1 tries later
2023-12-02T07:38:21+0100 [DEBUG   ] dandi 3030432:140061148436224 requests.utils.super_len() reported 0 for <_io.BufferedReader name='/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51'>
2023-12-02T07:38:21+0100 [DEBUG   ] dandi 3030432:140061148436224 - Size of 0 is suspicious; double-checking that NFS isn't lying
2023-12-02T07:38:21+0100 [DEBUG   ] dandi 3030432:140061148436224 - stat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51') = os.stat_result(st_mode=33188, st_ino=248625074703, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=1355361, st_atime=1701494494, st_mtime=1700989187, st_ctime=1700989187)
2023-12-02T07:38:21+0100 [DEBUG   ] dandi 3030432:140061148436224 - fstat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51') = os.stat_result(st_mode=33188, st_ino=248625074703, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=1355361, st_atime=1701494494, st_mtime=1700989187, st_ctime=1700989187)
RuntimeError: requests.utils.super_len() reported size of 0 for '/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51', but os.stat() reported size 1355361 bytes 1 tries later

Conclusion: NFS is a liar.

@yarikoptic
Copy link
Member

"coolio!" another question before we proceed: @jwodder do you think it is pertinent to only size 0 or might be for other occasions that NFS reports wrong size first -- would some check/code fail if file size changes during upload there from the moment of initial call to super_len?

@jwodder
Copy link
Member

jwodder commented Dec 4, 2023

@yarikoptic

would some check/code fail if file size changes during upload there from the moment of initial call to super_len?

That depends on how the S3 server is implemented. The HTTP standard doesn't seem to specify a behavior when a request has an inaccurate Content-Length, so the server could react however it likes, such as by storing a file with a size that matches the Content-Length but not the uploaded data.

@yarikoptic
Copy link
Member

@jwodder

  • most likely it is a NFS behavior oddity specific for size 0, but let's add (at the end of iter_upload?) a check that file size didn't check from the point when it was initially requested to be uploaded. If changed -- raise an informative exception and blame the filesystem and/or user for having file changed under our "feet"
  • for files of size 0 -- add a loop alike we have but re-checking at smaller intervals and up to e.g. 2 seconds and if changes, log a warning and proceed with that new size. That should mitigate this issue @gmazzamuto is experiencing and allow for uploads to become more robust.

@gmazzamuto could you share details of kernel version (uname -a) and the mount options (output of mount grepped for that location) so we have a record of what scenario could potentially lead to such odd behavior. Ideally someone should report/clear it up with NFS (kernel) developers.

@jwodder
Copy link
Member

jwodder commented Dec 4, 2023

@yarikoptic

let's add (at the end of iter_upload?) a check that file size didn't [change] from the point when it was initially requested to be uploaded.

S3 already checks that the uploaded files match the MD5 checksums that the client provided before starting the upload. If a Content-Length were to get set to an inaccurate value and S3 honored it, this check would fail, and the upload would fail with an HTTP error.

for files of size 0 -- add a loop alike we have but re-checking at smaller intervals and up to e.g. 2 seconds and if changes, log a warning and proceed with that new size.

Is this a change to the current loop when spying on super_len() or is this a new loop to be added somewhere else (and, if so, where)?

@yarikoptic
Copy link
Member

let's add (at the end of iter_upload?) a check that file size didn't [change] from the point when it was initially requested to be uploaded.

S3 already checks that the uploaded files match the MD5 checksums that the client provided before starting the upload. If a Content-Length were to get set to an inaccurate value and S3 honored it, this check would fail, and the upload would fail with an HTTP error.

right, but if initial (before upload commenced) read stats of size were wrong (incomplete) of value X bytes, file read only to X bytes, and upload stopped at Content-Length of X bytes -- from AWS perspective all would be good. With extra check we would just ensure that the size didn't silently change during upload.

for files of size 0 -- add a loop alike we have but re-checking at smaller intervals and up to e.g. 2 seconds and if changes, log a warning and proceed with that new size.

Is this a change to the current loop when spying on super_len() or is this a new loop to be added somewhere else (and, if so, where)?

for you to decide, but I think the super_len as common to all uploads (blobs or zarr files, right?) would be the best place albeit a fragile approach. Alternatively -- just the extra check for all places were we might be uploading a file.

@gmazzamuto
Copy link
Author

Yes there seems to be some problems with NFS... However only two of the exceptions seem to be related to the super_len thing. The other exceptions report "The Content-MD5 you specified did not match what we received"

uname -a
Linux atlante 5.4.0-166-generic #183-Ubuntu SMP Mon Oct 2 11:28:33 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

mount | grep NIH
192.168.2.200:/pool_1/DataVol1/Public on /mnt/NIH type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.2.200,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.2.200,_netdev)

@yarikoptic
Copy link
Member

oh -- indeed! And 3 errors point to the same expected and calculated one
❯ grep 'The Content-MD5 you specified did' log.txt | sort
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>4415dc11527da709bcb9729ce68c642b</ExpectedDigest><CalculatedDigest>fQafc/jEmCnHsiz5Q1YzJw==</CalculatedDigest><RequestId>J6XF28CCXZM6FBFP</RequestId><HostId>IzKVTFkEhPhHVH+EI5a73d+EnjZbHImw2EcCXgO8mdd8O20IDYdTf90/WdZqiz0AMHIUuCTFzVs=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>4415dc11527da709bcb9729ce68c642b</ExpectedDigest><CalculatedDigest>fQafc/jEmCnHsiz5Q1YzJw==</CalculatedDigest><RequestId>J6XF28CCXZM6FBFP</RequestId><HostId>IzKVTFkEhPhHVH+EI5a73d+EnjZbHImw2EcCXgO8mdd8O20IDYdTf90/WdZqiz0AMHIUuCTFzVs=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>4415dc11527da709bcb9729ce68c642b</ExpectedDigest><CalculatedDigest>fQafc/jEmCnHsiz5Q1YzJw==</CalculatedDigest><RequestId>J6XF28CCXZM6FBFP</RequestId><HostId>IzKVTFkEhPhHVH+EI5a73d+EnjZbHImw2EcCXgO8mdd8O20IDYdTf90/WdZqiz0AMHIUuCTFzVs=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>823a57a8016c1d44d24794c7399e0ca9</ExpectedDigest><CalculatedDigest>gv2hsINH/+jlNy0QpKkeLw==</CalculatedDigest><RequestId>HDJQP9RNYP68WF4K</RequestId><HostId>SBMp22+m3Mh0pGpuZe0UXpZE8KmHWopgSNxgXMCOcFj1DciCu9R/6lx3RK8RdDZbz2EGLvxo/CU=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>8963fa526c5be1e0f57b95ffd2e9517c</ExpectedDigest><CalculatedDigest>z2Pn5UdgjoSjNHmg28/H9Q==</CalculatedDigest><RequestId>6B8D87Z9MP1E9PNX</RequestId><HostId>a8TVjSov3SapE8lhR3O3tSkXIrZ0EB6qJTCaCl/K4u2rKfySDjUey/txKBl1osjBUW85jKWDg3E=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>ab1741f39527631526ea36d12924ae0c</ExpectedDigest><CalculatedDigest>yfoqltT4y8YiKeOtOkiPAg==</CalculatedDigest><RequestId>2CJE832QVMET5K1G</RequestId><HostId>2D6S5wTB4d459i2Nc2FW9kMa0T4Fm7JparIFluz1E+088UtUArc0Kush8QmvUKmjtnlccp6Pxtw=</HostId></Error>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>fa994fde9c8f7863ecbdaa34cd9b52d4</ExpectedDigest><CalculatedDigest>JRpvUxoiyfX7dKr6pB3t3w==</CalculatedDigest><RequestId>7VE63P018K6BCNSC</RequestId><HostId>F4snlJfV9fA+3fOUE2ISiRIdxWSmGQGtfxV0z2dEzWQFnBFPZYfcl/ouwxPEdXFCQUWdG9MVgY0=</HostId></Error>

not sure why it reports calculated in base64 while original in hex, but here is detail on first instance

2023-12-02T03:10:46+0100 [ERROR   ] dandi 3030432:140067982329600 Error uploading /NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS14_stain-NeuN_SPIM.ome.zarr:
Traceback (most recent call last):
  File "/opt/lib/python3.8/site-packages/dandi/upload.py", line 357, in process_path
    for r in dfile.iter_upload(   
  File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 481, in iter_upload
    size = fut.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 555, in _upload_zarr_file
    storage_session.put( 
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 315, in put
    return self.request("PUT", path, **kwargs)
  File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 275, in request
    raise requests.HTTPError(msg, response=result)
requests.exceptions.HTTPError: Error 400 while sending PUT request to https://dandiarchive.s3.amazonaws.com/zarr/87c8b64e-6029-41fc-9640-85f85460fa1e/0/0/52/35?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20231202%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20231202T021034Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=869bd8f5c70ccfb2f4dedf053ab228afc1ba0bc1d4bf9c8aaf4796752010821c: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>fa994fde9c8f7863ecbdaa34cd9b52d4</ExpectedDigest><CalculatedDigest>JRpvUxoiyfX7dKr6pB3t3w==</CalculatedDigest><RequestId>7VE63P018K6BCNSC</RequestId><HostId>F4snlJfV9fA+3fOUE2ISiRIdxWSmGQGtfxV0z2dEzWQFnBFPZYfcl/ouwxPEdXFCQUWdG9MVgY0=</HostId></Error>

of such... I wonder if it could be for that reason of incorrect size reported first? I guess could be tested by incrementally passing longer and longer portions of the file and see if we hit that "ExpectedDigest" (md5 since single part upload?)

@jwodder
Copy link
Member

jwodder commented Dec 4, 2023

@yarikoptic

right, but if initial (before upload commenced) read stats of size were wrong (incomplete) of value X bytes, file read only to X bytes, and upload stopped at Content-Length of X bytes -- from AWS perspective all would be good. With extra check we would just ensure that the size didn't silently change during upload.

I don't think the size reported by stat plays any part in locally calculating the MD5 digest. The client just reads bytes from the file via Python's normal file-reading mechanisms.

for you to decide, but I think the super_len as common to all uploads (blobs or zarr files, right?) would be the best place albeit a fragile approach. Alternatively -- just the extra check for all places were we might be uploading a file.

The super_len instrumentation (when enabled via the envvar) is currently applied to all HTTP requests with bodies (i.e., all uploads) made during a run of dandi upload. With that in mind, is the only change you want for the sleep intervals to start out smaller and gradually increase to a maximum of two seconds?

@yarikoptic
Copy link
Member

right, but if initial (before upload commenced) read stats of size were wrong (incomplete) of value X bytes, file read only to X bytes, and upload stopped at Content-Length of X bytes -- from AWS perspective all would be good. With extra check we would just ensure that the size didn't silently change during upload.

I don't think the size reported by stat plays any part in locally calculating the MD5 digest. The client just reads bytes from the file via Python's normal file-reading mechanisms.

Given that we see those other "digest based" errors too, I personally would not make any bets on interplay between stat's size and IO reads... adding a check is very cheap, so why not to do it? By any chance do you see anything suggesting of size changes already in those logs we have associated with MD5 digests mismatch?

for you to decide, but I think the super_len as common to all uploads (blobs or zarr files, right?) would be the best place albeit a fragile approach. Alternatively -- just the extra check for all places were we might be uploading a file.

The super_len instrumentation (when enabled via the envvar) is currently applied to all HTTP requests with bodies (i.e., all uploads) made during a run of dandi upload. With that in mind, is the only change you want for the sleep intervals to start out smaller and gradually increase to a maximum of two seconds?

I think so, and making it non-conditional on that dedicated env var DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN

@jwodder
Copy link
Member

jwodder commented Dec 5, 2023

@yarikoptic

Given that we see those other "digest based" errors too, I personally would not make any bets on interplay between stat's size and IO reads... adding a check is very cheap, so why not to do it?

So what exactly is the client supposed to check after uploading a file? Should it check that the size of the local file is the same as at the start of the upload? I don't really see the value in that.

By any chance do you see anything suggesting of size changes already in those logs we have associated with MD5 digests mismatch?

Other than the digest mismatches themselves, no.

I think so, and making it non-conditional on that dedicated env var DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN

Do you mean that you want super_len() to always be spied on, without the need to set the envvar?

@yarikoptic
Copy link
Member

Given that we see those other "digest based" errors too, I personally would not make any bets on interplay between stat's size and IO reads... adding a check is very cheap, so why not to do it?

So what exactly is the client supposed to check after uploading a file? Should it check that the size of the local file is the same as at the start of the upload? I don't really see the value in that.

It would have (without our extra checks for size 0) caught that size change (e.g. from 0) at the beginning to some other value. And that is what I want us to do here -- to add such a trivial and basic check, so if we get md5 digest mismatch, depending on either we get this error too we would know if it was merely due to size change or something else.

I think so, and making it non-conditional on that dedicated env var DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN

Do you mean that you want super_len() to always be spied on, without the need to set the envvar?

yes, if that is the easiest way to provision check of the size to get consistent reading for super_len

@jwodder
Copy link
Member

jwodder commented Dec 5, 2023

@yarikoptic

It would have (without our extra checks for size 0) caught that size change (e.g. from 0) at the beginning to some other value. And that is what I want us to do here -- to add such a trivial and basic check, so if we get md5 digest mismatch, depending on either we get this error too we would know if it was merely due to size change or something else.

So, if & only if there's a digest mismatch (which currently already causes an exception), the upload code should stat the uploaded file and check whether its size is the same as at the beginning of the upload? If it's not the same, should that result in a log message or an alternative exception or what?

Side note: When uploading asset blobs (but not Zarr entries), the client computes a digest before uploading, sends the digest to the Archive, does the upload, and then compares the digest reported by S3 against the earlier digest, erroring if they differ. I don't know whether the Archive or S3 would error before the client if they detect a digest mismatch, but if they do, should the client try to respond to those errors as well by checking for changes in file size?

yes, if that is the easiest way to provision check of the size to get consistent reading for super_len

I'm unclear what changes you want me to make around super_len. Are you currently asking (a) for super_len to always be spied on, and (b) for the inter-check sleep for 0-sized files to start out at some small interval and increase up to a maximum of two seconds? Something else?

@yarikoptic
Copy link
Member

So, if & only if there's a digest mismatch (which currently already causes an exception), the upload code should stat the uploaded file and check whether its size is the same as at the beginning of the upload?

nope, size check to be always done, regardless if there was an error with upload or not.

If it's not the same, should that result in a log message or an alternative exception or what?

If there is already some other problem with upload (md5 mismatch or another) - we can just log error about size difference. If there is no other error -- can raise exception about size change instead (or in addition) to logging.

I don't know whether the Archive or S3 would error before the client if they detect a digest mismatch, but if they do, should the client try to respond to those errors as well by checking for changes in file size?

per above indeed let's check for size change always.

I think so, and making it non-conditional on that dedicated env var DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN

Do you mean that you want super_len() to always be spied on, without the need to set the envvar?

yes, if that is the easiest way to provision check of the size to get consistent reading for super_len

I'm unclear what changes you want me to make around super_len. Are you currently asking (a) for super_len to always be spied on, and (b) for the inter-check sleep for 0-sized files to start out at some small interval and increase up to a maximum of two seconds? Something else?

I was just replying that if you see spying on super_len to be the easiest way to provide desired "no change in size" checks -- do it that way. If spying is not needed/desired, implement it without spying. And yes -- for 0-sized files to check up to 2 seconds.

@jwodder
Copy link
Member

jwodder commented Dec 11, 2023

@yarikoptic These asides on what changes should be made have gone on too long and have become too hard to follow. Please create a separate issue describing what changes I should make. Please make the initial comment of the issue entirely & completely self-contained so that it is unnecessary to refer back to this thread to understand what is being asked. If the change regarding looping for 0-sized files is supposed to be a separate thing from the check that file sizes didn't change, please create two issues instead.

@yarikoptic
Copy link
Member

@gmazzamuto please try current master version -- after #1374 there should be even more checks/logs and it will be slower for anyone having 0-length files ;-) no need for dedicated env var really but if you set -- should not hurt, and would add mystery if we again hit it at that level.

@gmazzamuto
Copy link
Author

gmazzamuto commented Apr 3, 2024

Dear all,

a quick update just to let you know that with version 0.61.2 I still get the very same errors. Attached is an excerpt from the log (which would be otherwise huge, btw).

20240402073313Z-1281068.log.gz

Maybe as you were suggesting the problem is linked to the data being on an NFS share? That's strange because I do all sorts of processing on this data and I have never encountered any issue of this kind.

I have a few more batches of data to upload, if you want me to try different things just let me know!

Cheers
Giacomo

@yarikoptic
Copy link
Member

@jwodder please review the logs and see if you catch the reason for the problems

@jwodder
Copy link
Member

jwodder commented Apr 5, 2024

@yarikoptic Per the given logs:

  • 7 uploads failed with status 400 and the message "The Content-MD5 you specified did not match what we received."
  • 4 uploads failed with status 501.
  • 3 uploads failed with status 403 and the message "Request has expired"

For the first two types, I stand by my analysis above in #1257 (comment). The last type presumably happened due to something being slow.

@yarikoptic
Copy link
Member

For all of those I think we should try to just include those URLs into the next batch (or create a new batch if that was the last one) of upload URLs to try to reattempt with "current" opinion of the filesystem about them. Let's permit to up to 3 such "attempts" (with re-minting upload URL).

@jwodder jwodder added the HIFNI Zarr uploads failing with "A header you provided implies functionality that is not implemented" label May 22, 2024
@jwodder
Copy link
Member

jwodder commented May 23, 2024

@yarikoptic Exactly what errors should cause the Zarr entry upload to be re-batched (especially now that the HIFNI thing may be solved)? For the Content-MD5 mismatches, should the client redigest the local file?

@jjnesbitt If a Zarr asset upload to S3 fails with "Request has expired," is resubmitting it to /zarr/{zarr_id}/files/ the right way (or even a correct way) to retry it? What if the client resubmits a file that wasn't successfully uploaded due to a digest mismatch?

@jjnesbitt
Copy link
Member

jjnesbitt commented May 23, 2024

@jjnesbitt If a Zarr asset upload to S3 fails with "Request has expired," is resubmitting it to /zarr/{zarr_id}/files/ the right way (or even a correct way) to retry it? What if the client resubmits a file that wasn't successfully uploaded due to a digest mismatch?

Yes that would be the correct way to retry it. Internally we track no state about a zarr asset (that is, a file within a zarr), and so re-requesting the pre-signed upload URL for the same file multiple times is not an issue. The worst thing that would happen is an overwrite of the existing file, if it's already been successfully uploaded.

@yarikoptic
Copy link
Member

@yarikoptic Exactly what errors should cause the Zarr entry upload to be re-batched (especially now that the HIFNI thing may be solved)? For the Content-MD5 mismatches, should the client redigest the local file?

I do not remember the complete list of problems -- I guess those which we coded for so far in the code. You reminded correctly on Content-MD5 mismatch -- since we would not know the reason for it, indeed would make sense to redigest local file as well.

@jwodder
Copy link
Member

jwodder commented May 23, 2024

@yarikoptic

I do not remember the complete list of problems

The comment before your last one might be a start.

I guess those which we coded for so far in the code.

Aside from the HIFNI stuff, the only such errors I can think of are the 5xx errors that we retry on, but if the repeated retrying fails, I don't think adding even more retrying is the best idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response HIFNI Zarr uploads failing with "A header you provided implies functionality that is not implemented" zarr
Projects
None yet
Development

No branches or pull requests

6 participants