s3cmd 2.0 hangs itself and slows down storage nodes #799
Comments
This is related to the S3 API Get Bucket Location:
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGETlocation.html
As LeoFS doesn't support this query, it falls back to a List Bucket operation (listing the bucket's contents), which is expensive on a large bucket.
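For reference, that request can be reproduced in isolation outside of s3cmd. Below is a minimal sketch using boto3; the endpoint, bucket name, and credentials are placeholders taken from or implied by this thread, not a verified reproduction recipe:

```python
# Minimal sketch: send only the GET Bucket location request to a LeoFS
# gateway -- the same call s3cmd 2.0 issues before its first operation.
# Endpoint, bucket and credentials are placeholders; adjust to your setup.
import boto3
from botocore.client import Config
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="http://leo-g0.dev.cloud.lan:8080",   # example gateway host from this thread
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    config=Config(s3={"addressing_style": "path"}),    # path-style URL, like s3cmd 2.0's /body/?location
)

try:
    # Against S3 proper this returns the bucket's LocationConstraint;
    # against a gateway without GET Bucket location support it should
    # surface the same InternalError / timeout seen in the report below.
    print(s3.get_bucket_location(Bucket="body"))
except ClientError as err:
    print("GetBucketLocation failed:", err.response["Error"]["Code"])
```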
On Thu, Aug 3, 2017 at 3:03, Vladimir Mosgalin <notifications@github.com> wrote:
… I start the cluster (which has lots of objects in the "body" bucket) with the latest devel version and do a single GET operation through s3cmd 2.0 (current stable), which hangs and eventually outputs an error:
$ s3cmd get s3://body/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz
WARNING: Retrying failed request: /?location (500 (InternalError): We encountered an internal error. Please try again.)
WARNING: Waiting 3 sec...
The load on all storage nodes is ~130% CPU and lasts for minutes even
after s3cmd is interrupted.
Gateway logs:
[W] ***@***.*** 2017-08-02 20:10:23.714351 +0300 1501693823 leo_gateway_rpc_handler:handle_error/5 303 ***@***.***'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] ***@***.*** 2017-08-02 20:10:53.715394 +0300 1501693853 leo_gateway_rpc_handler:handle_error/5 303 ***@***.***'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] ***@***.*** 2017-08-02 20:11:26.810332 +0300 1501693886 leo_gateway_rpc_handler:handle_error/5 303 ***@***.***'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] ***@***.*** 2017-08-02 20:11:56.811397 +0300 1501693916 leo_gateway_rpc_handler:handle_error/5 303 ***@***.***'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
Storage nodes info and error logs:
[I] ***@***.*** 2017-08-02 20:10:04.145006 +0300 1501693804 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,10411}]
[I] ***@***.*** 2017-08-02 20:10:14.939471 +0300 1501693814 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,10794}]
[I] ***@***.*** 2017-08-02 20:10:25.491070 +0300 1501693825 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,10551}]
[I] ***@***.*** 2017-08-02 20:10:40.279663 +0300 1501693840 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,16559}]
[I] ***@***.*** 2017-08-02 20:10:42.212179 +0300 1501693842 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,16721}]
[W] ***@***.*** 2017-08-02 20:10:53.720523 +0300 1501693853 leo_storage_handler_directory:find_by_parent_dir/4 78 ***@***.******@***.******@***.***']},{cause,"Could not get metadatas"}]
[W] ***@***.*** 2017-08-02 20:11:56.824512 +0300 1501693916 leo_storage_handler_directory:find_by_parent_dir/4 78 ***@***.******@***.******@***.***']},{cause,"Could not get metadatas"}]
They are the same on all storage nodes, except for node names.
Status of storage nodes is fine, according to manager. Queues are all empty. The `du` command hangs during this state.
The same happens for an `ls` operation executed on a single file.
Second try with debug logs enabled; the logs are slightly different (note that this isn't a direct continuation of the previous experiment, as I rolled back all VMs to an earlier snapshot). Gateway info log:
[D] ***@***.*** 2017-08-02 20:32:26.628294 +0300 1501695146 leo_storage_handler_object:prefix_search/3 909 Parent Dir: <<"body/">>, Marker: <<>>
[I] ***@***.*** 2017-08-02 20:32:42.801368 +0300 1501695162 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,16172}]
[I] ***@***.*** 2017-08-02 20:32:55.743152 +0300 1501695175 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,12941}]
[D] ***@***.*** 2017-08-02 20:32:56.622657 +0300 1501695176 leo_storage_handler_object:prefix_search/3 909 Parent Dir: <<"body/">>, Marker: <<>>
[I] ***@***.*** 2017-08-02 20:33:09.575299 +0300 1501695189 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,12952}]
[I] ***@***.*** 2017-08-02 20:33:21.929499 +0300 1501695201 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,26186}]
[I] ***@***.*** 2017-08-02 20:33:21.953124 +0300 1501695201 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,12377}]
The first line (prefix search) appeared right as I did the get operation.
Storage logs:
[D] ***@***.*** 2017-08-02 20:32:26.642627 +0300 1501695146 leo_storage_handler_object:prefix_search/3 909 Parent Dir: <<"body/">>, Marker: <<>>
[I] ***@***.*** 2017-08-02 20:32:44.154902 +0300 1501695164 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,17512}]
[D] ***@***.*** 2017-08-02 20:32:56.637981 +0300 1501695176 leo_storage_handler_object:prefix_search/3 909 Parent Dir: <<"body/">>, Marker: <<>>
[I] ***@***.*** 2017-08-02 20:33:02.9091 +0300 1501695182 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,17854}]
[I] ***@***.*** 2017-08-02 20:33:10.797467 +0300 1501695190 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,14159}]
[I] ***@***.*** 2017-08-02 20:33:25.813767 +0300 1501695205 leo_object_storage_event:handle_event/254 [{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,15016}]
tcpdump shows the difference: s3cmd 1.6.1 (which works) executes
GET http://body.s3.amazonaws.com/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz HTTP/1.1
as its first operation, while 2.0 executes
GET http://s3.amazonaws.com/body/?location HTTP/1.1
which hangs (and returns HTTP/1.1 500 Internal Server Error eventually)
The ~/.s3cfg config is the same; it has the lines
bucket_location = US
website_endpoint = http://%(bucket)s.s3-website-%(location)s.amazonaws.com/
The gateway is specified with the `proxy_host` option.
Related to Issue #780
Well, even if it's a small bucket which can be listed instantly, s3cmd still fails to work (except for very few operations like "list buckets"), because a bucket listing isn't what it expects to get as a reply to this request. So it probably won't work at all until that request is implemented. On the positive side, "s3cmd --configure" now asks for the endpoint URL and seems to be compatible with specifying the LeoFS gateway directly, without the proxy trick, which seems to work (at least for bucket listing).
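For illustration only (this is not s3cmd's actual code), a client that wants to tolerate a gateway without GET Bucket location could guard the call roughly like this, assuming a boto3-style client; the helper name and default region are made up for the example:

```python
# Illustrative sketch, not s3cmd code: fall back to a configured default
# region when the endpoint cannot answer GetBucketLocation.
from botocore.exceptions import ClientError

DEFAULT_REGION = "us-east-1"  # assumed default, analogous to bucket_location = US

def bucket_region(s3_client, bucket, default=DEFAULT_REGION):
    """Best-effort bucket region lookup that degrades gracefully."""
    try:
        constraint = s3_client.get_bucket_location(Bucket=bucket).get("LocationConstraint")
        return constraint or default  # AWS returns an empty constraint for us-east-1
    except ClientError:
        # The gateway answered with an error (e.g. the 500 InternalError above),
        # so assume the default instead of retrying.
        return default
```

s3cmd 2.0, by contrast, retries the failed /?location request with waits in between, which is what produces the "Retrying failed request" warnings quoted at the top of this issue.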
Let me dig down into this issue and check what happens.
…On Thu, Aug 3, 2017 at 11:03 PM, Vladimir Mosgalin ***@***.*** > wrote:
Well, even if it's small bucket which can be listed instantly, s3cmd still
fails to work (except for very few operations like "list buckets") because
list of buckets isn't what it expects to get as a reply for this request.
So it probably won't work at all until that request is implemented.
The new code that sends this request (https://github.com/s3tools/s3cmd/blob/master/S3/S3.py#L410-L415) indeed appeared in 2.0.
On a positive side, "s3cmd --configure" now asks for endpoint URL and
seems to be compatible with specifying LeoFS gateway directly, without
proxy trick:
Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
S3 Endpoint [s3.amazonaws.com]: leo-g0.dev.cloud.lan:8080
which seem to work (at least for bucket listing).
--
Wishes,
Wilson
I am having trouble reproducing the issue with s3cmd-2.0.0, as it does not trigger get bucket location during put/get/ls in my case; could you share your .s3cfg? Using … and it would fail with subsequent …
@windkit
It's basically the one generated by --configure, specifying the access/secret key and S3 endpoint (in the form host:port), with the rest left at defaults. Everything is exactly the same when using the old config from 1.6.1, which did access through a proxy:
bucket listing - …
@vstax thank you.
Yes, both directly and through the proxy (just `host_base` though; `host_bucket` always points to s3.amazonaws.com).
- Summary: with …
- Related Issue
- Action: Handle get bucket location (a sketch of the idea follows below)
- Logs
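To make the "Handle get bucket location" action item concrete: the idea is for the gateway to answer the ?location subresource with a small LocationConstraint document instead of falling through to an object listing. LeoFS itself is written in Erlang, so the following is only a Python sketch of that dispatch with made-up function names, not the actual gateway code:

```python
# Illustration only: serve GET <bucket>?location with a canned reply instead
# of treating it as an (expensive) object listing of the bucket.
from urllib.parse import parse_qs, urlsplit

LOCATION_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<LocationConstraint xmlns="http://s3.amazonaws.com/doc/2006-03-01/"/>'
)  # an empty constraint means the default (US Standard) region

def handle_bucket_get(raw_url, list_objects):
    """Dispatch a GET on a bucket: ?location gets the canned reply, anything
    else falls through to the object-listing handler (list_objects)."""
    parts = urlsplit(raw_url)
    query = parse_qs(parts.query, keep_blank_values=True)
    if "location" in query:
        return 200, LOCATION_XML
    return 200, list_objects(parts.path.lstrip("/"))

# e.g. handle_bucket_get("/body/?location", my_listing_fn) -> (200, LOCATION_XML)
```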
While listing a small bucket should be quick, it could be the …
Could you test this again with the develop branch? @vstax
@windkit …