
s3cmd 2.0 hangs itself and slows down storage nodes #799

Closed

vstax opened this issue Aug 2, 2017 · 12 comments

vstax commented Aug 2, 2017

I start the cluster (which has lots of objects in the "body" bucket) with the latest develop version and do a single GET operation through s3cmd 2.0 (current stable), which hangs and eventually outputs an error:

$ s3cmd get s3://body/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz
WARNING: Retrying failed request: /?location (500 (InternalError): We encountered an internal error. Please try again.)
WARNING: Waiting 3 sec...

The load on all storage nodes is ~130% CPU and lasts for minutes even after s3cmd is interrupted.

Gateway logs:

[W]	gateway_0@192.168.3.52	2017-08-02 20:10:23.714351 +0300	1501693823	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_2@192.168.3.55'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-08-02 20:10:53.715394 +0300	1501693853	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_1@192.168.3.54'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-08-02 20:11:26.810332 +0300	1501693886	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_2@192.168.3.55'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-08-02 20:11:56.811397 +0300	1501693916	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_1@192.168.3.54'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]

Storage node info and error logs:

[I]	storage_1@192.168.3.54	2017-08-02 20:10:04.145006 +0300	1501693804	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,10411}]
[I]	storage_1@192.168.3.54	2017-08-02 20:10:14.939471 +0300	1501693814	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,10794}]
[I]	storage_1@192.168.3.54	2017-08-02 20:10:25.491070 +0300	1501693825	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,10551}]
[I]	storage_1@192.168.3.54	2017-08-02 20:10:40.279663 +0300	1501693840	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,16559}]
[I]	storage_1@192.168.3.54	2017-08-02 20:10:42.212179 +0300	1501693842	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,16721}]
[W]	storage_1@192.168.3.54	2017-08-02 20:10:53.720523 +0300	1501693853	leo_storage_handler_directory:find_by_parent_dir/4	78	[{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54','storage_0@192.168.3.53']},{cause,"Could not get metadatas"}]
[W]	storage_1@192.168.3.54	2017-08-02 20:11:56.824512 +0300	1501693916	leo_storage_handler_directory:find_by_parent_dir/4	78	[{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54','storage_0@192.168.3.53']},{cause,"Could not get metadatas"}]

They are the same on all storage nodes, except for node names.

The status of the storage nodes is fine according to the manager, and all queues are empty. The du command hangs during this state.

The same happens for an ls operation executed on a single file.

Second try with debug logs enabled; the logs are slightly different (note that this isn't a direct continuation of the previous experiment, as I rolled back all VMs to an earlier snapshot). Logs from storage_0:

[D]	storage_0@192.168.3.53	2017-08-02 20:32:26.628294 +0300	1501695146	leo_storage_handler_object:prefix_search/3	909	Parent Dir: <<"body/">>, Marker: <<>>
[I]	storage_0@192.168.3.53	2017-08-02 20:32:42.801368 +0300	1501695162	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,16172}]
[I]	storage_0@192.168.3.53	2017-08-02 20:32:55.743152 +0300	1501695175	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,12941}]
[D]	storage_0@192.168.3.53	2017-08-02 20:32:56.622657 +0300	1501695176	leo_storage_handler_object:prefix_search/3	909	Parent Dir: <<"body/">>, Marker: <<>>
[I]	storage_0@192.168.3.53	2017-08-02 20:33:09.575299 +0300	1501695189	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,12952}]
[I]	storage_0@192.168.3.53	2017-08-02 20:33:21.929499 +0300	1501695201	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,26186}]
[I]	storage_0@192.168.3.53	2017-08-02 20:33:21.953124 +0300	1501695201	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,12377}]

The first line (the prefix search) appeared right as I executed the get operation.

Logs from storage_1:

[D]	storage_1@192.168.3.54	2017-08-02 20:32:26.642627 +0300	1501695146	leo_storage_handler_object:prefix_search/3	909	Parent Dir: <<"body/">>, Marker: <<>>
[I]	storage_1@192.168.3.54	2017-08-02 20:32:44.154902 +0300	1501695164	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,17512}]
[D]	storage_1@192.168.3.54	2017-08-02 20:32:56.637981 +0300	1501695176	leo_storage_handler_object:prefix_search/3	909	Parent Dir: <<"body/">>, Marker: <<>>
[I]	storage_1@192.168.3.54	2017-08-02 20:33:02.9091 +0300	1501695182	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,17854}]
[I]	storage_1@192.168.3.54	2017-08-02 20:33:10.797467 +0300	1501695190	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,14159}]
[I]	storage_1@192.168.3.54	2017-08-02 20:33:25.813767 +0300	1501695205	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"body/">>},{processing_time,15016}]

tcpdump shows the difference: s3cmd 1.6.1 (which works) executes

GET http://body.s3.amazonaws.com/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz HTTP/1.1

as its first operation, while 2.0 executes

GET http://s3.amazonaws.com/body/?location HTTP/1.1

which hangs (and eventually returns HTTP/1.1 500 Internal Server Error).
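
For reference, the two request lines can be replayed without s3cmd. A rough Python sketch against the gateway used as a proxy (header handling is simplified and authentication is omitted, so treat it as an illustration rather than a working client):

import http.client

KEY = "b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz"

# s3cmd 1.6.1 starts with a plain object GET (absolute-form URI, as sent to a proxy).
conn = http.client.HTTPConnection("leo-g0.dev.cloud.lan", 8080)
conn.request("GET", "http://body.s3.amazonaws.com/" + KEY)
print(conn.getresponse().status)

# s3cmd 2.0 starts by asking for the bucket location instead.
conn = http.client.HTTPConnection("leo-g0.dev.cloud.lan", 8080)
conn.request("GET", "http://s3.amazonaws.com/body/?location")
print(conn.getresponse().status)  # hangs, eventually 500 on this LeoFS build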

The ~/.s3cfg config is the same; it has the lines

bucket_location = US
website_endpoint = http://%(bucket)s.s3-website-%(location)s.amazonaws.com/

The gateway is specified with the proxy_host option.


windkit commented Aug 2, 2017 via email


windkit commented Aug 3, 2017

Related to Issue #780


vstax commented Aug 3, 2017

Well, even if it's a small bucket that can be listed instantly, s3cmd still fails to work (except for a very few operations like "list buckets"), because a list of buckets isn't what it expects to get as a reply to this request. So it probably won't work at all until that request is implemented.
The new code that sends this request before almost any operation (https://github.com/s3tools/s3cmd/blob/master/S3/S3.py#L410-L415) indeed appeared in 2.0. Apparently it's supposed to cache the result, but since it can't succeed even once, I can't check that.
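
For context, a minimal paraphrase of that caching pattern follows; the names below (s3.get and the response shape) are stand-ins for s3cmd's internal request machinery, not its actual API:

import re

# Paraphrased sketch of s3cmd 2.0's per-bucket region lookup.
_region_cache = {}

def bucket_region(s3, bucket):
    if bucket not in _region_cache:
        # Sends GET /<bucket>/?location before the actual operation.
        response = s3.get(bucket, uri_params={"location": None})
        # A conforming reply is <LocationConstraint>region</LocationConstraint>.
        # LeoFS returns a <ListBucketResult> listing instead, so the match fails
        # here and no region is ever cached.
        m = re.search(r"<LocationConstraint[^>]*>(.*?)</LocationConstraint>",
                      response["data"])
        _region_cache[bucket] = m.group(1) if (m and m.group(1)) else "us-east-1"
    return _region_cache[bucket]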

On the positive side, "s3cmd --configure" now asks for the endpoint URL and seems to be compatible with specifying the LeoFS gateway directly, without the proxy trick:

Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
S3 Endpoint [s3.amazonaws.com]: leo-g0.dev.cloud.lan:8080

which seems to work (at least for bucket listing).


windkit commented Aug 4, 2017 via email


windkit commented Aug 6, 2017

I am having trouble reproducing the issue with s3cmd-2.0.0, as it does not trigger a get bucket location request during put/get/ls in my case. Could you share your .s3cfg?

Using s3cmd info s3://test/ does trigger a /?location request, and LeoFS responds with a bucket listing:

DEBUG: Sending request method_string='GET', uri=u'http://test.s3.amazonaws.com/?location', headers={'Authorization': u'AWS 05236:QS6n33QSE70xM9F+Jp9DT92Ds1g=', 'x-amz-date': 'Sun, 06 Aug 2017 04:13:42 +0000'}, body=(0 bytes)
DEBUG: ConnMan.put(): closing proxy connection (keep-alive not yet supported)
DEBUG: Response:
{'data': '<?xml version="1.0" encoding="UTF-8"?><ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>test</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><Delimiter>/</Delimiter><Contents><Key>test</Key><LastModified>2017-08-06T03:48:03.0
00Z</LastModified><ETag>881f1f9e800e033d59e2d63e6ab81642</ETag><Size>143786</Size><StorageClass>STANDARD</StorageClass><Owner><ID>leofs</ID><DisplayName>leofs</DisplayName></Owner></Contents><IsTruncated>false</IsTruncated><NextMarker></NextMarker></ListBucketResult>',
 'headers': {'connection': 'close',
             'content-type': 'application/xml',
             'date': 'Sun, 06 Aug 2017 04:13:41 GMT',
             'server': 'LeoFS',
             'transfer-encoding': 'identity'},
 'reason': 'OK',
 'status': 200}

and it then fails on the subsequent /?lifecycle query.


vstax commented Aug 7, 2017

@windkit
Here it is:

[default]
access_key = 8b8124cd9ba4e9e191bf
access_token = 
add_encoding_exts = 
add_headers = 
bucket_location = US
ca_certs_file = 
cache_file = 
check_ssl_certificate = True
check_ssl_hostname = True
cloudfront_host = cloudfront.amazonaws.com
default_mime_type = binary/octet-stream
delay_updates = False
delete_after = False
delete_after_fetch = False
delete_removed = False
dry_run = False
enable_multipart = True
encoding = UTF-8
encrypt = False
expiry_date = 
expiry_days = 
expiry_prefix = 
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = 
guess_mime_type = True
host_base = leo-g0.dev.cloud.lan:8080
host_bucket = %(bucket)s.s3.amazonaws.com
human_readable_sizes = False
invalidate_default_index_on_cf = False
invalidate_default_index_root_on_cf = True
invalidate_on_cf = False
kms_key = 
limit = -1
limitrate = 0
list_md5 = False
log_target_prefix = 
long_listing = False
max_delete = -1
mime_type = 
multipart_chunk_size_mb = 15
multipart_max_chunks = 10000
preserve_attrs = True
progress_meter = True
proxy_host = 
proxy_port = 0
put_continue = False
recursive = False
recv_chunk = 65536
reduced_redundancy = False
requester_pays = False
restore_days = 1
restore_priority = Standard
secret_key = <>
send_chunk = 65536
server_side_encryption = False
signature_v2 = False
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 300
stats = False
stop_on_error = False
storage_class = 
urlencoding_mode = normal
use_http_expect = False
use_https = False
use_mime_magic = True
verbosity = WARNING
website_endpoint = http://%(bucket)s.s3-website-%(location)s.amazonaws.com/
website_error = 
website_index = index.html

It's basically the one generated by --configure, specifying the access/secret key and the S3 endpoint (in host:port form), with the rest left at defaults. Everything is exactly the same when using the old config from 1.6.1, which accessed the gateway through a proxy:

host_base = s3.amazonaws.com
host_bucket = %(bucket)s.s3.amazonaws.com
proxy_host = leo-g0.dev.cloud.lan
proxy_port = 8080

Bucket listing (s3cmd ls s3://) is basically the only thing that works; get or ls on a file or directory triggers the bucket location request, which fails.


windkit commented Aug 7, 2017

@vstax thank you.
Just one more question: with host_base set to LeoFS directly, does the same issue remain?


vstax commented Aug 7, 2017

Yes, the same issue remains both directly and through the proxy (only host_base is changed, though; host_bucket always points to s3.amazonaws.com).


windkit commented Aug 8, 2017

Summary

With s3cmd-2.0.0, a get bucket location request is issued when the configured bucket region is US and signature v4 is used.
LeoFS fails to recognize the ?location query and processes it as a plain get bucket (object listing) request.

Related Issue

#780

Action

Handle get bucket location
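
For reference, a conforming reply to GET /<bucket>/?location is a small LocationConstraint document rather than an object listing, along these lines (an empty value is how AWS denotes the default US region):

<?xml version="1.0" encoding="UTF-8"?>
<LocationConstraint xmlns="http://s3.amazonaws.com/doc/2006-03-01/"></LocationConstraint>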

Logs

host_base = node01:12345
host_bucket = %(bucket)s.node01:12345
...
DEBUG: CreateRequest: resource[uri]=/test
DEBUG: ===== Send_file inner request to determine the bucket region =====
DEBUG: CreateRequest: resource[uri]=/
DEBUG: Using signature v4
DEBUG: get_hostname(test): node01:12345
DEBUG: canonical_headers = host:node01:12345
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20170808T011810Z

DEBUG: Canonical Request:
GET
/test/
location=
host:node01:12345
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20170808T011810Z

host;x-amz-content-sha256;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
----------------------
DEBUG: signature-v4 headers: {'x-amz-content-sha256': u'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': u'AWS4-HMAC-SHA256 Credential=05236/20170808/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=
37dcf56f4c0bacbfa3c9e5e70f89a18a2c02323ec898da86d2929c0650a9d2e3', 'x-amz-date': '20170808T011810Z'}
DEBUG: Processing request, please wait...
DEBUG: get_hostname(test): node01:12345
DEBUG: ConnMan.get(): creating new connection: http://node01:12345
DEBUG: non-proxied HTTPConnection(node01, 12345)
DEBUG: format_uri(): /test/?location
DEBUG: Sending request method_string='GET', uri=u'/test/?location', headers={'x-amz-content-sha256': u'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': u'AWS4-HMAC-SHA256 Credential=05236/20170808/us-east-1/s3/aws4_request,SignedHeaders=
host;x-amz-content-sha256;x-amz-date,Signature=37dcf56f4c0bacbfa3c9e5e70f89a18a2c02323ec898da86d2929c0650a9d2e3', 'x-amz-date': '20170808T011810Z'}, body=(0 bytes)
DEBUG: ConnMan.put(): connection put back to pool (http://node01:12345#1)
DEBUG: Response:
{'data': '<?xml version="1.0" encoding="UTF-8"?><ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>test</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><Delimiter>/</Delimiter><Contents><Key>test</Key><LastModified>2017-08-08T01:02:42.0
00Z</LastModified><ETag>881f1f9e800e033d59e2d63e6ab81642</ETag><Size>143786</Size><StorageClass>STANDARD</StorageClass><Owner><ID>leofs</ID><DisplayName>leofs</DisplayName></Owner></Contents><IsTruncated>false</IsTruncated><NextMarker></NextMarker></ListBucketResult>',
 'headers': {'connection': 'close',
             'content-type': 'application/xml',
             'date': 'Tue, 08 Aug 2017 01:18:08 GMT',
             'server': 'LeoFS',
             'transfer-encoding': 'identity'},
 'reason': 'OK',
 'status': 200}
DEBUG: ===== END Send_file inner request to determine the bucket region ('us-east-1') =====
upload: 's3cmd' -> 's3://test/test'  [1 of 1]
DEBUG: DeUnicodising u's3cmd' using UTF-8
DEBUG: Using signature v4
DEBUG: get_hostname(test): test.node01:12345
DEBUG: canonical_headers = content-length:143786
content-type:binary/octet-stream
host:test.node01:12345
x-amz-content-sha256:80bd9f86a536a4c5b1ab8c8417465b690c43079cada83360ad58eb371ffbcdd5
x-amz-date:20170808T011810Z
x-amz-meta-s3cmd-attrs:atime:1502153153/ctime:1502153146/gid:2001/gname:rit/md5:881f1f9e800e033d59e2d63e6ab81642/mode:33261/mtime:1502153146/uid:506/uname:wilson
x-amz-storage-class:STANDARD

DEBUG: Canonical Request:
PUT
/test

content-length:143786
content-type:binary/octet-stream
host:test.node01:12345
x-amz-content-sha256:80bd9f86a536a4c5b1ab8c8417465b690c43079cada83360ad58eb371ffbcdd5
x-amz-date:20170808T011810Z
x-amz-meta-s3cmd-attrs:atime:1502153153/ctime:1502153146/gid:2001/gname:rit/md5:881f1f9e800e033d59e2d63e6ab81642/mode:33261/mtime:1502153146/uid:506/uname:wilson
x-amz-storage-class:STANDARD

content-length;content-type;host;x-amz-content-sha256;x-amz-date;x-amz-meta-s3cmd-attrs;x-amz-storage-class
80bd9f86a536a4c5b1ab8c8417465b690c43079cada83360ad58eb371ffbcdd5
----------------------
DEBUG: signature-v4 headers: {'x-amz-content-sha256': u'80bd9f86a536a4c5b1ab8c8417465b690c43079cada83360ad58eb371ffbcdd5', 'content-length': '143786', 'x-amz-storage-class': 'STANDARD', 'x-amz-meta-s3cmd-attrs': u'atime:1502153153/ctime:1502153146/gid:2001/gname:rit/md5:8
81f1f9e800e033d59e2d63e6ab81642/mode:33261/mtime:1502153146/uid:506/uname:wilson', 'x-amz-date': '20170808T011810Z', 'content-type': 'binary/octet-stream', 'Authorization': u'AWS4-HMAC-SHA256 Credential=05236/20170808/us-east-1/s3/aws4_request,SignedHeaders=content-length
;content-type;host;x-amz-content-sha256;x-amz-date;x-amz-meta-s3cmd-attrs;x-amz-storage-class,Signature=dc17089a4622c8cf0ff7c4e9300bb4e5a7418e355380a064ee0d359a535f99e8'}
DEBUG: get_hostname(test): test.node01:12345
DEBUG: ConnMan.get(): creating new connection: http://test.node01:12345
DEBUG: non-proxied HTTPConnection(test.node01, 12345)
DEBUG: format_uri(): /test
  65536 of 143786    45% in    0s     7.93 MB/sDEBUG: ConnMan.put(): connection put back to pool (http://test.node01:12345#1)
DEBUG: Response:
{'data': '',
 'headers': {'connection': 'keep-alive',
             'content-length': '0',
             'date': 'Tue, 08 Aug 2017 01:18:08 GMT',
             'etag': '"881f1f9e800e033d59e2d63e6ab81642"',
             'server': 'LeoFS'},
 'reason': 'OK',
 'size': 143786,
 'status': 200}
 143786 of 143786   100% in    0s    10.66 MB/s  done
DEBUG: MD5 sums: computed=881f1f9e800e033d59e2d63e6ab81642, received=881f1f9e800e033d59e2d63e6ab81642
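
An aside on the log above: the Canonical Request and signature follow the standard AWS signature v4 derivation, so the values can be checked independently. A minimal sketch with a placeholder secret key:

import hashlib, hmac

def hmac_sha256(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

secret_key = "..."  # placeholder; the real key never appears in logs
empty_sha = hashlib.sha256(b"").hexdigest()  # the e3b0c442... value above

canonical_request = "\n".join([
    "GET", "/test/", "location=",
    "host:node01:12345",
    "x-amz-content-sha256:" + empty_sha,
    "x-amz-date:20170808T011810Z",
    "",  # blank line terminating the signed headers
    "host;x-amz-content-sha256;x-amz-date",
    empty_sha,
])

string_to_sign = "\n".join([
    "AWS4-HMAC-SHA256",
    "20170808T011810Z",
    "20170808/us-east-1/s3/aws4_request",
    hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
])

signing_key = hmac_sha256(("AWS4" + secret_key).encode("utf-8"), "20170808")
for part in ("us-east-1", "s3", "aws4_request"):
    signing_key = hmac_sha256(signing_key, part)

# With the real secret key, this reproduces the Signature=37dcf5... value above.
signature = hmac.new(signing_key, string_to_sign.encode("utf-8"),
                     hashlib.sha256).hexdigest()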


windkit commented Aug 9, 2017

While listing a small bucket should be quick, it could be that leo_storage is blocked by a previous large bucket listing.


windkit commented Nov 9, 2017

Could you test this again with the develop branch? @vstax


vstax commented Nov 9, 2017

@windkit
Yes, it works fine now, both with the proxy connection method and when specifying the gateway directly in the host_base/host_bucket options. Thank you.

vstax closed this as completed Nov 9, 2017