PubSub: check batch max bytes against request byte size #7108
Comments
At the very least, we should ensure that the default settings do not result in request sizes that are too large.
@kir-titievsky, it looks like the Python client currently behaves like the Java client. This feature request would constitute a departure in the logic between the clients. It's certainly something to consider, but I'm going to remove the "triaged for GA" label for now.
hm... I would venture to say this is a bug in both Java and Python, rather than an FR, in that the client's batching API does not take into consideration a known property of the service. But I agree that this is not GA blocking.
@kir-titievsky Based on my somewhat rusty memory, we actually did raise early on when a single message was larger than the allowed maximum.
@tseaver, this issue is about how we calculate the size of a The Java equivalent of If that holds up in all languages, then the current calculations are off by |
Agreed on not blocking GA on this.
The calculations can be off by more than that.
FWIW: It appears to me that Java uses an extremely low default batch size threshold of 1000 bytes, if I'm understanding the code correctly: https://github.com/googleapis/google-cloud-java/blob/67668c1411169338374b050eae50ed650e318c54/google-cloud-clients/google-cloud-pubsub/src/main/java/com/google/cloud/pubsub/v1/Publisher.java#L451 This would definitely help avoid the problem, since it will call Publish for basically every message.
Hey, we just bumped into this. Here is an example of the error: as you can see, the request size is a few bytes over 10 MB, which I suspect comes from the request data. It's very rare, though (I estimated something like 0.06% in our use case). We "solved" it (worked around it?) by setting MAX_SIZE in BatchSettings to 9.5 MB to reserve more space for the request data.
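In the Python client, the corresponding knob is `BatchSettings.max_bytes` (assuming that is what MAX_SIZE refers to here); a minimal sketch of that workaround, with placeholder project and topic names:

```python
from google.cloud import pubsub_v1

# Leave roughly 0.5 MB of headroom below the 10 MB server-side request limit so
# that per-message framing overhead cannot push the final request over it.
batch_settings = pubsub_v1.types.BatchSettings(max_bytes=9_500_000)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)

topic_path = publisher.topic_path("example-project", "example-topic")
future = publisher.publish(topic_path, b"payload")
print(future.result())  # message ID once the batch is actually sent
```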
@hnykda Thanks for the report. It might be that the default MAX_SIZE setting is too high, or that there is a bug in computing the total message size (or both), resulting in the reported behavior. It's good to hear that a (seemingly working) workaround exists, but it will still be worth taking another look at this.
Happy to help, I definitely agree this should get fixed 👍 It doesn't seem to be super complex at first sight. Thanks for the work!
Since there is a size check in the code, but it does not prevent the "request_size is too large" error in all cases, I am re-classifying this as a bug. |
The total publish request size increases by more than just the byte size of each message added:

```python
>>> from google.cloud.pubsub import types
>>> req = types.PublishRequest(topic=b"", messages=[])
>>> req.ByteSize()
0
>>> msg = types.PubsubMessage(data=b"foo")
>>> msg.ByteSize()
5
>>> req.messages.append(msg)
>>> req.ByteSize()  # increased by 2 + msg size
7
>>> big_msg = types.PubsubMessage(data=b"x" * 999996)
>>> big_msg.ByteSize()
1000000
>>> req.messages.append(big_msg)
>>> req.ByteSize()  # increased by 4 + big_msg size
1000011
```

Currently, the batch overflow logic does not take this overhead into account; it only considers the total message count and size. Let's check the upper bound on the overhead of a single message:

```python
>>> SERVER_MAX_SIZE = 10_000_000
>>> huge_msg = types.PubsubMessage(data=b"x" * (SERVER_MAX_SIZE - 10))
>>> huge_msg.ByteSize()
9999995
>>> req = types.PublishRequest(topic=b"", messages=[])  # ByteSize() == 0
>>> req.messages.append(huge_msg)
>>> req.ByteSize()  # 5 + huge_msg size
10000000
```

The maximum publish request size that the backend will still accept is 10e6 bytes, which means the maximum byte size of a message that still fits into such a request is 9_999_995. At that size, the message's length overhead consumes 5 bytes of the `PublishRequest`. As an approximation, the batch size computation logic can be adjusted to add this worst-case overhead for every message.
Update:
A possibly slower alternative (needs profiling) is computing the message's size contribution directly, as already mentioned in the issue description:

```python
new_size = self._size + types.PublishRequest(messages=[message]).ByteSize()
```

The benefit of this approach is that any future changes to the internal protobuf encoding are less likely to break it, so it should be preferred if its computational overhead is acceptable.
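The profiling could start with a rough micro-benchmark along these lines (a sketch only, assuming the same `types` API as in the snippets above; the payload size and iteration count are arbitrary):

```python
import timeit

from google.cloud.pubsub import types

message = types.PubsubMessage(data=b"x" * 1024)


def exact_contribution():
    # Wrap the message in a throwaway PublishRequest to measure its true
    # serialized contribution, framing overhead included.
    return types.PublishRequest(messages=[message]).ByteSize()


def approximate_contribution():
    # Message size plus a fixed worst-case framing overhead (5 bytes, per the
    # analysis above).
    return message.ByteSize() + 5


print(timeit.timeit(exact_contribution, number=100_000))
print(timeit.timeit(approximate_contribution, number=100_000))
```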
Is your feature request related to a problem? Please describe.
`google.cloud.pubsub_v1.publisher._batch.thread.Batch` enforces max bytes on the sum of `PubsubMessage.ByteSize()` for each message in the batch, but the `PublishRequest` created by `Batch.client.publish` is larger than that. As a result, I need to specify a non-default max bytes value to guarantee that batches create valid requests.

Describe the solution you'd like
Enforce max bytes on the size of the `PublishRequest` created by `Batch.client.publish`:
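As a rough sketch of that idea (illustrative only; the helper and topic name below are made up, assuming the `types.PublishRequest` API shown in the comments above, not the library's actual implementation), the batch could measure the whole request it would send before accepting a message:

```python
from google.cloud.pubsub import types


def request_would_fit(current_messages, candidate, max_bytes):
    # Build the request that publishing would actually send and measure its
    # full serialized size, instead of summing the individual message sizes.
    request = types.PublishRequest(
        topic="projects/example-project/topics/example-topic",  # placeholder
        messages=list(current_messages) + [candidate],
    )
    return request.ByteSize() <= max_bytes


msg = types.PubsubMessage(data=b"x" * 1_000)
print(request_would_fit([], msg, max_bytes=10_000_000))  # True
```

Rebuilding the whole request for every candidate message grows linearly with the batch size, which is why the incremental `self._size + PublishRequest(messages=[message]).ByteSize()` variant discussed in the comments may be preferable.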
Describe alternatives you've considered
I can set max bytes to guarantee sufficient overhead, but I feel like it would be better if I didn't need to, and this may also result in fewer batches.
Additional context
See also #7107, which suggests adding an option to enforce this setting even when the batch is empty.