
PubSub: check batch max bytes against request byte size #7108

Closed
relud opened this issue Jan 9, 2019 · 13 comments · Fixed by #9911
Assignees
Labels
api: pubsub Issues related to the Pub/Sub API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@relud
Contributor

relud commented Jan 9, 2019

Is your feature request related to a problem? Please describe.

google.cloud.pubsub_v1.publisher._batch.thread.Batch enforces max bytes on the sum of PubsubMessage.ByteSize() for each message in the batch, but the PublishRequest created by Batch.client.publish is larger than that. As a result I need to specify a non-default max bytes value to guarantee that batches produce valid requests.

Describe the solution you'd like

Enforce max bytes on the size of the PublishRequest created by Batch.client.publish:

diff --git a/pubsub/google/cloud/pubsub_v1/publisher/_batch/thread.py b/pubsub/google/cloud/pubsub_v1/publisher/_batch/thread.py
index f187024b7c..67fc04841a 100644
--- a/pubsub/google/cloud/pubsub_v1/publisher/_batch/thread.py
+++ b/pubsub/google/cloud/pubsub_v1/publisher/_batch/thread.py
@@ -76,7 +76,7 @@ class Batch(base.Batch):
         # any writes to them use the "state lock" to remain atomic.
         self._futures = []
         self._messages = []
-        self._size = 0
+        self._size = types.PublishRequest(topic=topic, messages=[]).ByteSize()
         self._status = base.BatchStatus.ACCEPTING_MESSAGES

         # If max latency is specified, start a thread to monitor the batch and
@@ -281,7 +281,7 @@ class Batch(base.Batch):
             if not self.will_accept(message):
                 return future

-            new_size = self._size + message.ByteSize()
+            new_size = self._size + types.PublishRequest(messages=[message]).ByteSize()
             new_count = len(self._messages) + 1
             overflow = (
                 new_size > self.settings.max_bytes

Describe alternatives you've considered

I can set max bytes low enough to guarantee sufficient headroom, but I feel it would be better if I didn't need to, and checking the actual request size may also result in fewer batches.

Additional context

see also #7107, which suggests adding an option to enforce this setting even when the batch is empty.
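
For illustration, here is a minimal sketch of the discrepancy (using the same protobuf-based types module as above; the topic name is hypothetical, and the numbers are simply what protobuf's wire format predicts):

from google.cloud.pubsub_v1 import types

topic = "projects/example-project/topics/example-topic"  # hypothetical
messages = [types.PubsubMessage(data=b"x" * 1000) for _ in range(10)]

# What the batch tracks today: the sum of the individual message sizes.
tracked_size = sum(msg.ByteSize() for msg in messages)

# What actually goes over the wire: the full PublishRequest, which adds the
# topic field plus per-message framing (field tag + length prefix).
request_size = types.PublishRequest(topic=topic, messages=messages).ByteSize()

print(tracked_size)   # 10030
print(request_size)   # 10107: +47 for this topic field, +3 framing bytes per message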

@relud relud changed the title PubSub: check BatchSettings.max_bytes against PublishRequest.ByteSize PubSub: check batch max bytes against request byte size Jan 9, 2019
@tseaver tseaver added api: pubsub Issues related to the Pub/Sub API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Jan 9, 2019
@kir-titievsky

At the very least, we should ensure that the default settings do not result in request sizes that are too large.

@sduskis
Contributor

sduskis commented May 20, 2019

@kir-titievsky, it looks like the Python client currently behaves like the Java client. This feature request would make the Python logic diverge from the other clients. It's certainly something to consider, but I'm going to remove the "triaged for GA" label for now.

@kir-titievsky

Hm... I would venture to say this is a bug in both Java and Python, rather than a feature request, in that the client's batching API does not take a known property of the service into consideration. But I agree that this is not GA blocking.

@tseaver
Contributor

tseaver commented May 20, 2019

@kir-titievsky Based on my somewhat rusty memory, we actually did raise early on when a single message was larger than max_bytes (#4872), and re-worked it (#5343) per request (#4608): see @theacodes comment that we needed to match Java / Go.

@sduskis
Contributor

sduskis commented May 21, 2019

@tseaver, this issue is about how we calculate the size of a PublishRequest, and not what we do when the size is too large.

The Java equivalent of types.PublishRequest(topic=topic, messages=[]).ByteSize() is 0. The smallest value of types.PublishRequest(messages=[message]).ByteSize() (assuming message is empty) is 2, and it goes up by 2 + messageSize for every message.

If that holds up in all languages, then the current calculations are off by len(messages) * 2 bytes. @kir-titievsky, I personally think that we'd have to consider this a "new feature" for all languages, and that this ought not be a GA blocking feature.
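
For reference, a quick check of those numbers with the Python types (the outputs follow from protobuf's wire format: one tag byte plus a varint length prefix per embedded message):

>>> from google.cloud.pubsub_v1 import types
>>> types.PublishRequest(topic="", messages=[]).ByteSize()
0
>>> empty_msg = types.PubsubMessage()
>>> types.PublishRequest(messages=[empty_msg]).ByteSize()  # 1 tag byte + 1 length byte
2
>>> small_msg = types.PubsubMessage(data=b"x" * 10)
>>> small_msg.ByteSize()
12
>>> types.PublishRequest(messages=[small_msg]).ByteSize()  # 2 + small_msg size
14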

@kir-titievsky

kir-titievsky commented May 21, 2019 via email

@evanj
Contributor

evanj commented Jun 7, 2019

The calculations can be off by more than len(messages)*2 since protobuf length encoding is variable sized. E.g. I have a test program that adds messages of length 999996. The total length of the request goes up by message.ByteSize() + 4 for each message, since it takes 3 bytes to encode the length of the embedded message in the protobuf representation of a repeated field.
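
To make the arithmetic explicit: the per-message overhead is one tag byte plus a varint-encoded length prefix, so it grows with the message's own size. A small illustrative helper (not part of the library):

def varint_len(n):
    """Number of bytes protobuf uses to encode the unsigned varint n."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def per_message_overhead(message_byte_size):
    """Bytes added on top of message_byte_size when the message is embedded
    in the repeated PublishRequest.messages field: 1 tag byte + length prefix."""
    return 1 + varint_len(message_byte_size)


print(per_message_overhead(100))        # 2
print(per_message_overhead(1_000_000))  # 4, matching the +4 observed above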

@evanj
Contributor

evanj commented Jun 7, 2019

FWIW: It appears to me that Java uses an extremely low default batch size threshold of 1000 bytes, if I'm understanding the code correctly: https://github.com/googleapis/google-cloud-java/blob/67668c1411169338374b050eae50ed650e318c54/google-cloud-clients/google-cloud-pubsub/src/main/java/com/google/cloud/pubsub/v1/Publisher.java#L451

This would definitely help avoid the problem, since it will call Publish for basically every message.

@plamut plamut removed their assignment Aug 15, 2019
@hnykda

hnykda commented Nov 29, 2019

Hey, we just bumped into this. In the error we saw, the request size was a few bytes over 10 MB; I suspect the extra bytes come from the request overhead. It's very rare though (I estimated something like 0.06% of requests in our use case). I think we "solved" it (worked around it?) by setting MAX_SIZE in BatchSettings to 9.5 MB to reserve more space for the request overhead.
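
For anyone hitting the same thing, the workaround looks roughly like this (a sketch, assuming the standard BatchSettings / PublisherClient constructor; the 9.5 MB figure is the one mentioned above and may need tuning):

from google.cloud import pubsub_v1

# Reserve headroom below the 10 MB server limit for the topic field and the
# per-message framing that the batch size check does not currently count.
batch_settings = pubsub_v1.types.BatchSettings(max_bytes=int(9.5 * 1000 * 1000))
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)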

@plamut
Contributor

plamut commented Nov 29, 2019

@hnykda Thanks for the report. It might be that the default MAX_SIZE setting is too high, or that there is a bug in computing the total message size (or both), resulting in the behavior reported.

It's good to hear that a (seemingly working) workaround exists, but it will still be worth taking another look at this.

@hnykda

hnykda commented Nov 29, 2019

Happy to help, and I definitely agree this should get fixed 👍; it doesn't seem super complex at first sight.

Thanks for the work!

@plamut
Contributor

plamut commented Nov 29, 2019

There is a size check in the code, but since it does not prevent the "request_size is too large" error in all cases, I am re-classifying this as a bug.

@plamut plamut added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Nov 29, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 29, 2019
@plamut plamut self-assigned this Nov 29, 2019
@plamut plamut removed the 🚨 This issue needs some love. label Nov 29, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 29, 2019
@plamut
Contributor

plamut commented Dec 2, 2019

The total publish request size increases by N + message.ByteSize(), where N depends on the length of the message:

>>> from google.cloud.pubsub import types
>>> req = types.PublishRequest(topic=b"", messages=[])
>>> req.ByteSize()
0
>>> msg = types.PubsubMessage(data=b"foo")
>>> msg.ByteSize()
5
>>> req.messages.append(msg)
>>> req.ByteSize()  # increased by 2 + msg size
7
>>> big_msg = types.PubsubMessage(data=b"x" * 999996)
>>> big_msg.ByteSize()
1000000
>>> req.messages.append(big_msg)
>>> req.ByteSize()  # increased by 4 + big_msg size
1000011

The current batch overflow logic does not take this overhead into account; it only considers the total message count and size.

Let's check the upper bound on the overhead of a single message:

>>> SERVER_MAX_SIZE = 10_000_000
>>> huge_msg = types.PubsubMessage(data=b"x" * (SERVER_MAX_SIZE - 10))
>>> huge_msg.ByteSize()
9999995
>>> req = types.PublishRequest(topic=b"", messages=[])  # ByteSize() == 0
>>> req.messages.append(huge_msg)
>>> req.ByteSize()  # 5 + huge_msg size
10000000

The maximum publish request size that will still be accepted by the backend is 10e6 bytes, which means that the maximum byte size of a message that still fits into it is 9_999_995. At that size, the message framing overhead consumes 5 bytes in the types.PublishRequest message that is sent to the backend.

As an approximation, the batch size computation logic can be adjusted to add 5 bytes for each added message, which is enough to account for the size overhead that causes this issue. A more sophisticated approach could compute the exact overhead instead.
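
A simplified sketch of that approximation (not the actual thread.py code; names are illustrative):

# Worst-case framing overhead per message for messages up to ~10 MB:
# 1 tag byte + a length varint of at most 4 bytes.
_MAX_OVERHEAD_PER_MESSAGE = 5


def would_overflow(current_size, current_count, message, settings):
    """Return True if adding `message` could push the request over the limits."""
    new_size = current_size + message.ByteSize() + _MAX_OVERHEAD_PER_MESSAGE
    new_count = current_count + 1
    return new_size > settings.max_bytes or new_count >= settings.max_messages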


Update:
Size increase in bytes that a message adds to the publish request, depending on the message's byte size (for messages up to a byte size of 10e6):

msg.ByteSize()             overhead in PublishRequest
< 128                      2
>= 128, < 16_384           3
>= 16_384, < 2_097_152     4
>= 2_097_152               5

A possibly slower alternative (needs profiling) is computing the message's size contribution directly as already mentioned in the issue description:

new_size = self._size + types.PublishRequest(messages=[message]).ByteSize()

The benefit of this approach is that any future changes to the internal protobuf encoding are less likely to break it; it should therefore be preferred if its computational overhead is acceptable.
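
If profiling is needed, something along these lines would give a rough idea of the per-message cost of the exact computation (illustrative only; absolute numbers depend on message size and machine):

import timeit

from google.cloud.pubsub_v1 import types

msg = types.PubsubMessage(data=b"x" * 1024)

exact = timeit.timeit(
    lambda: types.PublishRequest(messages=[msg]).ByteSize(), number=10_000
)
approx = timeit.timeit(lambda: msg.ByteSize(), number=10_000)
print(exact, approx)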
