
Added reader rewind to remote payload processor #112

Closed
RobGeada wants to merge 3 commits

Conversation

@RobGeada commented Oct 16, 2023

Motivation

Addresses #111.

Modifications

Adds a byteBuf readerIndex rewind before payloads are parsed in the RemotePayloadProcessor, which handles the concurrency issues arising from a too-early read of a queued payload. This is a workaround that mitigates the effects of a still-unidentified race condition in ByteBuf reading, which should likely still be addressed, but it works in the interim.
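
For illustration, a minimal sketch of the change, assuming the rewind sits at the top of RemotePayloadProcessor.encodeBinaryToString() as described in the investigation on #111 (the surrounding encoding logic is elided):

```java
// Inside RemotePayloadProcessor.encodeBinaryToString(ByteBuf byteBuf):
// The queued buffer may already have been read elsewhere, leaving its
// readerIndex at capacity; rewind it so the full payload is encoded.
byteBuf = byteBuf.readerIndex(0);
// ... existing encoding of the buffer contents follows ...
```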

Result

Null response payloads no longer occur in high-process-time scenarios

@kserve-oss-bot (Collaborator) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RobGeada
To complete the pull request process, please assign tjohnson31415 after the PR has been reviewed.
You can assign the PR to them by writing /assign @tjohnson31415 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ckadner (Member) left a comment

Hi @RobGeada -- could this change be tested in a unit test?

@RobGeada (Author) commented

@ckadner a mock test has been added, checking whether the payload processor still functions when the payload has been read too early. It's hard to actually verify the payloads sent by the processor without creating a mock receiver, which seems perhaps too heavy for a single unit test.
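
For illustration, a minimal sketch of how such a test might simulate the premature read, assuming JUnit and Netty's Unpooled buffers (the class and test names here are hypothetical; the payload literal mirrors the snippet shown in the review below):

```java
import static org.junit.Assert.assertFalse;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import org.junit.Test;

public class RemotePayloadProcessorRewindTest {

    @Test
    public void encodeBinaryToStringHandlesPrematurelyReadBuffer() {
        ByteBuf byteBuf = Unpooled.wrappedBuffer("{[0, 0.1, 2.3, 4, 5.6]}".getBytes());
        // Simulate the race described in #111: the buffer was already read once,
        // so its readerIndex sits at capacity and no bytes appear readable.
        byteBuf.readerIndex(byteBuf.capacity());
        String encoded = RemotePayloadProcessor.encodeBinaryToString(byteBuf);
        // With the rewind in place, the full payload should still be encoded.
        assertFalse(encoded.isEmpty());
    }
}
```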

@RobGeada (Author) commented

@ckadner any updates?

@ckadner (Member) left a comment

Thanks for adding the unit tests.

Going back to your investigation:

> It looks like the response bytebufs are read too early somewhere, causing their readerIndex to equal their capacity at processing time. This read occurs somewhere after their addition to the queue in the AsyncPayloadProcessor but before payloads.take() is called. A hacky patch is to call byteBuf = byteBuf.readerIndex(0); as the first line in RemotePayloadProcessor.encodeBinaryToString() to reset their reader index, and indeed this prevents the issue from arising.

Could you add a comment above the line that resets the readerIndex to explain why it is necessary?

Should we spend the time to find out where the response bytebufs are read too early, and why? If that read happens erroneously, it might cause problems in other places as well.

```java
    ByteBuf byteBuf = Unpooled.wrappedBuffer("{[0, 0.1, 2.3, 4, 5.6]}".getBytes());
    String encodedString = RemotePayloadProcessor.encodeBinaryToString(byteBuf);
    assertFalse(encodedString.isEmpty());
  }
}
```
A member left a review comment on the lines above:

needs a new line at the end of file

@njhill (Member) commented Nov 16, 2023

Thanks @RobGeada. I've opened another PR, #120, which will hopefully address this properly; would you mind reviewing that?

It would also be useful to have a unit test for this, but the tests included here don't exercise the actual bug. Ideally we'd have a test that actually runs a local modelmesh with an asyncprocessor configured and can trigger the problem (other unit tests already run modelmesh to test other stuff and could hopefully be used as a starting point and adapted).

@ckadner (Member) commented Nov 22, 2023

Should this PR be closed in favor of PR #120?

@njhill (Member) commented Nov 22, 2023

> Should this PR be closed in favor of PR #120?

@ckadner yes! @RobGeada it would be great if you could review #120.
