Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky capture service start #20024

Merged
merged 4 commits into from
Aug 29, 2024

Conversation

zbud-msft
Copy link
Contributor

@zbud-msft zbud-msft commented Aug 27, 2024

Why I did it

Dropping control character (message sent when XSUB connects to XPUB as part of ZMQ Proxy setup to notify that subscription has been made) in do capture has been flaky since control character is not guaranteed to be the first message sent if there are events (like event-down-ctr) being published to XSUB.

Scenarios

  1. Control character is sent and is first message when starting capture service

eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1
eventd#eventd#eventd: :- do_capture: Received subscription message when XSUB connects to XPUB

  1. Events like event-down ctr is sent before control character

eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 17 sonic-events-host
eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 0 0 4 0 0 0 1 d 103 {"sonic-events-host:event-stopped-ctr":{"ctr_name":"EVENTD","timestamp":"2024-08-27T00:02:51.407518Z"}} 1 r 36 3357542f-bae1-458f-a804-660e620d21f5 1 s 1 9 1 t 19 1724716971407591080
heartbeat_ctrl: Set heartbeat_ctrl pause=1
do_capture: Received subscription message when XSUB connects to XPUB

  1. Control character is not sent at all

eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1

  1. Control character is delayed and not caught when starting capture service, but is then caught after causing deserialize error.

do_capture: Receiving event from source: 22 serialization::archive 18 17 sonic-events-host, will read second part of event
deserialize: deserialize Failed: input stream errorstr[0:64]:(#001) data type: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&
zmq_read_part: Failed to deserialize part rc=-2
zmq_read_part: last:errno=11
zmq_message_read: Failure to read part1 rc=-2
zmq_message_read: last:errno=11

We can cover these scenarios by just dropping the control character inside zmq_message_read as part of events_common in swsscommon (different PR). In this PR we will remove such handling logic and make sure that empty events that will be sent by control character are ignored.

Work item tracking
  • Microsoft ADO (number only):28728116

How I did it

Remove logic for handling control character

How to verify it

UT and sonic-mgmt test cases.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Copy link

linux-foundation-easycla bot commented Aug 27, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@zbud-msft zbud-msft force-pushed the capture_service_logic_fix branch from fe2cf8d to 967caa8 Compare August 27, 2024 00:54
@zbud-msft zbud-msft marked this pull request as ready for review August 27, 2024 00:54
@zbud-msft zbud-msft requested a review from lguohan as a code owner August 27, 2024 00:54
@qiluo-msft qiluo-msft requested a review from liuh-80 August 27, 2024 01:29
int more = 0;
size_t more_size = sizeof(more);
zmq_getsockopt(cap_sub_sock, ZMQ_RCVMORE, &more, &more_size);
if(more) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a loop here to read all message?

Also, if there is an ADO please update the PR description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed logic to not account for capture service capturing control character. Will be done in zmq_read_part

sonic-net/sonic-swss-common#906

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#Closed



if(rc == 1) { // Expected case to receive subscription message as very first message
SWSS_LOG_INFO("Received subscription message when XSUB connects to XPUB");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are too much "received subscription message" log, suggest change to SWSS_LOG_DEBUG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will happen only once during process lifetime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#Closed

@zbud-msft
Copy link
Contributor Author

Relies on sonic-net/sonic-swss-common#906

@qiluo-msft qiluo-msft merged commit a5a2cf1 into sonic-net:master Aug 29, 2024
23 checks passed
@zbud-msft
Copy link
Contributor Author

@StormLiangMS Can we backport this to 202311?

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Sep 4, 2024
#### Why I did it

Dropping control character (message sent when XSUB connects to XPUB as part of ZMQ Proxy setup to notify that subscription has been made) in do capture has been flaky since control character is not guaranteed to be the first message sent if there are events (like event-down-ctr) being published to XSUB.

Scenarios

1) Control character is sent and is first message when starting capture service

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`eventd#eventd#eventd: :- do_capture: Received subscription message when XSUB connects to XPUB`

2) Events like event-down ctr is sent before control character

`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 17 sonic-events-host`
`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 0 0 4 0 0 0 1 d 103 {"sonic-events-host:event-stopped-ctr":{"ctr_name":"EVENTD","timestamp":"2024-08-27T00:02:51.407518Z"}} 1 r 36 3357542f-bae1-458f-a804-660e620d21f5 1 s 1 9 1 t 19 1724716971407591080`
`heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`do_capture: Received subscription message when XSUB connects to XPUB`

3) Control character is not sent at all

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`

4) Control character is delayed and not caught when starting capture service, but is then caught after causing deserialize error.

`do_capture: Receiving event from source: 22 serialization::archive 18 17 sonic-events-host, will read second part of event`
`deserialize: deserialize Failed: input stream errorstr[0:64]:(sonic-net#1) data type: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&`
`zmq_read_part: Failed to deserialize part rc=-2`
`zmq_read_part: last:errno=11`
`zmq_message_read: Failure to read part1 rc=-2`
`zmq_message_read: last:errno=11`

We can cover these scenarios by just dropping the control character inside zmq_message_read as part of events_common in swsscommon (different PR). In this PR we will remove such handling logic and make sure that empty events that will be sent by control character are ignored.

##### Work item tracking
- Microsoft ADO **(number only)**:28728116

#### How I did it

Remove logic for handling control character

#### How to verify it

UT and sonic-mgmt test cases.
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202311: #20130

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Sep 4, 2024
#### Why I did it

Dropping control character (message sent when XSUB connects to XPUB as part of ZMQ Proxy setup to notify that subscription has been made) in do capture has been flaky since control character is not guaranteed to be the first message sent if there are events (like event-down-ctr) being published to XSUB.

Scenarios

1) Control character is sent and is first message when starting capture service

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`eventd#eventd#eventd: :- do_capture: Received subscription message when XSUB connects to XPUB`

2) Events like event-down ctr is sent before control character

`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 17 sonic-events-host`
`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 0 0 4 0 0 0 1 d 103 {"sonic-events-host:event-stopped-ctr":{"ctr_name":"EVENTD","timestamp":"2024-08-27T00:02:51.407518Z"}} 1 r 36 3357542f-bae1-458f-a804-660e620d21f5 1 s 1 9 1 t 19 1724716971407591080`
`heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`do_capture: Received subscription message when XSUB connects to XPUB`

3) Control character is not sent at all

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`

4) Control character is delayed and not caught when starting capture service, but is then caught after causing deserialize error.

`do_capture: Receiving event from source: 22 serialization::archive 18 17 sonic-events-host, will read second part of event`
`deserialize: deserialize Failed: input stream errorstr[0:64]:(sonic-net#1) data type: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&`
`zmq_read_part: Failed to deserialize part rc=-2`
`zmq_read_part: last:errno=11`
`zmq_message_read: Failure to read part1 rc=-2`
`zmq_message_read: last:errno=11`

We can cover these scenarios by just dropping the control character inside zmq_message_read as part of events_common in swsscommon (different PR). In this PR we will remove such handling logic and make sure that empty events that will be sent by control character are ignored.

##### Work item tracking
- Microsoft ADO **(number only)**:28728116

#### How I did it

Remove logic for handling control character

#### How to verify it

UT and sonic-mgmt test cases.
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202405: #20131

mssonicbld pushed a commit that referenced this pull request Sep 4, 2024
#### Why I did it

Dropping control character (message sent when XSUB connects to XPUB as part of ZMQ Proxy setup to notify that subscription has been made) in do capture has been flaky since control character is not guaranteed to be the first message sent if there are events (like event-down-ctr) being published to XSUB.

Scenarios

1) Control character is sent and is first message when starting capture service

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`eventd#eventd#eventd: :- do_capture: Received subscription message when XSUB connects to XPUB`

2) Events like event-down ctr is sent before control character

`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 17 sonic-events-host`
`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 0 0 4 0 0 0 1 d 103 {"sonic-events-host:event-stopped-ctr":{"ctr_name":"EVENTD","timestamp":"2024-08-27T00:02:51.407518Z"}} 1 r 36 3357542f-bae1-458f-a804-660e620d21f5 1 s 1 9 1 t 19 1724716971407591080`
`heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`do_capture: Received subscription message when XSUB connects to XPUB`

3) Control character is not sent at all

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`

4) Control character is delayed and not caught when starting capture service, but is then caught after causing deserialize error.

`do_capture: Receiving event from source: 22 serialization::archive 18 17 sonic-events-host, will read second part of event`
`deserialize: deserialize Failed: input stream errorstr[0:64]:(#1) data type: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&`
`zmq_read_part: Failed to deserialize part rc=-2`
`zmq_read_part: last:errno=11`
`zmq_message_read: Failure to read part1 rc=-2`
`zmq_message_read: last:errno=11`

We can cover these scenarios by just dropping the control character inside zmq_message_read as part of events_common in swsscommon (different PR). In this PR we will remove such handling logic and make sure that empty events that will be sent by control character are ignored.

##### Work item tracking
- Microsoft ADO **(number only)**:28728116

#### How I did it

Remove logic for handling control character

#### How to verify it

UT and sonic-mgmt test cases.
mssonicbld pushed a commit that referenced this pull request Sep 4, 2024
#### Why I did it

Dropping control character (message sent when XSUB connects to XPUB as part of ZMQ Proxy setup to notify that subscription has been made) in do capture has been flaky since control character is not guaranteed to be the first message sent if there are events (like event-down-ctr) being published to XSUB.

Scenarios

1) Control character is sent and is first message when starting capture service

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`eventd#eventd#eventd: :- do_capture: Received subscription message when XSUB connects to XPUB`

2) Events like event-down ctr is sent before control character

`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 17 sonic-events-host`
`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 0 0 4 0 0 0 1 d 103 {"sonic-events-host:event-stopped-ctr":{"ctr_name":"EVENTD","timestamp":"2024-08-27T00:02:51.407518Z"}} 1 r 36 3357542f-bae1-458f-a804-660e620d21f5 1 s 1 9 1 t 19 1724716971407591080`
`heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`do_capture: Received subscription message when XSUB connects to XPUB`

3) Control character is not sent at all

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`

4) Control character is delayed and not caught when starting capture service, but is then caught after causing deserialize error.

`do_capture: Receiving event from source: 22 serialization::archive 18 17 sonic-events-host, will read second part of event`
`deserialize: deserialize Failed: input stream errorstr[0:64]:(#1) data type: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&`
`zmq_read_part: Failed to deserialize part rc=-2`
`zmq_read_part: last:errno=11`
`zmq_message_read: Failure to read part1 rc=-2`
`zmq_message_read: last:errno=11`

We can cover these scenarios by just dropping the control character inside zmq_message_read as part of events_common in swsscommon (different PR). In this PR we will remove such handling logic and make sure that empty events that will be sent by control character are ignored.

##### Work item tracking
- Microsoft ADO **(number only)**:28728116

#### How I did it

Remove logic for handling control character

#### How to verify it

UT and sonic-mgmt test cases.
vvolam pushed a commit to vvolam/sonic-buildimage that referenced this pull request Sep 12, 2024
#### Why I did it

Dropping control character (message sent when XSUB connects to XPUB as part of ZMQ Proxy setup to notify that subscription has been made) in do capture has been flaky since control character is not guaranteed to be the first message sent if there are events (like event-down-ctr) being published to XSUB.

Scenarios

1) Control character is sent and is first message when starting capture service

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`eventd#eventd#eventd: :- do_capture: Received subscription message when XSUB connects to XPUB`

2) Events like event-down ctr is sent before control character

`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 17 sonic-events-host`
`eventd#eventd#eventd: :- run: Dropping Message: 22 serialization::archive 18 0 0 4 0 0 0 1 d 103 {"sonic-events-host:event-stopped-ctr":{"ctr_name":"EVENTD","timestamp":"2024-08-27T00:02:51.407518Z"}} 1 r 36 3357542f-bae1-458f-a804-660e620d21f5 1 s 1 9 1 t 19 1724716971407591080`
`heartbeat_ctrl: Set heartbeat_ctrl pause=1`
`do_capture: Received subscription message when XSUB connects to XPUB`

3) Control character is not sent at all

`eventd#eventd#eventd: :- heartbeat_ctrl: Set heartbeat_ctrl pause=1`

4) Control character is delayed and not caught when starting capture service, but is then caught after causing deserialize error.

`do_capture: Receiving event from source: 22 serialization::archive 18 17 sonic-events-host, will read second part of event`
`deserialize: deserialize Failed: input stream errorstr[0:64]:(sonic-net#1) data type: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&`
`zmq_read_part: Failed to deserialize part rc=-2`
`zmq_read_part: last:errno=11`
`zmq_message_read: Failure to read part1 rc=-2`
`zmq_message_read: last:errno=11`

We can cover these scenarios by just dropping the control character inside zmq_message_read as part of events_common in swsscommon (different PR). In this PR we will remove such handling logic and make sure that empty events that will be sent by control character are ignored.

##### Work item tracking
- Microsoft ADO **(number only)**:28728116

#### How I did it

Remove logic for handling control character

#### How to verify it

UT and sonic-mgmt test cases.
qiluo-msft pushed a commit that referenced this pull request Dec 24, 2024
Why I did it
Fixes #21140

In #20024 and sonic-net/sonic-swss-common#906, we made the change that when a control character is read, zmq_message_read will return with rc 0, which will create an empty internal event.

As part of eventd design, empty structured events are dropped, which leads to control characters being a no-op which is the expected behavior.

In UT, we are still always expecting a control character to be the first message to be read by zmq which is not the case as described in #20024. We are also not ignoring empty events that are read which is being done by eventd.

With this change, control characters are properly ignored if it does after the first test event.

How I did it
Ignore empty structured events and not expect first zmq_message_read to be control character.

How to verify it
Manual test/pipeline
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Dec 24, 2024
Why I did it
Fixes sonic-net#21140

In sonic-net#20024 and sonic-net/sonic-swss-common#906, we made the change that when a control character is read, zmq_message_read will return with rc 0, which will create an empty internal event.

As part of eventd design, empty structured events are dropped, which leads to control characters being a no-op which is the expected behavior.

In UT, we are still always expecting a control character to be the first message to be read by zmq which is not the case as described in sonic-net#20024. We are also not ignoring empty events that are read which is being done by eventd.

With this change, control characters are properly ignored if it does after the first test event.

How I did it
Ignore empty structured events and not expect first zmq_message_read to be control character.

How to verify it
Manual test/pipeline
mssonicbld pushed a commit that referenced this pull request Dec 25, 2024
Why I did it
Fixes #21140

In #20024 and sonic-net/sonic-swss-common#906, we made the change that when a control character is read, zmq_message_read will return with rc 0, which will create an empty internal event.

As part of eventd design, empty structured events are dropped, which leads to control characters being a no-op which is the expected behavior.

In UT, we are still always expecting a control character to be the first message to be read by zmq which is not the case as described in #20024. We are also not ignoring empty events that are read which is being done by eventd.

With this change, control characters are properly ignored if it does after the first test event.

How I did it
Ignore empty structured events and not expect first zmq_message_read to be control character.

How to verify it
Manual test/pipeline
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants