Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libteam]: timerfd read() could return 0 fix #3393

Merged
merged 14 commits into from
Aug 29, 2019

Conversation

pavel-shirshov
Copy link
Contributor

- What I did
Fixed an issue with libteam. teamd is using timerfd for LACP timeouts. It's possible to see 0 as return from read() sometimes. I've added a fix, which will not produce an error in this case
- How I did it
As soon as read returns 0 on timerfd, just don't return the error, just skip the event.

- How to verify it
Build an image and run on your DUT

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@renukamanavalan
Copy link
Contributor

read(timerfd...) returning 0 could be handled as if timer fired, because you are reading, which implies that fd is selected for read, which in turn implies that timer has fired.

Something went wrong, between firing & reading, which made it return 0.

@pavel-shirshov
Copy link
Contributor Author

@renukamanavalan we could also say that something went wrong when the event was fired.
So there is nothing to read.

Also I chose this implementation because of https://github.com/facebook/folly/blob/master/folly/experimental/TimerFD.cpp#L83

@lguohan
Copy link
Collaborator

lguohan commented Aug 29, 2019

i tend to agree with renuka, in our case, probably better to treat 0 as timer expired. if you look at the callback, https://github.com/jpirko/libteam/blob/54f137c10579bf97800c61ebb13e732aa1d843e6/teamd/teamd_runner_lacp.c#L812

it is the periodical timeout, if it is expired, we'd better send a lacp packet instead of skipping one timeout.

@yxieca
Copy link
Contributor

yxieca commented Aug 29, 2019

Interesting! So questions are:

Will timer continue to fire if we treat read 0 bytes as fired v.s. ignored?

Will protocol partner ok with receiving more frequent handshakes than expected? (from warm/fast reboot, we know that they would be fine with a late one).

If clock jumped forward, then the timer might have fired earlier than it should be, if clocked jumped backwards, then the timer would have fired late. In the later case, having a catch up handshake might be desirable.

@renukamanavalan
Copy link
Contributor

Based on past debugging, "how you treat the "return == 0", does not affect the subsequent firing of timer."

@pavel-shirshov
Copy link
Contributor Author

@lguohan teamd register 2 timers for LACP (and one for eth_lw if I remember well). One LACP timer is 30 second timer, another one is 120 second timer. If the 30 sec timer expired sooner then expected it's more or less ok. We can send update sooner than 30 second. But if 120 second timer expires sooner than expected, it means drop portchannel down.
In my variant if we miss 30 sec, timer, we have two more chances to send a packet. If we miss 120 sec. timer it gives us a risk of missing LACP is down on the peer.
But to miss that peer shut down it's LACP we should miss:

  1. LACP update packet from the peer
  2. 120 sec. expiration timer.
    So I'd better go with my variant.

@yxieca yxieca merged commit 1c06ae6 into sonic-net:master Aug 29, 2019
yxieca pushed a commit that referenced this pull request Aug 29, 2019
* Update sonic-quagga submodule

* [libteam]: timerfd read() could return 0
@pavel-shirshov pavel-shirshov deleted the pavelsh/timerfd_fix branch August 29, 2019 23:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants