Reset expiry of entries in the neighbor cache on packet reception #966

korken89 · 2024-08-02T08:51:30Z

Tentative fix for #965

This does solve the issue in my testing, but I'm not sure if it's the "right thing to do".

thvdveld · 2024-08-02T09:30:35Z

I'm not sure either if this is the correct thing to do, I'll have to look into it.

I don't think you need to pass the timestamp every time, you can just call self.now() I think.

korken89 · 2024-08-02T09:31:21Z

Thanks for the pointer, I'll fix!

korken89 · 2024-08-02T09:43:03Z

@thvdveld Another way to solve this is to instead mark the entry as "in use" by a connection to not only evict on "the one with the shortest timeout".
So it would instead be "evict shortest timeout that is not used for a connection" and if all are used by connections then go by the original rule.

This would be even more robust.
If I really push it with this fix applied I can still make the connection unstable.

thvdveld · 2024-08-02T10:07:11Z

From what I understand, Linux [1] does not remove the one with the shortest timeout, they just return an error and drop the packet. However, they have a neighbour cache size of 1024 neighbours. We can also increase the size by using the iface-neighbor-cache-count-1024 feature flag instead of the default iface-neighbor-cache-count-4. (Have you tried this btw?)

What Linux also does is updating the used flag [2], which this PR handles.

Linux keeps track of just more than reachable, etc. [3], which is based on Neighbour Unreachability Detection I think, specified in RFC 4861 Section 7.3. In the end, we should update our implementation to follow this. I know this is for IPv6, but I think they do something similar for IPv4.

I think that for now, updating the expiration timestamp is good enough. What do you think?

[1] https://github.com/torvalds/linux/blob/master/net/core/neighbour.c#L469-L524
[2] https://github.com/torvalds/linux/blob/master/net/core/neighbour.c#L501
[3] https://github.com/torvalds/linux/blob/master/net/core/neighbour.c#L1280-L1490

Dirbaio · 2024-08-02T11:22:19Z

I agree we need some kind of "refreshing" MACs to prevent them from expiring while in active use. I'm not sure what's the best way though. There's historically been a few nasty bugs around this, about bad cache entries in broken networks. For example this 6210612 .

I'd be comfortable with this if we made the check stricter. Update expiration if all these hold:

Dest IP is unicast and ours (ie packet is explicitly directed at us, we ignore broadcast/multicast. It feels to me like these are more likely to be broken, like the above linked issue)
Src IP matches the neighbor cache
Src MAC matches the neighbor cache

I think that should fix your issues while keeping the chances of brokenness low.

The ideal fix would be to get "forward progress" confirmation from higher layer (e.g. TCP ACKs coming in for new data) like Linux does, but that's a more involved refactor.

- Check for unicast destination - Source IP matches cache - Source hardware address matches cache

korken89 · 2024-08-02T13:48:26Z

Thanks for the feedback and analysis! Great to know what the steps towards a complete fix would be.
I think I have implemented all the requested parts you asked for @Dirbaio, have a look at your leisure.

korken89 · 2024-08-02T13:51:24Z

@thvdveld on the comment of iface-neighbor-cache-count-1024, I have not tried this. However I did increase it now after testing these fixes as a small number can still cause issues.
I set it to 32 in my testing, and could go higher.

1024 seems like a lot though, wouldn't that eat like 20kB of RAM?

korken89 · 2024-08-05T08:58:11Z

I've test-run this over night in the old system that broke about every 30s with the old iface-neighbor-cache-count-4.
Now there was almost no failures at all over the ~15 hour test period.

thvdveld · 2024-08-05T09:03:18Z

@thvdveld on the comment of iface-neighbor-cache-count-1024, I have not tried this. However I did increase it now after testing these fixes as a small number can still cause issues. I set it to 32 in my testing, and could go higher.

1024 seems like a lot though, wouldn't that eat like 20kB of RAM?

1024 is indeed a lot when used on embedded. I looked at Contiki-NG and Zephyr, and they do 16 and 8 respectively. Maybe we should make it 8 by default in smoltcp as well.

korken89 · 2024-08-05T09:18:35Z

Yeah, the 4 is a bit too conservative for a good default. Do you have any preference @Dirbaio ?

korken89 · 2024-08-05T09:24:10Z

I tentatively updated the default to 8.

EDIT: Sorry misread the defaults.

Dirbaio

thank you!

korken89 force-pushed the neighbor-cache-expiry-fix branch 3 times, most recently from 5aee679 to 11941a5 Compare August 2, 2024 09:15

Reset expiry of entries in the neighbor cache on packet reception

fd75877

korken89 force-pushed the neighbor-cache-expiry-fix branch from 11941a5 to fd75877 Compare August 2, 2024 09:38

korken89 force-pushed the neighbor-cache-expiry-fix branch from b6a0393 to 756b355 Compare August 2, 2024 13:43

Updated based on dirbaio's comments

a11428d

- Check for unicast destination - Source IP matches cache - Source hardware address matches cache

korken89 force-pushed the neighbor-cache-expiry-fix branch from 756b355 to a11428d Compare August 2, 2024 13:45

Update default neighbor cache size to 8 entries

8e97c95

korken89 force-pushed the neighbor-cache-expiry-fix branch from f008d21 to 8e97c95 Compare August 5, 2024 09:28

Dirbaio approved these changes Aug 5, 2024

View reviewed changes

Dirbaio added this pull request to the merge queue Aug 5, 2024

Merged via the queue into smoltcp-rs:main with commit 8be46b9 Aug 5, 2024
9 checks passed

korken89 deleted the neighbor-cache-expiry-fix branch August 5, 2024 11:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reset expiry of entries in the neighbor cache on packet reception #966

Reset expiry of entries in the neighbor cache on packet reception #966

korken89 commented Aug 2, 2024

thvdveld commented Aug 2, 2024

korken89 commented Aug 2, 2024

korken89 commented Aug 2, 2024

thvdveld commented Aug 2, 2024 •

edited

Loading

Dirbaio commented Aug 2, 2024

korken89 commented Aug 2, 2024

korken89 commented Aug 2, 2024

korken89 commented Aug 5, 2024

thvdveld commented Aug 5, 2024

korken89 commented Aug 5, 2024

korken89 commented Aug 5, 2024 •

edited

Loading

Dirbaio left a comment

Reset expiry of entries in the neighbor cache on packet reception #966

Reset expiry of entries in the neighbor cache on packet reception #966

Conversation

korken89 commented Aug 2, 2024

thvdveld commented Aug 2, 2024

korken89 commented Aug 2, 2024

korken89 commented Aug 2, 2024

thvdveld commented Aug 2, 2024 • edited Loading

Dirbaio commented Aug 2, 2024

korken89 commented Aug 2, 2024

korken89 commented Aug 2, 2024

korken89 commented Aug 5, 2024

thvdveld commented Aug 5, 2024

korken89 commented Aug 5, 2024

korken89 commented Aug 5, 2024 • edited Loading

Dirbaio left a comment

Choose a reason for hiding this comment

thvdveld commented Aug 2, 2024 •

edited

Loading

korken89 commented Aug 5, 2024 •

edited

Loading