
Refactor/Prepare CConnman for upcoming epoll support #3432

Merged (8 commits) on Apr 19, 2020

Conversation

@codablock commented Apr 17, 2020

This is a set of refactorings that prepare us for the upcoming epoll support. I plan to use epoll in "edge triggered" (vs. level triggered) mode to reduce the number of events as much as possible.

Even though select/poll don't have the concept of edge/level triggered modes, their behavior can be compared to level triggered mode. This means that if a socket with pending read data is passed in, select/poll will always return immediately, even if the data is not fully read from the socket. The same applies when waiting for sockets to become sendable: select/poll will keep returning immediately until the socket is no longer sendable. The old CConnman code relied on this behavior and thus always passed all sockets into select/poll, even if it did not drain the receive buffers.

Edge triggered mode (only epoll supports it) works differently. When it signals that a socket is readable, it won't ever re-signal for that same socket until the receive buffer is drained. Same with sendable sockets. This behavior would immediately lead to epoll blocking forever when one iteration of SocketHandler() doesn't drain the read buffer.
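
For context, here is a minimal, hypothetical illustration of what edge-triggered registration looks like and what it implies for the handler. This is not code from this PR; the function name is made up:

```cpp
#include <sys/epoll.h>

// With EPOLLET, epoll_wait() reports a socket only when its state changes
// (new data arrives / send buffer frees up). A socket whose receive buffer
// was left partially unread is NOT reported again, so the handler must
// either drain it until EAGAIN or remember that data is still pending --
// the latter is the approach this PR prepares CConnman for.
void RegisterEdgeTriggered(int epoll_fd, int socket_fd)
{
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET; // edge-triggered read/write readiness
    ev.data.fd = socket_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, socket_fd, &ev);
}
```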

The refactorings in this PR change SocketHandler() and friends to not assume that select/poll will return immediately if the last recv did not drain the buffers. This is done by keeping track of which peers are receivable/sendable and then only selecting/polling for peers that are not already known to be receivable/sendable. The receivable/sendable state is then reset accordingly when read buffers get drained or send buffers get full (recv()/send() return less than requested).

With these refactorings in place, implementing epoll support with edge triggered mode becomes a no-brainer.
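
Roughly, the bookkeeping looks like the following sketch (illustrative only, with simplified stand-in types; the real implementation lives in CConnman::SocketHandler() with proper locking and error handling). Only sockets whose readiness is unknown are handed to select, and a flag is set once select reports them ready:

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <algorithm>
#include <vector>

struct NodeState {
    int fd{-1};
    bool fHasRecvData{false};  // select/poll said "readable" and we haven't drained it yet
    bool fCanSendData{false};  // select/poll said "writable" and we haven't filled it yet
};

void SocketHandlerIteration(std::vector<NodeState>& nodes)
{
    fd_set recv_set, send_set;
    FD_ZERO(&recv_set);
    FD_ZERO(&send_set);
    int max_fd = 0;
    for (const NodeState& node : nodes) {
        // Only select for states we don't already know about.
        if (!node.fHasRecvData) FD_SET(node.fd, &recv_set);
        if (!node.fCanSendData) FD_SET(node.fd, &send_set);
        max_fd = std::max(max_fd, node.fd);
    }

    timeval timeout{0, 50 * 1000}; // short timeout so known-ready sockets are still serviced
    if (select(max_fd + 1, &recv_set, &send_set, nullptr, &timeout) > 0) {
        for (NodeState& node : nodes) {
            if (FD_ISSET(node.fd, &recv_set)) node.fHasRecvData = true;
            if (FD_ISSET(node.fd, &send_set)) node.fCanSendData = true;
        }
    }

    for (NodeState& node : nodes) {
        if (!node.fHasRecvData) continue;
        char buf[0x10000];
        ssize_t n = recv(node.fd, buf, sizeof(buf), MSG_DONTWAIT);
        // ... hand any received bytes to the message handler ...
        if (n < (ssize_t)sizeof(buf)) {
            node.fHasRecvData = false; // buffer drained (or error): select for it again next time
        }
        // Sending works analogously: fCanSendData is cleared when send()
        // accepts fewer bytes than offered.
    }
}
```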

@codablock added this to the 16 milestone Apr 17, 2020
@PastaPastaPasta (Member) left a comment

Concept ACK, obviously backports scare me, but it doesn't look too bad, like you've said.

Regarding the network side, I don't know too much, so the best I'm able to say is this diff probably doesn't send my private keys to a random server 😂

I will test this later today and report back how it goes

Oh, also, a nit and a thought

                nBytes = recv(pnode->hSocket, pchBuf, sizeof(pchBuf), MSG_DONTWAIT);
            }
            if (nBytes > 0)
            {

nit: since we moved this code, we technically should fix the formatting.

src/net.h (review comment resolved)
@PastaPastaPasta (Member)

I got two test failures locally, not sure if it's just me. https://gist.github.com/PastaPastaPasta/517ebecd2566fe98e77832bed3cd8c28

@PastaPastaPasta (Member)

rerunning sendheaders.py and llmq-chainlocks.py a bunch of times, I'm unable to reproduce the error.

@PastaPastaPasta (Member)

Also see https://gitlab.com/PastaPastaPasta/dash/pipelines/137287226 where I am getting (different) test failures on both 32-bit and 64-bit.

@codablock force-pushed the pr_refactor_sockethandler branch from e97495e to a0832e3 on April 17, 2020 20:14
@codablock (Author)

Rebased on develop to fix conflicts.

@PastaPastaPasta The sendheaders.py failures happen sporadically on develop as well, so these are not related to this PR. I've never seen the llmq-chainlocks.py failure however, so I'm unsure if it's related. If you see it again, can you also upload all logs?

One of the failures (p2p-fullblocktest.py) in https://gitlab.com/PastaPastaPasta/dash/pipelines/137287226 is due to running out of memory. I can see that you're running it on free GitLab runners; these are not powerful enough to run so many tests in parallel. The other failure seems to be something different... investigating.

@codablock (Author)

@PastaPastaPasta I need all debug.log files from all nodes to debug the llmq-chainlocks.py failure.

Meanwhile I found the reason for the failure in llmq-simplepose.py and added a commit to fix it.

@codablock force-pushed the pr_refactor_sockethandler branch from 43c67fe to e944387 on April 17, 2020 21:20
@codablock (Author)

@PastaPastaPasta The llmq-chainlocks.py failures were unrelated and should be fixed via #3435

@UdjinM6 commented Apr 18, 2020

Needs rebase

@codablock force-pushed the pr_refactor_sockethandler branch from e944387 to a742ce5 on April 18, 2020 10:41
@UdjinM6 previously approved these changes Apr 18, 2020
@UdjinM6 left a comment

utACK

@UdjinM6 dismissed their stale review April 18, 2020 12:43

Actually, wait. I have random test failures locally. Will test a bit.

@PastaPastaPasta (Member) commented Apr 18, 2020

I got a p2p-instantsend.py failure (and an abandonconflict.py failure likely fixed by #3436), see https://gist.github.com/PastaPastaPasta/cb30bb821d36432b455242545e443279. I can provide the full datadir if needed. Unable to reproduce this one; it might have been caused by my computer timing out or something.

This allows us to check for pending messages without locking cs_vSend
Instead of selecting every socket in every SocketHandler iteration, we will
now track which nodes are known to have pending receivable data and/or
have empty send buffers.

Each time recv fails to fill a whole receive buffer, fHasRecvData is
set to false so that the socket is added to the receive select set
in the next iteration. When that socket is signalled through select/poll,
fHasRecvData is set to true again and remains true until a future recv
again fails to fill the buffer.

Each time send fails to send a full message, fCanSendData is set to false
so that the socket is added to the send select set in the next iteration.

At the same time, nodes which have pending messages to send are tracked
in mapNodesWithDataToSend, so that SocketHandler knows for which nodes
SocketSendData must be invoked.
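
A condensed, self-contained sketch of the flag handling described in this commit message. The struct and helper below are hypothetical simplifications; fHasRecvData, fCanSendData and mapNodesWithDataToSend are the members named above, everything else is illustrative:

```cpp
#include <sys/socket.h>
#include <deque>
#include <map>
#include <vector>

struct Node {
    int hSocket{-1};
    bool fHasRecvData{false};
    bool fCanSendData{false};
    std::deque<std::vector<unsigned char>> vSendMsg; // queued outbound messages
};

std::map<int /*node id*/, Node*> mapNodesWithDataToSend;

void HandleNodeIO(int node_id, Node* pnode)
{
    // Receive side: fHasRecvData stays true until recv() returns less than requested.
    char pchBuf[0x10000];
    ssize_t nBytes = recv(pnode->hSocket, pchBuf, sizeof(pchBuf), MSG_DONTWAIT);
    if (nBytes < (ssize_t)sizeof(pchBuf)) {
        pnode->fHasRecvData = false; // drained: put the socket back into the receive select set
    }

    // Send side: fCanSendData stays true until send() accepts fewer bytes than offered.
    if (pnode->fCanSendData && !pnode->vSendMsg.empty()) {
        const std::vector<unsigned char>& msg = pnode->vSendMsg.front();
        ssize_t nSent = send(pnode->hSocket, msg.data(), msg.size(), MSG_DONTWAIT);
        if (nSent < (ssize_t)msg.size()) {
            pnode->fCanSendData = false; // kernel send buffer full: select for writability again
        }
        // (removal of fully sent messages from vSendMsg omitted for brevity)
    }

    // Remember nodes that still have queued messages so the socket handler
    // knows it must attempt SocketSendData-style flushing for them.
    if (!pnode->vSendMsg.empty()) {
        mapNodesWithDataToSend.emplace(node_id, pnode);
    } else {
        mapNodesWithDataToSend.erase(node_id);
    }
}
```
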
@codablock force-pushed the pr_refactor_sockethandler branch from a742ce5 to 47af42a on April 18, 2020 19:05
@UdjinM6 left a comment

Looks like it was actually #3433 that caused issues for me and not this PR itself. #3444 should fix that.

ACK

@codablock (Author)

@UdjinM6 Do you have logs/datadirs of these failures? Would be interesting to know why a timeout of 1000ms was an issue.

@UdjinM6 commented Apr 19, 2020

Not much to see in the logs, just a bunch of "socket select error Invalid argument (22)" messages and no connections.

@UdjinM6 commented Apr 19, 2020

It probably has something to do with "... kernel scheduling delays mean that the blocking interval may overrun by a small amount" (http://man7.org/linux/man-pages/man2/select.2.html) and https://github.com/dashpay/dash/blob/develop/src/net.cpp#L93-L94.
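
One plausible reading of that theory (purely hypothetical, not taken from net.cpp): if the select timeout is derived from a deadline and the previous wait overran that deadline due to scheduling delays, the remaining interval goes negative, and select() rejects a negative timeval with EINVAL, which would match the "Invalid argument (22)" messages. Clamping at zero avoids it:

```cpp
#include <sys/select.h>
#include <algorithm>
#include <chrono>

// Hypothetical helper: compute the timeval for select() from a deadline,
// clamped at zero so that scheduling-delay overruns can never produce a
// negative (and therefore invalid) timeout.
timeval RemainingTimeout(std::chrono::steady_clock::time_point deadline)
{
    const auto now = std::chrono::steady_clock::now();
    auto remaining = std::chrono::duration_cast<std::chrono::microseconds>(deadline - now);
    remaining = std::max(remaining, std::chrono::microseconds{0});

    timeval tv;
    tv.tv_sec = remaining.count() / 1000000;
    tv.tv_usec = remaining.count() % 1000000;
    return tv;
}
```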

@codablock merged commit f5f4ccb into dashpay:develop Apr 19, 2020
@codablock deleted the pr_refactor_sockethandler branch April 19, 2020 12:28