
Automatically restart push channel #127

Merged
merged 5 commits into master from feat/push-auto-retry on Dec 16, 2020

Conversation

dirkmc
Contributor

@dirkmc dirkmc commented Dec 11, 2020

Implementation of auto-restart behaviour described in filecoin-project/go-fil-markets#463 (comment)

If a "push" data transfer channel stalls while transferring data, attempt to reconnect to the other party and send a "restart" request for the channel.

Note that the backoff behaviour on dial already exists in the network layer.

This PR adds a pushChannelMonitor to the data-transfer manager. Each time the data-transfer manager opens a "push" data transfer channel, it adds the channel ID to the monitor.

Graphsync queues up data to be sent. The pushChannelMonitor periodically checks the amount of data queued against the amount sent. If the amount of pending data (queued - sent) is greater than the configured minimum amount over the configured interval (e.g. 1MB over 1s), the pushChannelMonitor assumes the transfer has stalled and attempts to send a "restart" request.
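
As an illustration of the approach described above (not the actual pushChannelMonitor code — the type and function names below are made up), a minimal Go sketch of the queued-vs-sent stall check might look like this:

```go
package main

import (
	"fmt"
	"time"
)

// channelStats is a hypothetical snapshot of a push channel's progress.
type channelStats struct {
	queued uint64 // bytes graphsync has queued for sending
	sent   uint64 // bytes actually sent so far
}

// monitorConfig mirrors the configuration idea in the PR description: if the
// pending data stays above minPendingBytes for a check interval, the transfer
// is treated as stalled.
type monitorConfig struct {
	checkInterval   time.Duration
	minPendingBytes uint64
}

// watchChannel polls the channel's stats once per interval and requests a
// restart when the pending data (queued - sent) exceeds the configured minimum.
func watchChannel(cfg monitorConfig, stats func() channelStats, restart func()) {
	ticker := time.NewTicker(cfg.checkInterval)
	defer ticker.Stop()

	for range ticker.C {
		s := stats()
		if pending := s.queued - s.sent; pending > cfg.minPendingBytes {
			// Transfer appears stalled: reconnect to the other party
			// and send a "restart" request for the channel.
			restart()
			return
		}
	}
}

func main() {
	cfg := monitorConfig{checkInterval: time.Second, minPendingBytes: 1 << 20} // 1MB over 1s
	stats := func() channelStats { return channelStats{queued: 4 << 20, sent: 1 << 20} }
	watchChannel(cfg, stats, func() { fmt.Println("restarting push channel") })
}
```

The real monitor also has to handle channel completion, errors, and shutdown; this sketch only shows the stall check itself.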

Member

@nonsense nonsense left a comment


lgtm

Collaborator

@hannahhoward hannahhoward left a comment


Something you should be aware of:
Graphsync has its own notion of backpressure -- it will stop queuing when a certain amount of memory has been allocated but not yet sent. We built this to stop graphsync hogging memory when the network slows down. So, for this to work as a detection mechanism, the diff between queued & sent must be lower than the backpressure default in graphsync (I believe 16MB per peer). There aren't any defaults listed here, but I think you should inspect the Graphsync defaults: https://github.com/ipfs/go-graphsync/blob/master/impl/graphsync.go#L32, and your defaults, which I assume are in the PR on markets, plus make sure Lotus isn't overriding the graphsync defaults (which would not surprise me).
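
To make the constraint concrete, here is a hypothetical sanity check along the lines hannahhoward describes; the 16MB figure comes from the comment above, and the function and constant names are assumptions rather than real go-data-transfer or go-graphsync APIs:

```go
package main

import "fmt"

// Illustrative only: 16MB is the per-peer memory limit mentioned above; the
// names here are not taken from the actual libraries.
const graphsyncMaxMemoryPerPeer = 16 << 20 // 16MB

// validateStallThreshold checks that the pending-data threshold used for stall
// detection is below graphsync's backpressure limit; otherwise queued minus
// sent can never grow large enough for a stall to be detected.
func validateStallThreshold(minPendingBytes uint64) error {
	if minPendingBytes >= graphsyncMaxMemoryPerPeer {
		return fmt.Errorf("stall threshold %d bytes must be below graphsync's per-peer backpressure limit of %d bytes",
			minPendingBytes, graphsyncMaxMemoryPerPeer)
	}
	return nil
}

func main() {
	fmt.Println(validateStallThreshold(1 << 20))  // ok: 1MB < 16MB
	fmt.Println(validateStallThreshold(32 << 20)) // error: threshold too high to ever trigger
}
```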

@hannahhoward
Collaborator

Follow-up: One thing just to factor in for the future is that while retrievals seem to be almost no one's priority, we're going to have to deal with this problem eventually for pulls too. And while ipfs/go-graphsync#129 might help, we may also want to do some kind of keep-alive in go-graphsync? I guess maybe go-yamux does that already.

@dirkmc
Contributor Author

dirkmc commented Dec 15, 2020

@hannahhoward for pulls I'm not sure we need this mechanism; I think it may make more sense to rely on go-graphsync itself to do retries. go-graphsync has some retry capability, but it would be nice if it worked the same way as the retries in go-fil-markets and go-data-transfer.

Collaborator

@hannahhoward hannahhoward left a comment


LGTM -- just keep an eye out for the backpressure issue when you get to Lotus

@dirkmc
Contributor Author

dirkmc commented Dec 16, 2020

I realized that the data-rate monitoring mechanism I implemented was not very granular. If, for example, the amount of pending data suddenly increased right before the data rate was checked, it would appear that the data rate was too low and the monitor would trigger a channel restart.

The latest commit instead adds functionality such that:

  • the data rate is checked and recorded multiple (configurable) times per interval, e.g. 10
  • each check compares the amount sent between the start and end of the interval against the minimum required data rate (sketched below)
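
A minimal Go sketch of this refined scheme, assuming hypothetical names (this is not the actual pushChannelMonitor implementation):

```go
package main

import (
	"fmt"
	"time"
)

// rateConfig mirrors the idea of "checks per interval" described above.
type rateConfig struct {
	interval          time.Duration // e.g. 1s
	checksPerInterval int           // e.g. 10
	minBytesSent      uint64        // e.g. 1MB per interval
}

// monitorDataRate samples the total bytes sent several times per interval and
// compares each sample against the sample taken one full interval earlier, so
// a sudden jump in queued data cannot cause a false stall detection.
func monitorDataRate(cfg rateConfig, bytesSent func() uint64, restart func()) {
	samples := make([]uint64, cfg.checksPerInterval) // ring buffer of sent totals
	tick := time.NewTicker(cfg.interval / time.Duration(cfg.checksPerInterval))
	defer tick.Stop()

	i := 0
	for range tick.C {
		sentNow := bytesSent()
		if i >= cfg.checksPerInterval {
			// Compare against the sample recorded one full interval ago.
			sentThen := samples[i%cfg.checksPerInterval]
			if sentNow-sentThen < cfg.minBytesSent {
				restart()
				return
			}
		}
		samples[i%cfg.checksPerInterval] = sentNow
		i++
	}
}

func main() {
	cfg := rateConfig{interval: time.Second, checksPerInterval: 10, minBytesSent: 1 << 20}
	sent := uint64(0) // a transfer that never makes progress
	monitorDataRate(cfg, func() uint64 { return sent }, func() { fmt.Println("restart: transfer stalled") })
}
```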

@nonsense
Member

The new algorithm looks correct to me.

@dirkmc dirkmc merged commit 288413b into master Dec 16, 2020
@dirkmc dirkmc deleted the feat/push-auto-retry branch December 16, 2020 14:09