protocols/kad: accounting error for background job capacity #4191

iand · 2023-07-12T15:09:20Z

Summary

In the kad behaviour poll function, the current query capacity is calculated first. A loop then runs over up to num possible provider announcement jobs, starting any that are ready. However, as soon as one returns pending, the loop exits via break. The current query capacity is then decreased by num, but because of the break the number of queries actually started may be less than num. It seems to me that the capacity should be reduced by the iteration reached in the loop, not the upper bound. (here: https://github.com/libp2p/rust-libp2p/blob/master/protocols/kad/src/behaviour.rs#L2428)

Then the code proceeds to iterate over replication jobs, using the reduced capacity as a limit on the number that may be started.
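
To make the accounting difference concrete, here is a minimal standalone sketch. It is not the actual rust-libp2p code: MockJob, both function names, and the capacity values 100 and 10 are hypothetical stand-ins chosen for illustration. The first function subtracts the upper bound num as described above, the second subtracts only the number of queries that were started.

```rust
use std::task::Poll;

/// Hypothetical stand-in for a background job: it yields its queued
/// results in order and reports `Poll::Pending` once it reaches a
/// pending entry or runs out of entries.
struct MockJob {
    results: Vec<Poll<&'static str>>,
    next: usize,
}

impl MockJob {
    fn new(results: Vec<Poll<&'static str>>) -> Self {
        MockJob { results, next: 0 }
    }

    fn poll(&mut self) -> Poll<&'static str> {
        match self.results.get(self.next) {
            Some(Poll::Ready(key)) => {
                self.next += 1;
                Poll::Ready(*key)
            }
            _ => Poll::Pending,
        }
    }
}

/// Accounting as described in the report: subtract the upper bound `num`,
/// even if the loop exited early.
fn capacity_after_subtracting_bound(job: &mut MockJob, capacity: usize, max_new: usize) -> usize {
    let num = usize::min(max_new, capacity);
    for _ in 0..num {
        if let Poll::Ready(_key) = job.poll() {
            // A query would be started here.
        } else {
            break;
        }
    }
    capacity - num
}

/// Suggested accounting: subtract only the queries actually started.
fn capacity_after_subtracting_started(job: &mut MockJob, capacity: usize, max_new: usize) -> usize {
    let num = usize::min(max_new, capacity);
    let mut started = 0;
    for _ in 0..num {
        if let Poll::Ready(_key) = job.poll() {
            started += 1;
        } else {
            break;
        }
    }
    capacity - started
}

fn main() {
    // Two ready items, then the job reports pending.
    let results = || vec![Poll::Ready("a"), Poll::Ready("b"), Poll::Pending];
    let bound = capacity_after_subtracting_bound(&mut MockJob::new(results()), 100, 10);
    let started = capacity_after_subtracting_started(&mut MockJob::new(results()), 100, 10);
    // Prints "subtracting num: 90, subtracting started: 98".
    println!("subtracting num: {bound}, subtracting started: {started}");
}
```

With two ready items followed by a pending result, the first variant leaves 90 units of capacity for the replication jobs while the second leaves 98, even though only two queries were started in both cases.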

Expected behaviour

jobs_query_capacity should reflect an accurate capacity for performing replication / publication jobs

Actual behaviour

jobs_query_capacity is always lower than the actual capacity so fewer replication / publication jobs can be run than expected

Would you like to work on fixing this bug?

Maybe, although I have no actual Rust skills 😁


mxinden · 2023-07-13T09:52:15Z

Thanks Ian for the detailed report!

Indeed, there's a chance that jobs_query_capacity might be reduced by an extra unit in the event that job.poll returns Poll::Pending.

"is always lower than the actual capacity"

Nitpick: I don't think that is true. If all num calls to job.poll return Poll::Ready, jobs_query_capacity is accurate. Am I missing something?

"Maybe, although I have no actual Rust skills"

Contribution would be very much appreciated. Happy to help! Also happy to pair program on this.

On a higher level, I find the throttling mechanism (i.e. JOBS_MAX_QUERIES and JOBS_MAX_NEW_QUERIES) rather brittle. Having a limit per poll invocation does not make sense to me, given that poll invocations might happen on the order of microseconds whereas requests resolve on the order of milliseconds. The overall limit is thus quickly reached through consecutive invocations, despite the per-poll limit.
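
A rough way to see this is the toy model below. It is only a sketch under stated assumptions: the constant values 100 and 10 are assumed for illustration, polls_until_saturated is a hypothetical helper, and the model assumes that no query completes between polls (because polls are much faster than requests).

```rust
// Assumed values for illustration only; the real constants live in the kad crate.
const JOBS_MAX_QUERIES: usize = 100;
const JOBS_MAX_NEW_QUERIES: usize = 10;

/// Hypothetical helper: counts how many consecutive poll invocations it
/// takes until either all ready jobs have been started or the overall limit
/// is reached, assuming no in-flight query finishes in the meantime.
/// Returns `(polls, in_flight)`.
fn polls_until_saturated(ready_jobs: usize) -> (usize, usize) {
    let mut in_flight = 0;
    let mut remaining = ready_jobs;
    let mut polls = 0;
    while remaining > 0 && in_flight < JOBS_MAX_QUERIES {
        let capacity = JOBS_MAX_QUERIES - in_flight;
        // Each poll may start at most JOBS_MAX_NEW_QUERIES queries.
        let started = JOBS_MAX_NEW_QUERIES.min(capacity).min(remaining);
        in_flight += started;
        remaining -= started;
        polls += 1;
    }
    (polls, in_flight)
}

fn main() {
    let (polls, in_flight) = polls_until_saturated(1000);
    // Prints "10 polls, 100 queries in flight".
    println!("{polls} polls, {in_flight} queries in flight");
}
```

Under these assumptions the overall limit is hit after ten back-to-back polls, so the per-poll cap delays saturation only by a handful of poll invocations rather than spreading the queries out over time.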

mxinden · 2023-07-13T09:52:57Z

//CC @dariusc93 in case you are seeing this issue on https://github.com/dariusc93/rust-ipfs.


mxinden added the bug, help wanted, and getting-started labels Jul 13, 2023

junekhan mentioned this issue Feb 5, 2024

fix(kad): compute jobs_query_capacity accurately #5148

Merged

mergify bot closed this as completed in #5148 Mar 27, 2024

mergify bot pushed a commit that referenced this issue Mar 27, 2024

fix(kad): compute jobs_query_capacity accurately

89c684a

Fixes: #4191. Pull-Request: #5148.

guillaumemichel pushed a commit that referenced this issue Mar 28, 2024

fix(kad): compute jobs_query_capacity accurately

e5346ec

Fixes: #4191. Pull-Request: #5148.
