PEX - sharing peers with others #261

Draft - wants to merge 6 commits into main
Conversation

izderadicka (Contributor)

So far only the receiving part of PEX was implemented; this is an attempt to also send the PEX extended message to connected peers.

It's still WIP; there is a problem with the E2E test, which is unstable and often ends with:

---- tests::e2e::test_e2e_download stdout ----
[crates/librqbit/src/tests/e2e.rs:65:9] tempdir.path() = "/tmp/rqbit_e2eYmBFKD"
thread 'tests::e2e::test_e2e_download' panicked at crates/librqbit/src/tests/test_util.rs:213:6:
called `Result::unwrap()` on an `Err` value: wait_until timeout: last result = Some(metrics.num_alive_tasks() = 0, expected 1)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

which does not make sense to me - how could the number of live tasks in the runtime be 0?

There is also another interesting behavior of the e2e test: if I change the initial delay of the PEX send to 0 (here https://github.com/izderadicka/rqbit/blob/82914bac27226b1d10157832ee66aeb7af67ca18/crates/librqbit/src/torrent_state/live/mod.rs#L807), then:

  1. Sometimes some server tasks do not end (5 tasks per server), as if the server session was not dropped. It's kind of arbitrary: the message is the same as above, but the number of tasks is 5, 10, or n x 5 (the maximum I have seen was 80).
  2. Servers send PEX messages between themselves, which is strange: their torrent is fully downloaded, so they should not open new connections, and peers received from the client should just be stored as not_needed without initiating a connection attempt - at least this is how I understand the code.

@izderadicka (Contributor Author)

Actually I see the same problems also on the main branch - can you try this on main?

#!/usr/bin/env bash
# Repeatedly run the e2e test and collect failure logs.
# Pass the number of runs as the first argument (default 100).

# stop gracefully on SIGINT
STOP=""
trap "STOP=1" INT

NUM_RUNS=${1:-100}
TOTAL=$NUM_RUNS
FAILURES=0
export NO_COLOR=1
rm -f /tmp/librqbit_e2e_failures.log
while [ "$NUM_RUNS" -gt 0 ]; do
    echo "Running test $((TOTAL - NUM_RUNS + 1)) of $TOTAL"
    NUM_RUNS=$((NUM_RUNS - 1))
    cargo test --package librqbit tests::e2e::test_e2e_download > /tmp/librqbit_e2e_run.log
    res=$?
    if [ $res -ne 0 ]; then
        FAILURES=$((FAILURES + 1))
        echo "FAILURE with code $res" >> /tmp/librqbit_e2e_failures.log
        tail -50 /tmp/librqbit_e2e_run.log >> /tmp/librqbit_e2e_failures.log
        echo "----------------------------------------------" >> /tmp/librqbit_e2e_failures.log
    fi
    if [ "$STOP" = "1" ]; then
        break
    fi
done
rm -f /tmp/librqbit_e2e_run.log
echo "Total runs: $TOTAL, failures: $FAILURES"
echo "Failures log: /tmp/librqbit_e2e_failures.log"

On the main branch I got 28 failures out of 100 - mainly the above-mentioned 0 tasks, but also a couple of dropcheck failures.

@ikatson (Owner)

ikatson commented Oct 27, 2024

Maybe some tokio upgrade broke it, I'll take a look someday. Just got back from a 10 day break, so catching up.

let mut sent_peers_live: HashSet<SocketAddr> = HashSet::new();
const MAX_SENT_PEERS: usize = 50; // As per BEP 11 we should not send more than 50 peers at once (here it also applies to the first message; should be OK as we rarely have more anyway)
const PEX_MESSAGE_INTERVAL: Duration = Duration::from_secs(60); // As per BEP 11 the recommended interval is at least 60 seconds
let mut delay = Duration::from_secs(10); // Wait 10 seconds before sending the first message to ensure the peer will stay with us
ikatson (Owner)

tokio::time::interval would be cleaner as it serves the same purpose as sleeping in a loop but is a bit nicer
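
For reference, a minimal sketch of what an interval-based loop might look like (illustrative only; the function name is made up, only the tokio::time::interval usage is the point):

```rust
use std::time::Duration;

// Hypothetical loop shape for sending PEX on a fixed cadence.
async fn pex_send_loop() {
    let mut interval = tokio::time::interval(Duration::from_secs(60));
    loop {
        // Note: the first tick() completes immediately.
        interval.tick().await;
        // ... build and send the PEX extended message here ...
    }
}
```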

izderadicka (Contributor Author)

Main reasons for using sleep:

  • initial wait - I do not want to start immediately after the peer connection, as the peer can quickly disconnect; I only want to send PEX to 'stable' peers. interval fires its first tick immediately.
  • changing wait period - now it's 10 seconds, then 60. But it can be extended - I was thinking about growing the interval, up to 600 seconds or so.

From this perspective sleep seems easier to use (see the sketch below).
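
A minimal sketch of the sleep-based shape described above (the function name is illustrative; PEX_MESSAGE_INTERVAL matches the constant in the diff):

```rust
use std::time::Duration;

const PEX_MESSAGE_INTERVAL: Duration = Duration::from_secs(60);

// Hypothetical loop shape; the changing delay is the point here.
async fn pex_send_loop() {
    // Initial wait so that PEX only goes to peers that stayed connected.
    let mut delay = Duration::from_secs(10);
    loop {
        tokio::time::sleep(delay).await;
        // ... build and send the PEX extended message here ...
        // Subsequent messages follow the BEP 11 minimum of 60 seconds; the delay
        // could later be grown further (e.g. towards 600 s) without changing the loop.
        delay = PEX_MESSAGE_INTERVAL;
    }
}
```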

crates/librqbit/src/torrent_state/live/mod.rs - outdated review thread (resolved)
None
}
})
.take(50)
ikatson (Owner)

Except for the initial PEX message the combined amount of added v4/v6 contacts should not exceed 50 entries. The same applies to dropped entries.

Afaiu this implies that on the next iteration you'll get the next 50 peers, but in your code you're only looking at the first 50 live ones, and you'll never look at those past the first 50.

izderadicka (Contributor Author)

There is also the condition && !sent_peers_live.contains(addr) on line 834, which filters out already-sent addresses - so what I get are the next 50 that have not been sent to the peer. I looked at it in a swarm on a busy torrent (Ubuntu) and it seemed to work.
Anyhow, this is also related to the other comment, so it may change.

.filter(|addr| {
self.peers
.states
.get(addr)
ikatson (Owner)

you're locking the state twice, and it may race - e.g. a peer will be live in the first pass where you are computing "addrs_live_to_sent" and then not live later, or the other way around.

It's hard to reason about what will happen in this case, esp. when below you're trying to provide strong guarantees like "it's assured by mutual exclusion of two above sets".

So overall, how about refactoring this for simplicity (sketched below):

  1. keep the peer's view of our live set in "sent_peers_live" as you are doing now.
  2. for each loop iteration:
    1. collect all live addrs into a HashSet (not bounded by 50, but bounded by the semaphore anyway, 128 if I remember correctly). Let's name it "all_live"
    2. connected = (all - sent).take(50)
    3. dropped = (sent - all).take(50)
    4. sent = sent + connected - dropped
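
A minimal sketch of that proposal, assuming the live-address snapshot is passed in as a closure standing in for a single-lock pass over the peer states (all names besides sent/connected/dropped are made up):

```rust
use std::{collections::HashSet, net::SocketAddr, time::Duration};

const MAX_SENT_PEERS: usize = 50;
const PEX_MESSAGE_INTERVAL: Duration = Duration::from_secs(60);

// `snapshot_live` returns all currently live peer addresses in one pass.
async fn pex_loop<F>(mut snapshot_live: F)
where
    F: FnMut() -> HashSet<SocketAddr>,
{
    // The peer's view of our live-peer set, kept across iterations.
    let mut sent: HashSet<SocketAddr> = HashSet::new();
    loop {
        tokio::time::sleep(PEX_MESSAGE_INTERVAL).await;

        // 1. Collect all live addrs in one pass.
        let all_live = snapshot_live();

        // 2. connected = (all - sent).take(50)
        let connected: HashSet<SocketAddr> = all_live
            .difference(&sent)
            .copied()
            .take(MAX_SENT_PEERS)
            .collect();

        // 3. dropped = (sent - all).take(50)
        let dropped: HashSet<SocketAddr> = sent
            .difference(&all_live)
            .copied()
            .take(MAX_SENT_PEERS)
            .collect();

        // 4. sent = sent + connected - dropped
        sent.extend(connected.iter().copied());
        sent.retain(|addr| !dropped.contains(addr));

        // ... build a PEX message from `connected` and `dropped` and send it ...
    }
}
```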

izderadicka (Contributor Author)

I see - but I think that's not the case here, because the liveness check actually happens only once, depending on the content of sent_peers_live, which reflects already-sent peer addresses:

  • if an address is in sent_peers_live, then it's effectively checked for liveness only on line 853 .map(|p| !p.value().state.is_live()) (for closed peers), because line 834 && !sent_peers_live.contains(addr) ultimately excludes it from the addrs_live_to_sent set.
  • if an address is not in sent_peers_live, it's checked only for addrs_live_to_sent on line 832 if peer.state.is_live(), and the second check for closed peers does not happen at all, because that one iterates over sent_peers_live.

So in any case only one liveness check per loop run happens that is relevant for the logic.

ikatson (Owner)

My main point was that the logic is hard to read, and I proposed a simpler way to write it. All the state that you need to keep in your mind to understand why the liveness check is happening only once might serve as proof of that :)

izderadicka (Contributor Author)

I do not think the logic is difficult - it's all about sent_peers_live, which we need anyhow: if an address is not in it, we check whether we can add it as a new address (live and not yet sent); if an address is in it, we check whether we should remove it because it's no longer live. That's it.

The proposed change will work exactly the same, I think. Only we would always need to collect all live peers, which is not really "needed" - we just need those that are not in sent_peers_live.
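
For comparison, a sketch of the current incremental logic as I read it from the discussion (the map of addresses to liveness flags stands in for the real peer-state map; all names are illustrative):

```rust
use std::{
    collections::{HashMap, HashSet},
    net::SocketAddr,
};

// Returns (addresses to announce as added, addresses to announce as dropped).
fn split_pex_updates(
    peers: &HashMap<SocketAddr, bool>, // addr -> is_live
    sent_peers_live: &HashSet<SocketAddr>,
) -> (Vec<SocketAddr>, Vec<SocketAddr>) {
    // Live addresses that have not been announced to this peer yet.
    let added: Vec<SocketAddr> = peers
        .iter()
        .filter(|(addr, is_live)| **is_live && !sent_peers_live.contains(*addr))
        .map(|(addr, _)| *addr)
        .take(50)
        .collect();

    // Previously announced addresses that are no longer live.
    let dropped: Vec<SocketAddr> = sent_peers_live
        .iter()
        .filter(|addr| !peers.get(*addr).copied().unwrap_or(false))
        .copied()
        .take(50)
        .collect();

    (added, dropped)
}
```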

crates/librqbit/src/torrent_state/live/mod.rs - outdated review thread (resolved)
@ikatson (Owner)

ikatson commented Nov 6, 2024

As for flaky tests, it seems stable now on my box (without extensive testing). The fixes are here:
e371522
87e09a6

Can you check if it's still bad for u?

@izderadicka (Contributor Author)

> As for flaky tests, it seems stable now on my box (without extensive testing). The fixes are here: e371522 87e09a6
>
> Can you check if it's still bad for u?

Checked on latest main and no problem; only in one run the test complained about Blocking waiting for file lock on package cache 3 times, but the test succeeded then. At that time my notebook was heavily loaded by these tests and some other work, so I guess that was the reason.
