Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Torrent's health zombie records with invalid last check time from the future infect Tribler nodes, spread from machine to machine, and can't be fixed by local health checks #7295

Closed
kozlovsky opened this issue Feb 20, 2023 · 1 comment
Assignees
Milestone

Comments

@kozlovsky
Copy link
Contributor

kozlovsky commented Feb 20, 2023

While reviewing #7286, I discovered a very interesting bug when incorrect health information with a future date and sometimes a huge number of seeders spreads from peer to peer infecting new peers and spreading even further. This becomes possible through a combination of several bugs in TorrentChecker, MetadataStore, and PopularityCommunity.

As I currently understand it, the bug works the following way:

  1. Some Tribler clients' machines have totally incorrect local times. I think some active peer has local time +35 days in the future, and sometimes I also see records with a date several years in the future.

  2. These nodes spread incorrect information to their peers.

  3. On receiving health information from the remote peer, PopularityCommunity calls MetadataStore.process_torrent_health method:

    @db_session
    def process_torrents_health(self, torrent_healths):
        infohashes_to_resolve = set()
        for infohash, seeders, leechers, last_check in torrent_healths:
            added = self.mds.process_torrent_health(infohash, seeders, leechers, last_check)  # <-- here
            if added:
                infohashes_to_resolve.add(infohash)
        return infohashes_to_resolve

and this method stores information in the DB with the only check that the health's last check date should be later than the last check date in the DB.

    def process_torrent_health(self, infohash: bytes, seeders: int, leechers: int, last_check: int) -> bool:
        health = self.TorrentState.get_for_update(infohash=infohash)
        if health and last_check > health.last_check:
            health.set(seeders=seeders, leechers=leechers, last_check=last_check)  # <-- here
            ...

What is important, during this update, the state of the TorrentState.self_checked flag does not alter. That means if a peer checked the health of this torrent at least once in the past, then the health record is considered a locally checked record, even if the info from the remote peer overrode it.

Then, sometime later, the infected Tribler client is restarted. After the restart, PopularityCommunity loads recently checked torrents to spread them to other peers. By "recently checked torrents," it means "self-checked torrents that were checked four hours ago or later":

@db_session
def load_torrents_checked_from_db(self):
    last_fresh_time = time.time() - HEALTH_FRESHNESS_SECONDS
    checked_torrents = list(self.mds.TorrentState
                    .select(lambda g: g.has_data and g.last_check > last_fresh_time and g.self_checked)  # <-- here
                    .order_by(lambda g: (desc(g.seeders), g.last_check))
                    .limit(TORRENTS_CHECKED_RETURN_SIZE))

    for torrent in checked_torrents:
        self._torrents_checked[torrent.infohash] = (torrent.infohash, torrent.seeders, torrent.leechers,
                                                    torrent.last_check)

All invalid torrents with future check dates that could override the local self-checked health suit this condition perfectly. And the correct locally checked health records are loaded only if they are no more than four hours old. As a result, it is pretty usual that a tribler client, when restarted, only loads invalid health records, hundreds of them, with dates a month or a year in the future, and starts spreading them to its peers.

The following bug makes the problem even worse: TorrentChecker has a cache of recently checked torrents, and PopularityCommunity spreads health records contained in this cache. But if the local check reveals a zero number of seeders for a torrent, the cache update is skipped in this case:

    def update_torrents_checked(self, new_result):
        """
        Update the set with torrents that we have checked ourselves.
        """
        infohash = new_result['infohash']
        seeders = new_result['seeders']
        new_result_tuple = (infohash, seeders, new_result['leechers'], new_result['last_check'])

        if seeders > 0:  # <-- here
            self._torrents_checked[infohash] = new_result_tuple

As a result, when the Tribler client starts, it loads hundreds of invalid health records, often with a huge number of seeders, as if they were locally checked. Even if Tribler later checks this torrent and discovers that the actual number of seeders is zero, it does not remove the invalid health record from the cache and continues spreading it to its peers. So invalid health records with huge numbers of peers and time in the future behave like invulnerable zombies that infect other machines and continue spreading with an incorrect huge number of seeders even after the local check reveals zero seeders.


The discovery of this amazing combination of bugs becomes possible after the @drew2a refactoring of TorrentChecker. I fixed it in #7286 with the following fixes:

  1. Invalid health records are rejected and do not "infect" local records (fixed here and here in 99d5725)
  2. Health records just slightly in the future are accepted, as some minor time drift is unavoidable (fixed here in 99d5725).
  3. Valid remote health records do not override local health records if the local check was very recent, as we trust local checks more (fixed here and here in 2bfc9da).
  4. When a valid remote health record overrides the previous local record, it clears the self_checked flag, so the record is not considered by PopularityCommunity as local anymore (fixed here in 8a0eb9e).
  5. If the number of seeders of the new valid local health record is zero, the torrent is removed from the cache of locally recently checked health records, so the local peer stops spreading information about this torrent (fixed here in 943a138).
  6. When TorrentChecker loads locally checked torrents, it now filters out records with invalid last check time (fixed here in 8fea75d)

Also, Tribler now logs information about rejected invalid health records received from other peers, so by looking into the logs, it becomes possible to estimate the severity of the problem and see in real time how many invalid health records Tribler client receives at every moment.

@kozlovsky kozlovsky added this to the 7.13.0 milestone Feb 20, 2023
@kozlovsky kozlovsky self-assigned this Feb 20, 2023
@kozlovsky
Copy link
Contributor Author

Fixed in #7286

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

1 participant