Tag based content discovery and crowdsourcing #6214

Closed
ichorid opened this issue Jul 9, 2021 · 41 comments
@ichorid
Contributor

ichorid commented Jul 9, 2021

After a thorough discussion, we came to the following architecture for the Tags system:

  • every metadata entry should have one or several tags
  • the tag assigning system maintains a local database of tags, based on frequencies
  • whenever a user starts a download (from a channel or from an external source), if the download has fewer than 5 tags, Tribler gently asks the user to add some tags (from suggested tags)
  • if the user suggests a change to a torrent's tags, a PR into the owner's channel is automatically created and propagated upstream
  • the tag selection system should be bootstrapped from some collective knowledge pool, or from other users
  • we should first create some wireframes to see what adding tags and searching would look like
  • the main reason for the tag system is deduplication and gently nudging users to organize in collectives
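
A minimal sketch of the prompting rule above, assuming the local tag database can be reduced to a frequency counter. The names `MIN_TAGS`, `should_prompt`, and `suggest_tags` are hypothetical and do not come from the Tribler code base:

```python
from collections import Counter

MIN_TAGS = 5  # threshold taken from the proposal above


def should_prompt(download_tags):
    """Ask the user for tags only when the download has fewer than MIN_TAGS."""
    return len(download_tags) < MIN_TAGS


def suggest_tags(local_frequencies, existing_tags, limit=5):
    """Return the most frequent locally known tags that the download lacks."""
    candidates = (tag for tag, _ in local_frequencies.most_common())
    return [tag for tag in candidates if tag not in existing_tags][:limit]


# Hypothetical local frequency table and download
frequencies = Counter({"linux": 40, "iso": 25, "ubuntu": 10, "music": 3})
current_tags = {"ubuntu"}
suggestions = suggest_tags(frequencies, current_tags, limit=2)
```

Here `should_prompt(current_tags)` is true because the download has a single tag, and the suggestions are drawn from the most frequent tags not yet assigned.
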
@grimadas
Contributor

grimadas commented Jul 9, 2021

Comments regarding tag-based organisation and why we need it in the first place:

Why tag-centric organisation?

  • A more natural way to organize people: not around a channel but around a real interest.
  • Most channels are GIGA: hard to moderate, and the huge amount of content is hard to even look through. There is too much dependency on the channel owner.
  • How many users are using the channel feature? Are they even browsing?

What does tag-centric even mean?

A tag-centric organization consists of people, content, and metadata.

  • Content. Content that is associated with the tag, e.g. video, audio, text data, etc.
  • Metadata. Forums, chats, structure, labels, etc.
  • People. People are a crucial part of this organisation. They might be consumers, contributors, servers, or moderators. There are multiple roles.

The vision for Tribler:

  • Let's give people more tools to organise communities based on tags.
  • Metrics: engagement, whether users use the features, whether tag-communities are available, etc.
  • We decrease the time needed to create a community. Communities will organise faster and at a bigger scale.

@synctext
Member

synctext commented Jul 13, 2021

This issue redoes work already conducted 14 years ago, which failed at the time.
This is a big task, beyond a single developer. Read the full article about this architecture:
image
The exact hard-core science details:
image
Complete PhD thesis of 202 pages with many details. The key piece that is missing is the post-mortem of why this failed to move beyond prototyping. Sadly, that is only in my head.

Our 'recent' tag work of only 10 years ago. High performance implementation of tag-based discovery: https://dl.acm.org/doi/abs/10.1145/2063576.2063852
image

@drew2a
Contributor

drew2a commented Sep 9, 2021

Draft Wireframes for tags (updated 10.09.21):

Tags drawio

Source: https://drew2a.notion.site/Tags-c1567365a1c94ce78271257f0aa19b06

@drew2a drew2a modified the milestones: Next-next release, Backlog Sep 15, 2021
@synctext
Member

Solid progress.
The tag suggestion box is indeed the hardest part to make. Over the years it will need to evolve to understand and expand synonyms, as Stack Overflow does: https://stackoverflow.com/tags/synonyms?tab=Renames&dir=Descending&filter=Active

@drew2a drew2a changed the title Tags system for Channels Tags System Sep 21, 2021
@ichorid ichorid assigned devos50 and unassigned ichorid Sep 21, 2021
@devos50
Contributor

devos50 commented Sep 27, 2021

Trying out something new here. When it comes to interaction design, it is common to work with user stories that help to think about the user. Even after writing the simple user stories below, I feel that they are very helpful in getting rid of the bias towards the developer that we most likely all have. Feel free to improve/give feedback on the following user stories.

Personas

I can think of two personas:

  • Persona 1. A user uses Tribler to (anonymously) download the content they want. This user has fetched a torrent or a magnet link from an external (non-Tribler) source.
  • Persona 2. A user uses Tribler to find some content they might be interested in.
    These personas are not mutually exclusive - a user might simultaneously download some content using Tribler and search for other content.

Epic

Within this issue, we focus on persona 2 since the goals of persona 1 are addressed by different components. Tribler currently has a subpar user experience when it comes to finding and recommending content. As discussed before, we want to see if tags are able to improve this situation. As an overarching goal, Tribler should be extended with the following functionality:

  • Users should be able to view the tags assigned to a particular piece of content.
  • Users should be able to quickly find all content tagged with a particular word, e.g., by using the search bar or by clicking on a particular tag.
  • Tribler should recommend users content they might like.

User stories

For a minimal version of tags, I see the following five user stories:

  1. As a Tribler user, I want to view the tags associated with a content item, so that I can get an impression of what the content is about.
  2. As a Tribler user, I want to add missing tags to particular content, so that I can contribute to the content catalogue and improve it.
  3. As a Tribler user, I want to remove inaccurate tags from particular content, so that I can contribute to the content catalogue and improve it.
  4. As a Tribler user, I do not want to see profane tags when the family filter is enabled, so that <well, this one is pretty obvious I guess>.
  5. As a Tribler user, I want to disable the entire tag system, so that I can hide tags when I believe they do not add much value.

Note: each user story should be clear, feasible, and testable, also see this article.
Note 2: These user stories exclusively focus on tag visibility and moderation, and exclude stories for tag navigation and searching

@devos50
Contributor

devos50 commented Sep 28, 2021

After some discussions and mock-ups, here's a GUI preview of the resolution of user story 1:

Schermafbeelding 2021-09-28 om 15 31 01

Note that I use the "GUI test mode" for prototyping so the tags/titles do not make sense yet. I'm hovering over the 'edit' button in the first row. Clicking on the pencil will bring up a dialog where a user can suggest/remove tags (we reached majority consensus on using a dialog), but that dialog is not ready yet. Color scheme/margins/paddings/sizes have not been finalized yet.

@devos50
Contributor

devos50 commented Sep 29, 2021

We made a few design decisions:

  • For the first version, we will not allow spaces in tags. The main arguments are code simplicity and the fact that users are accustomed to adding a new tag by typing a space in an input field.
  • We will change existing queries, and return tags in a separate tags field as part of the to_simple_dict conversion of TorrentMetadata.
  • Tags will be stored in a separate table in the metadata.db database and uniquely referenced (just like torrent health).
  • We also decided to include a separate button in the settings pane to hide tags, if the functionality gets abused (although we should build in a few basic countermeasures to prevent abuse).
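
As an illustration of the "no spaces in tags" decision, the input field can treat whitespace as the tag separator. This is only a sketch; `parse_tags` is a hypothetical helper, and dropping duplicates is an assumption rather than a documented rule:

```python
def parse_tags(raw_input: str) -> list:
    """Split the text of an input field into tags, using whitespace as the separator."""
    seen = set()
    tags = []
    for tag in raw_input.split():
        if tag not in seen:  # assumption: repeated tags are ignored
            seen.add(tag)
            tags.append(tag)
    return tags


parsed = parse_tags("linux  iso linux")
```

Because a space always ends a tag, no quoting or escaping rules are needed in the dialog.
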

@devos50
Contributor

devos50 commented Sep 29, 2021

To resolve user stories 1 and 2, we made the following dialog (still very subject to change):

Schermafbeelding 2021-09-29 om 16 34 41

Update: added an additional setting to disable tag visibility (checkbox is disabled by default):

Schermafbeelding 2021-09-30 om 15 57 14

@synctext synctext changed the title Tags System Tag based content discovery and crowdsourcing Sep 30, 2021
@synctext
Member

synctext commented Sep 30, 2021

Something to think about. Do we want to build a community of "taggers" after launch of 7.11? Or let our users know using TorrentFreak after a few months of more iterations and improvements.
The following Creative Commons music tags are available for bootstrap purposes (see other Tribler music issues):

@devos50
Contributor

devos50 commented Oct 1, 2021

Do we want to build a community of "taggers" after launch of 7.11? Or let our users know using TorrentFreak after a few months of more iterations and improvements.

Building a community would be great, but would probably require more work beyond a minimal version (e.g., making the contributions of a particular user visible). So let's first iterate and improve the current system.

@drew2a
Contributor

drew2a commented Oct 5, 2021

Design decisions behind the DB:

  1. We will use a separate database: https://github.com/drew2a/tribler/blob/feature/tag_community/src/tribler-core/tribler_core/components/tag/db/tag_db.py
  2. We will store not only tags but also user operations on tags (at most one operation per user for each torrent tag):
        class TorrentTagOp(db.Entity):
            id = orm.PrimaryKey(int, auto=True)

            torrent_tag = orm.Required(lambda: TorrentTag)
            peer = orm.Required(lambda: Peer)

            operation = orm.Required(int)
            time = orm.Required(int)
            signature = orm.Required(bytes)
            updated_at = orm.Required(datetime.datetime, default=datetime.datetime.utcnow)

            orm.composite_key(torrent_tag, peer)
  3. We will use two counters to access information about added and removed tags without aggregation:
        class TorrentTag(db.Entity):
            ...
            added_count = orm.Required(int, default=0)
            removed_count = orm.Required(int, default=0)
  4. We will use an indicator to distinguish local user operations and remote user operations:
        class TorrentTag(db.Entity):
            ...
            local_operation = orm.Optional(int) 
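
To make the interplay of the composite key and the two counters concrete, here is a rough sketch using the standard-library `sqlite3` module instead of Pony ORM. The table layout, the `apply_operation` helper, and the rule that a newer operation from the same peer replaces the older one are assumptions for illustration, not the actual Tribler implementation:

```python
import sqlite3

ADD, REMOVE = 1, 2  # operation codes assumed for this sketch

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE torrent_tag (
        id INTEGER PRIMARY KEY,
        added_count INTEGER NOT NULL DEFAULT 0,
        removed_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE torrent_tag_op (
        torrent_tag INTEGER NOT NULL REFERENCES torrent_tag(id),
        peer INTEGER NOT NULL,
        operation INTEGER NOT NULL,
        UNIQUE (torrent_tag, peer)  -- one operation per peer per torrent tag
    );
    INSERT INTO torrent_tag (id) VALUES (1);
""")


def apply_operation(torrent_tag: int, peer: int, operation: int) -> None:
    """Store a peer's operation and keep the counters in sync, so reads need no aggregation."""
    row = conn.execute(
        "SELECT operation FROM torrent_tag_op WHERE torrent_tag = ? AND peer = ?",
        (torrent_tag, peer)).fetchone()
    if row and row[0] == operation:
        return  # same operation again: nothing changes
    if row:  # undo the contribution of the peer's previous operation
        column = "added_count" if row[0] == ADD else "removed_count"
        conn.execute(f"UPDATE torrent_tag SET {column} = {column} - 1 WHERE id = ?", (torrent_tag,))
    column = "added_count" if operation == ADD else "removed_count"
    conn.execute(f"UPDATE torrent_tag SET {column} = {column} + 1 WHERE id = ?", (torrent_tag,))
    conn.execute(
        "INSERT OR REPLACE INTO torrent_tag_op (torrent_tag, peer, operation) VALUES (?, ?, ?)",
        (torrent_tag, peer, operation))


apply_operation(1, 100, ADD)
apply_operation(1, 101, ADD)
apply_operation(1, 100, REMOVE)  # peer 100 changes its mind
counts = conn.execute("SELECT added_count, removed_count FROM torrent_tag WHERE id = 1").fetchone()
op_rows = conn.execute("SELECT COUNT(*) FROM torrent_tag_op").fetchone()[0]
```
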

cc: @kozlovsky

@ichorid
Contributor Author

ichorid commented Oct 5, 2021

@kozlovsky , wouldn't using a separate DB make it impossible to do complex queries involving both Metadata store data and Tags data?

@kozlovsky
Contributor

@ichorid I think that with a separate tag database, developing an initial version of the tag-based system may be easier. Regarding queries, with our current approach to FTS search, there should be no difference between a single database and two separate databases. If necessary, we can combine the databases later, or even just attach the tag database to the metadata store DB.
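
The "attach" option mentioned above refers to SQLite's `ATTACH DATABASE` statement, which lets a single connection join tables across two database files. A small sketch with in-memory databases and hypothetical table names (`torrent`, `torrent_tag`), not the real Tribler schemas:

```python
import sqlite3

# The main connection stands in for the metadata store DB.
conn = sqlite3.connect(":memory:")
# The attached database stands in for the separate tag DB (normally a file path).
conn.execute("ATTACH DATABASE ':memory:' AS tags")

conn.execute("CREATE TABLE main.torrent (infohash TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE tags.torrent_tag (infohash TEXT, tag TEXT)")
conn.execute("INSERT INTO main.torrent VALUES ('aa11', 'ubuntu-22.04-desktop-amd64.iso')")
conn.execute("INSERT INTO main.torrent VALUES ('bb22', 'some-album.flac')")
conn.execute("INSERT INTO tags.torrent_tag VALUES ('aa11', 'linux')")
conn.execute("INSERT INTO tags.torrent_tag VALUES ('bb22', 'music')")

# A single query can now combine metadata and tags.
rows = conn.execute(
    "SELECT t.title, g.tag FROM main.torrent AS t "
    "JOIN tags.torrent_tag AS g ON g.infohash = t.infohash "
    "WHERE g.tag = ?", ("linux",)).fetchall()
```
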

@devos50
Contributor

devos50 commented Oct 6, 2021

Tag Reinforcement

Not sure if the suggestion below is applicable/suitable for the first version, but it is open for discussion.

Problem: To address the most trivial poisoning attacks, we decided that a particular tag will only be displayed when two identities have suggested it (thresholding). However, the chance that two users independently come up with the same tags for the same content is rather low. Even with a threshold of 2, I predict that much content will remain visibly untagged.

Potential solution: We can help the user by showing tags that have been suggested by other users but don't have enough support yet (i.e., haven't reached the threshold). This indication (e.g., "Suggestions: X, Y, Z" or "Suggested by others: A, B, C") should be part of the dialog where a user can add/remove tags, for example, below the input field. To prevent visual clutter, we should limit the number of suggestions shown.
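
A minimal sketch of the thresholding and suggestion idea, assuming operations arrive as `(peer, tag)` pairs. The names `SHOW_THRESHOLD`, `MAX_SUGGESTIONS`, and `split_tags` are hypothetical, and the alphabetical ordering of suggestions is an arbitrary choice:

```python
from collections import defaultdict

SHOW_THRESHOLD = 2   # a tag is displayed once two identities have suggested it
MAX_SUGGESTIONS = 3  # cap on suggestions, to prevent visual clutter


def split_tags(operations):
    """Split tags into visible ones and below-threshold suggestions."""
    supporters = defaultdict(set)
    for peer, tag in operations:
        supporters[tag].add(peer)  # a peer counts once per tag
    visible = sorted(t for t, peers in supporters.items() if len(peers) >= SHOW_THRESHOLD)
    pending = sorted(t for t, peers in supporters.items() if len(peers) < SHOW_THRESHOLD)
    return visible, pending[:MAX_SUGGESTIONS]


ops = [("peer_a", "ubuntu"), ("peer_b", "ubuntu"), ("peer_a", "linux"), ("peer_a", "ubuntu")]
visible_tags, suggested_tags = split_tags(ops)
```

In this example "ubuntu" is shown because two distinct peers support it, while "linux" only appears as a suggestion in the edit dialog.
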

@qstokkink
Contributor

If you need inspiration, there have been some academic works that look at the tag reinforcement of user-generated tags in the Steam Tags system (e.g., http://dx.doi.org/10.1145/3377290.3377300).

@synctext synctext mentioned this issue Oct 7, 2021
5 tasks
@synctext
Member

synctext commented Feb 8, 2022

Dataset for tagging: https://github.com/MTG/mtg-jamendo-dataset
55,525 tracks annotated by 87 genre tags, 40 instrument tags, and 56 mood/theme tags

@devos50
Contributor

devos50 commented Sep 21, 2022

The current version of tags has been running successfully for a few months now, and we have seen several tags that have been created by different users. As the next step, we want to use these tags and our existing infrastructure to improve the search experience. Concretely, our first goal is to identify and bundle torrents that describe similar content (for example: ubuntu-22.04-desktop-amd64.iso and ubuntu-22.04.1-desktop-amd64.iso). To describe relations between entities, we might want to add an additional 'relation' field to each tag. This requires modifications to the tags networking protocol and database schema.

Our upcoming improvements are also a key step towards readying our infrastructure to build and maintain a global knowledge graph. This knowledge graph can act as a fundamental primitive for upcoming science in the domain of content search, content navigation, and eventually content recommendation.

@devos50
Contributor

devos50 commented Oct 18, 2022

A preview of the 'edit metadata' screen that can be used to edit the metadata of particular torrent files. For this first iteration, users can edit the title, description, year and language fields.

196442102-16ef039f-49b6-42b2-a4a0-8a6061cdedcb

@synctext
Member

library science knowledge - related work. The manifestation versus item abstraction plus tagging.

@devos50 devos50 removed their assignment Feb 20, 2023
@synctext
Member

synctext commented Feb 23, 2023

"Justin Bieber is gay" scientific problem - tag spam

MeritRank is needed to fix this spam issue in future versions of Tribler. The fans and fame of artists also attract Internet trolls. In the past, we cofounded the MusicBrainz music metadata library. This crowdsourcing library has a unique dataset of votes on tags with explicit spam. See the gay and black metal spam on artist Justin Bieber:

Genres votes
pop 16
teen pop 8
dance-pop 0
hip hop 0
tropical house 0
black metal -1
christmas music -1
contemporary r&b -1
electropop -1
Other tags
2010s 1
2020s 1
awful 0
english 0
terrible 0
gay -2
amazing -3

Bieber has a profile page. The next step in our semantic search roadmap is modelling the split between concept and materialisation. The knowledge graph should contain both types of entries. See the early scientific beginnings (1994) of semantic data models with 1) the logic perspective and 2) the object perspective.

solution: gossip, signals, and reputation. Simple central reputation system of central profiles
gihtub_reputation_synctext_A

Publication venue: https://www.frontiersin.org/research-topics/19868/human-centered-ai-crowd-computing or
https://www.frontiersin.org/research-topics/19653/user-modeling-and-recommendations

@synctext
Member

synctext commented Apr 6, 2023

Dev meeting brainstorm outcome: Martijn has/had a crawler running with Tag-crowdsourcing. Check status @drew2a and 1-day dataset analysis with live "remove tag" within Tribler 7.13 release?

@drew2a
Contributor

drew2a commented May 25, 2023

Crawler status is:

   Active: active (running) since Thu 2022-09-29 08:44:03 UTC; 7 months 25 days ago

Tags DB analysis

Dataset

select count(*)
from Peer; --408

select count(*)
from Torrent; --12842

select count(*)
from Tag; --4862

select count(*)
from TorrentTag; --17312

select count(*)
from TorrentTagOp; --17670

Add vs Remove operations count

select count(*), TorrentTagOp.operation 
from TorrentTagOp 
group by operation
Count Operation
17298 Add
372 Remove
image

Operations' distribution among peers

select count(*), TorrentTagOp.operation, TorrentTagOp.peer
from TorrentTagOp
group by operation, peer
image

Normalized:

image

Peers grouped by operation:

# data: rows of (count, operation, peer) returned by the query above
peers_add = {peer for _, operation, peer in data if operation == 1}
peers_remove = {peer for _, operation, peer in data if operation == 2}
add_and_remove = peers_add.intersection(peers_remove)

print(f'Add and Remove: {len(add_and_remove)}')  # 77
print(f'Only Add: {len(peers_add.difference(add_and_remove))}')  # 305
print(f'Only Remove: {len(peers_remove.difference(add_and_remove))}')  # 26
image
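
For reference, the proportions behind the counts reported above can be derived with a few lines of Python. The numbers are copied from this comment; the `shares` helper is a hypothetical convenience:

```python
# Counts copied from the analysis above
operation_counts = {"Add": 17298, "Remove": 372}
peer_groups = {"Add and Remove": 77, "Only Add": 305, "Only Remove": 26}


def shares(counts):
    """Return each entry's share of the total, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


operation_shares = shares(operation_counts)
peer_shares = shares(peer_groups)
```

The totals match the table sizes reported earlier: 17670 operations and 408 peers.
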

@drew2a
Contributor

drew2a commented Jan 16, 2024

To describe the current state of the KnowledgeCommunity, I will start with a description of the database.

Database

The full schema is available at
https://github.com/Tribler/tribler/blob/main/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py

It describes Statements that are linked (through a digital signature) to the Peer who created them. To anonymize the peer, a secondary key is used instead of their main public key, as shown here:

# secondary key:
secondary_private_key_path = config.state_dir / config.trustchain.secondary_key_filename
self.secondary_key = self.load_or_create(secondary_private_key_path)

In the KnowledgeGraph, a Statement is an edge with the following (simplified) structure:

@dataclass
class SimpleStatement:
    subject_type: ResourceType
    subject: str
    predicate: ResourceType
    object: str

Where ResourceType is

class ResourceType(IntEnum):
    """ Description of available resources within the Knowledge Graph.
    These types are also using as a predicate for the statements.
    Based on https://en.wikipedia.org/wiki/Dublin_Core
    """
    CONTRIBUTOR = 1
    COVERAGE = 2
    CREATOR = 3
    DATE = 4
    DESCRIPTION = 5
    FORMAT = 6
    IDENTIFIER = 7
    LANGUAGE = 8
    PUBLISHER = 9
    RELATION = 10
    RIGHTS = 11
    SOURCE = 12
    SUBJECT = 13
    TITLE = 14
    TYPE = 15
    # this is a section for extra types
    TAG = 101
    TORRENT = 102
    CONTENT_ITEM = 103

Statement examples:

SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash1', predicate=ResourceType.TAG, object='tag1')
SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash2', predicate=ResourceType.TAG, object='tag2')
SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash3', predicate=ResourceType.CONTENT_ITEM, object='content item')

Due to the inherent lack of trust in peers, we cannot simply replace an existing statement with a newly received one. Instead, we store all statement-peer pairs. This makes it possible to downvote or upvote all statements associated with a particular peer, depending on whether that peer becomes trusted or not. Currently, we lack a reliable trust function; the design anticipated that such a function would be developed later. This assumption has proven correct, as the design is compatible with MeritRank. Furthermore, all necessary data appears to be already present within the KnowledgeGraph, making it well suited for future integration with trust evaluation mechanisms.

There are two operations available for peers:

class Operation(IntEnum):
    """ Available types of statement operations."""
    ADD = 1  # +1 operation
    REMOVE = 2  # -1 operation

All operations are recorded in the database, which allows the final score of a specific statement to be calculated from the cumulative operations of all peers.

Currently, a simplistic approach is employed, which involves merely summing all the 'add' operations (+1) and subtracting the 'remove' operations (-1) across all peers. This method is intended to be replaced by a more sophisticated mechanism, the MeritRank merit function.

@property
def score(self):
    return self.added_count - self.removed_count
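
The simplistic scoring can also be sketched directly over the recorded operations. This version assumes that only the latest operation of each peer (by clock) counts toward a statement; the `Op` class and `statement_score` function are hypothetical and only approximate what the counters track:

```python
from dataclasses import dataclass
from enum import IntEnum


class Operation(IntEnum):
    ADD = 1     # +1 operation
    REMOVE = 2  # -1 operation


@dataclass(frozen=True)
class Op:  # hypothetical, reduced form of a stored operation on one statement
    peer: str
    operation: Operation
    clock: int


def statement_score(ops):
    """Sum +1 for ADD and -1 for REMOVE, counting only each peer's latest operation."""
    latest = {}
    for op in ops:
        current = latest.get(op.peer)
        if current is None or op.clock > current.clock:
            latest[op.peer] = op
    added = sum(1 for op in latest.values() if op.operation == Operation.ADD)
    removed = sum(1 for op in latest.values() if op.operation == Operation.REMOVE)
    return added - removed


ops = [
    Op("peer_a", Operation.ADD, clock=1),
    Op("peer_b", Operation.ADD, clock=1),
    Op("peer_c", Operation.REMOVE, clock=1),
    Op("peer_a", Operation.REMOVE, clock=2),  # peer_a later retracts its ADD
]
result = statement_score(ops)
```

A trust-aware replacement such as MeritRank would weight each peer's contribution instead of counting every peer as exactly one vote.
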

ER diagram

erDiagram
    Peer {
        int id PK "auto=True"
        bytes public_key "unique=True"
        datetime added_at "Optional, default=utcnow()"
    }

    Statement {
        int id PK "auto=True"
        int subject_id FK
        int object_id FK
        int added_count "default=0"
        int removed_count "default=0"
        int local_operation "Optional"
    }

    Resource {
        int id PK "auto=True"
        string name
        int type "ResourceType enum"
    }

    StatementOp {
        int id PK "auto=True"
        int statement_id FK
        int peer_id FK
        int operation
        int clock
        bytes signature
        datetime updated_at "default=utcnow()"
        bool auto_generated "default=False"
    }

    Misc {
        string name PK
        string value "Optional"
    }


    Statement }|--|| Resource : "subject_id"
    Statement }|--|| Resource : "object_id"
    StatementOp }|--|| Statement : "statement_id"
    StatementOp }|--|| Peer : "peer_id"

@drew2a
Contributor

drew2a commented Jan 17, 2024

The next chapter is dedicated to the community itself.

Community

https://github.com/Tribler/tribler/blob/main/src/tribler/core/components/knowledge/community/knowledge_community.py

The algorithm of the community's operation:

  1. Every 5 seconds, we request 10 StatementOperations from a random peer.

@dataclass
class StatementOperation:
    """Do not change the format of the StatementOperation, because this will result in an invalid signature.
    """
    subject_type: int  # ResourceType enum
    subject: str
    predicate: int  # ResourceType enum
    object: str
    operation: int  # Operation enum
    clock: int  # This is the lamport-like clock that unique for each quadruple {public_key, subject, predicate, object}
    creator_public_key: type_from_format('74s')

  2. Upon receiving a response, we verify the signatures of the operations and their validity.

def verify_signature(self, packed_message: bytes, key: Key, signature: bytes, operation: StatementOperation):
    if not self.crypto.is_valid_signature(key, packed_message, signature):
        raise InvalidSignature(f'Invalid signature for {operation}')

# Both functions are shown outside their class context. The inner validate_operation call
# is presumably a separate validator for the Operation enum value, not a recursive call.
def validate_operation(operation: StatementOperation):
    validate_resource(operation.subject)
    validate_resource(operation.object)
    validate_operation(operation.operation)
    validate_resource_type(operation.subject_type)
    validate_resource_type(operation.predicate)

  3. When StatementOperations are requested from us, we select N random operations (the number of operations is specified in the request) and return them.
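
The responder side of the steps above can be sketched as uniform random sampling. `select_operations` is a hypothetical helper and the operations are plain strings here; the real community reads them from the database:

```python
import random


def select_operations(stored_operations, requested_count, rng=random):
    """Return up to requested_count operations, chosen uniformly at random without repeats."""
    count = max(0, min(requested_count, len(stored_operations)))
    return rng.sample(stored_operations, count)


stored = [f"op-{i}" for i in range(25)]
batch = select_operations(stored, 10, random.Random(42))  # a peer asked for 10 operations
```

Clamping the requested count avoids the `ValueError` that `random.sample` raises when asked for more items than the population holds.
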

Autogenerated Knowledge

In addition to user-added knowledge statements, there are also auto-generated statements. The KnowledgeRulesProcessor generates them automatically: it analyzes the records in the database and derives knowledge from predefined regex patterns found in them.

For example, here is the definition of autogenerated tags:

# Each regex expression should contain just a single capturing group:
square_brackets_re = re.compile(r'\[([^\[\]]+)]')
parentheses_re = re.compile(r'\(([^()]+)\)')
extension_re = re.compile(r'\.(\w{3,4})$')
delimiter_re = re.compile(r'([^\s.,/|]+)')
general_rules: RulesList = [
    Rule(
        patterns=[
            square_brackets_re,  # extract content from square brackets
            delimiter_re  # divide content by "," or "." or " " or "/"
        ]),
    Rule(
        patterns=[
            parentheses_re,  # extract content from brackets
            delimiter_re  # divide content by "," or "." or " " or "/"
        ]),
    Rule(
        patterns=[
            extension_re  # extract an extension
        ]
    ),
]
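
To see what these patterns extract, the sketch below chains them over a made-up torrent title. The `apply_patterns` helper is an assumption about how a rule's patterns are applied in sequence (each pattern runs on the fragments produced by the previous one); it is not the actual KnowledgeRulesProcessor code:

```python
import re

square_brackets_re = re.compile(r'\[([^\[\]]+)]')
parentheses_re = re.compile(r'\(([^()]+)\)')
extension_re = re.compile(r'\.(\w{3,4})$')
delimiter_re = re.compile(r'([^\s.,/|]+)')


def apply_patterns(text, patterns):
    """Apply each pattern to the fragments captured by the previous pattern."""
    fragments = [text]
    for pattern in patterns:
        fragments = [match for fragment in fragments for match in pattern.findall(fragment)]
    return fragments


title = '[linux, iso] Ubuntu (desktop/amd64).iso'  # made-up example title
bracket_tags = apply_patterns(title, [square_brackets_re, delimiter_re])
paren_tags = apply_patterns(title, [parentheses_re, delimiter_re])
extension_tags = apply_patterns(title, [extension_re])
```
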

This is a definition of Ubuntu, Debian and Linux Mint content items.

space = r'[-\._\s]'
two_digit_version = r'(\d{1,2}(?:\.\d{1,2})?)'

def pattern(linux_distribution: str) -> Pattern:
    return re.compile(f'{linux_distribution}{space}*{two_digit_version}', flags=re.IGNORECASE)

content_items_rules: RulesList = [
    Rule(patterns=[pattern('ubuntu')],
         actions=[lambda s: f'Ubuntu {s}']),
    Rule(patterns=[pattern('debian')],
         actions=[lambda s: f'Debian {s}']),
    Rule(patterns=[re.compile(f'linux{space}*mint{space}*{two_digit_version}', flags=re.IGNORECASE)],
         actions=[lambda s: f'Linux Mint {s}']),
]
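
Applied to the two Ubuntu file names mentioned earlier in this thread, the pattern maps both torrents to the same content item, which is what allows them to be bundled. A small sketch reusing the definitions above; the `content_item` helper is hypothetical:

```python
import re
from typing import Optional

space = r'[-\._\s]'
two_digit_version = r'(\d{1,2}(?:\.\d{1,2})?)'


def pattern(linux_distribution: str) -> re.Pattern:
    return re.compile(f'{linux_distribution}{space}*{two_digit_version}', flags=re.IGNORECASE)


ubuntu_re = pattern('ubuntu')


def content_item(title: str) -> Optional[str]:
    """Return the Ubuntu content item for a title, or None if the title does not match."""
    match = ubuntu_re.search(title)
    return f'Ubuntu {match.group(1)}' if match else None


first = content_item('ubuntu-22.04-desktop-amd64.iso')
second = content_item('ubuntu-22.04.1-desktop-amd64.iso')
```

Note that the version group captures at most two numeric components, so the point release `.1` is dropped and both titles resolve to `Ubuntu 22.04`.
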

Auto-generation of knowledge occurs through two mechanisms:

  1. Traversing all database records in the background.
  2. Generating knowledge for newly created records (this includes records that came to us through remote search).

Auto-generated knowledge is not gossiped across the network.

@drew2a
Contributor

drew2a commented Jan 18, 2024

The third paragraph is dedicated to the UI.

UI

Three changes have been made to the UI:

  1. Elements for displaying tags have been added.

image

  2. A dialog for editing metadata has been introduced.

image

  3. A snippet has been added that groups elements by the CONTENT_ITEM field.

image

Also, a feature for searching by tags was added, but this feature hasn't been introduced to the users yet.

image

@qstokkink
Contributor

Tags have now been implemented and even documented (above). With that, this issue is complete.
