Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2437: Store tagged events in Room Account Data #2437

Open
wants to merge 3 commits into
base: old_master
Choose a base branch
from
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
170 changes: 170 additions & 0 deletions proposals/2437-tagged-events-account-data.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,170 @@
# MSC2437: Store tagged events in Room Account Data

We want here to let the users tag some room events.

The first use case would be to let him mark some events as favorites to keep a
reference on important messages or attachments. Clients would then be able to
handle a global bookmark of these favourite events, or display a list of them
at the room level.

A second use case would be to hide/ignore some events in order to prevent their
display in the room history.

The proposal provides a model to store the tagged events in the room account
data. The room account data has been preferred to the global account data in
order to remove/forget the potential tagged events when the user leaves a room.

## Proposal

Define `m.tagged_events` event type to store the tagged events of a room in
the room config. Clients will then set or get this account data content thanks
to the following APIs:

```
PUT /_matrix/client/r0/user/{userId}/rooms/{roomId}/account_data/m.tagged_events
GET /_matrix/client/r0/user/{userId}/rooms/{roomId}/account_data/m.tagged_events
```

The content of this event is a `tags` key whose value is an object mapping the
giomfo marked this conversation as resolved.
Show resolved Hide resolved
name of each tag to another object.
The JSON object associated with each tag is an object where the keys are the
event IDs and values give information about the events.
The event information are given under the following fields:

* `keywords` (`[string]`) - A list of words which should describe the event,
useful for searching or viewing favourite events without decrypting them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth being clear here that the words in the list are user-defined and not something the spec even intends to go into (I don't think it makes sense to spec some "pre-defined keywords").

* `origin_server_ts` (`integer`) - The timestamp in milliseconds on originating
homeserver when this event was sent, useful to sort chronologically the events
(without retrieving the full description of the event).
* `tagged_at` (`integer`) - The timestamp, in milliseconds, when the event was
tagged, useful to view the most recent tagged events

These fields are optional. The value may be an empty object if no fields are
defined.

The name of a tag MUST NOT exceed 255 bytes.

The tag namespace is defined as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for using the same tagging scheme as other tags :D

Reusability ftw


* The namespace `m.*` is reserved for tags defined in the Matrix specification.
Clients must ignore any tags in this namespace they don't understand.
* The namespace `u.*` is reserved for user-defined tags. The portion of the
string after the `u.` is defined to be the display name of this tag. No other
semantics should be inferred from tags in this namespace.
* A client or app willing to use special tags for advanced functionality
should namespace them similarly to state keys: `tld.name.*`
* Any tag in the tld.name.* form but not matching the namespace of the current
client should be ignored.
* Any tag not matching the above rules should be ignored.

Two special names are listed in the specification: The following tags are
defined in the `m.*` namespace:

* `m.favourite`: The user's favourite events in the room.
* `m.hidden`: The events that the user wants to hide from the room history.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make more sense to attach m.hidden to the event itself, so that it is only fetched if the event is? There is no sense transferring a large number of hidden tags on every update.


Example of content:

```
{
"tags": {
"m.favourite": {
"$143273582443PhrSn:example.org": {
"keywords": [
"pets",
"cat"
],
"origin_server_ts": 1432735824653,
"tagged_at": 1432736124573
},
"$768903582234ttROC:example.org": {
"keywords": [
"pets",
"dog"
],
"origin_server_ts": 1432735825701,
"tagged_at": 1432735923671
},
"$7623459801236vBDSF:example.org": {
"origin_server_ts": 1432736234980,
"tagged_at": 1432736512898
}
},
"m.hidden": {
"$123765582441goFrt:example.org": {
"keywords": [
"out of topic"
],
"origin_server_ts": 1432735824700,
"tagged_at": 1432735825123
},
"$619203608012ttAZw:example.org": {},
"$423567908022kJert:example.org": {},
"$456765582441QsXCv:example.org": {
"keywords": [
"spamming"
]
}
},
"u.personal": {
"$903456781253Hhjkl:example.org": {
"keywords": [
"vacation",
"London"
],
"origin_server_ts": 1432735824667,
"tagged_at": 1432735857890
}
}
}
}
```

The benefits to provide the `origin_server_ts` for the favourite events is to
let clients filter/sort them without having to retrieve the full content of
the events. Thanks to this timestamps, clients may ignore the favourite events
which are outside the potential room retention period, or display them as
expired events.
Clients must be ready to handle the case where a favourite event has been
redacted.

About the hidden events, clients must hide these events from the room history.
The event information don't seem very useful. We have provided some of them as
example.

## Limitation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tags and all tag types are retrieved at once. For large numbers of tags it may make sense to load only specific types of tags or load them incrementally. For example if I have been using Matrix for a decade I don't want to repeatedly transfer 1000 "remembered" messages or similar. This could become a real issue with custom tags as the user may be using them to categorize incoming messages and the number of tags may grow very quickly.

I think that it is probably best to use a different storage system that can optimize common use cases.

  1. Tagging a message should be O(1).
  2. Untagging a message should be O(1).
  3. Fetching the most recent n tags of a given type should be O(n)
  4. Fetching messages with a given tag in the room should be O(n).
  5. Syncing should be O(added+removed).

Currently all of these are O(total number of tags) which seems unacceptable. The exception is syncing when no tags have been changed in which case it is O(1).

Copy link
Contributor

@bwindels bwindels Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more ideas to address the race, somewhat applicable to account data in general:

Encode the event id in the account data type like m.tagged_events.<event id>. We don't currently have the habit of encoding ids in the type though and likely need to escape . in the event id. We'd also need to extend the GET /account_data/ endpoint to accept a match by prefix rather than exact match for the type so you can list m.tagged_events.*.

Another way would be to do something akin to state events, add some key property which you can then specify when requesting GET /account_data/<type>/<key>. For m.tagged_events the key would be the event id. For m.direct it could be the user id of the DM target.

Given that clients have to encode the complete set of tagged events to add,
remove or modify one entry, it might happen that different clients overwrite
each other modifications to the tagged events.

This is a known limitation for any new event type added to the `account_data`
section of a room, or to the top-level `account_data`. For exemple, we already
giomfo marked this conversation as resolved.
Show resolved Hide resolved
suffer from it to handle the `m.direct` event type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A leftover comment from the previous review - there is an important difference between m.direct and this kind of data:

  • m.direct is defined in global account data, which greatly increases the race scope.
  • m.direct can be updated by running clients with no user interaction - e.g. when a DM is joined from another side, a client may set m.direct at exactly the same time the inviter updated m.direct for another room - which make the race condition uncontrollable by the user.

In contrast, the tagged events here are defined on the room level; and all changes to them originate from the user interaction (at least that's my impression from the MSC). This means that one user has to change the list on two different clients at once, which is certainly possible but not very practical; and the mitigation for that boils down to "just don't edit your tagged events on two phones at the same second".

This is why I think that the existing generic account_data API should usually be not too problematic for the purpose, and the robust list data management API is only a nice to have addition.


This limitation has been addressed for only one case: the room tagging. Indeed
a set of endpoints have been defined for the corresponding event type `m.tag`.
Here is the existing APIs for this feature:

```
GET /_matrix/client/r0/user/{userId}/rooms/{roomId}/tags
PUT /_matrix/client/r0/user/{userId}/rooms/{roomId}/tags/{tag}
DELETE /_matrix/client/r0/user/{userId}/rooms/{roomId}/tags/{tag}
```

Should we consider this approach for any new event type?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A guiding principle here is whether or not the schema is complicated enough to need an API on top of it. Something like pinned events doesn't get its own API because it's ~3 lines of code in most languages to do the operation, and the risk of overwriting other people's changes is largely a social problem and not a technical one (tell your friend to stop doing things so you can do it instead).

For this particular API, it's a complicated operation to add/remove things to the account data so a set of endpoints could be useful. The overlapping writes will still be a problem, but at least we can make it easier on client authors.

I'd suggest something like the following:

GET /_matrix/client/r0/user/{userId}/rooms/{roomId}/tagged_events
PUT /_matrix/client/r0/user/{userId}/rooms/{roomId}/tagged_events/{eventId}
  Takes a JSON body of {"tag": "m.favourite", "keywords": []} - the server can work out the rest of the fields.
DELETE /_matrix/client/r0/user/{userId}/rooms/{roomId}/tagged_events/{eventId}
DELETE /_matrix/client/r0/user/{userId}/rooms/{roomId}
   For easier implementations of "clear all tagged events"

There would be an open question of whether or not we need an API to get all events tagged as a certain key, but I think the /tagged_events API proposed here would be fine enough for clients to locally filter.

Another suggestion has been provided during the review: We could provide
account data endpoints to append and remove values to an array in a specific
account data value, something like:

```
PUT /_matrix/client/r0/user/{userId}/rooms/{roomId}/account_data/<type>/append
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RESTful approach would bring something like:

Suggested change
PUT /_matrix/client/r0/user/{userId}/rooms/{roomId}/account_data/<type>/append
POST /_matrix/client/r0/user/{userId}/rooms/{roomId}/account_data/<type>

DELETE /_matrix/client/r0/user/{userId}/rooms/{roomId}/account_data/<type>/index/4
```

For the `DELETE` method, you'd need some kind of `ETag` value to pass as an
`If-Match` header so avoid deleting an indexed that is out of date because of
other clients deleting an index as well. Perhaps the server could add some kind
of unique tag to the account data value when it is updated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think deleting by index is generally a good idea, exactly because it's very prone to races. Some kind of a locally-unique identifier would be much better here (and then, again pretty much in line with REST principles it could be DELETE /.../account_data/<type>/<element_id>.

I personally like this approach more than devising API endpoints for particular account data types. In particular, it's very well extensible to custom account data (and it's not too much of a hassle to manage where these generic endpoints can/cannot be used in m. namespace). I believe that it would be a great subject for a separate MSC though.


We mentioned here the discussion about this limitation to keep it in our mind.
This should be the subject of another proposal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - can I have an unstable prefix for this?