
Support Elasticsearch/OpenSearch for user search #14608

Open

babolivier opened this issue Dec 2, 2022 · 2 comments
Labels: A-User-Directory, O-Occasional, S-Minor, T-Enhancement

Comments

babolivier (Contributor) commented Dec 2, 2022

Preamble

Synapse's user search feature has a few long-standing known shortcomings when searching for display names, namely that matching is sensitive to case and diacritics, and that word boundaries in non-latin languages are not handled well.

Addressing these issues is non-trivial with PostgreSQL's full text search capabilities. In this writeup I am exploring integrating Elasticsearch, which is a full text search engine backed by Apache Lucene, into Synapse.

Note that mentions of Elasticsearch in this writeup also include OpenSearch, which is AWS's fork of Elasticsearch and, as far as I can tell, exposes APIs compatible with Elasticsearch's (at least for the features that are relevant to us).

Also note that I am focusing specifically on user search, and am not including message search to avoid scope creep.

Indexing

Elasticsearch's equivalent of an SQL table is called an index. An index contains a number of documents, which are freeform JSON blobs. In the context of user search, this is where we would store user profiles. This is an example of a document in an Elasticsearch index (as returned by Elasticsearch when retrieving it):

{
    "_index": "synapse_user_search_4",
    "_id": "@foo:bar.baz",
    "_version": 1,
    "_seq_no": 5,
    "_primary_term": 5,
    "found": true,
    "_source": {
        "user_id": "@foo:bar.baz",
        "display_name": "Léonard Foo",
        "avatar_url": null
    }
}

Here, we're mostly interested in two properties:

  • _id: the document's identifier. A document can be created without an identifier, in which case Elasticsearch automatically generates one for it.
  • _source: the document's data.

In this example, I've used the user's MXID as the document's ID (so that we can easily update it in the future), and the structure of profiles as they are returned by the /user_directory/search endpoint. In practice we'll probably want to remove the user_id from the document's source, in order to avoid duplicating data (it's already stored as the document's _id).
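For reference, a single document like the one above can be fetched directly by its ID (the MXID in this scheme), which is the kind of request that produces the response shown earlier:

GET /synapse_user_search_4/_doc/@foo:bar.baz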

Analysis

We want our greatly improved user search to be:

  • case insensitive
  • diacritics insensitive

For this, we need to create the Elasticsearch index and configure it with an analyzer. The analyzer is in charge of looking at every new piece of data and tokenising it (while also retaining the document's source) so it's easily searchable later on.

We create our index with a custom analyzer that is both case- and diacritics-insensitive:

PUT /synapse_user_directory_search

{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_asciifolding": {
          "tokenizer": "standard",
          "filter": [ "asciifolding", "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "display_name": {
        "type": "text",
        "analyzer": "std_asciifolding"
      }
    }
  }
}

There are a few things happening here:

  • in settings.analysis.analyzer, we define a custom analyzer on the index. This analyzer includes two token filters: asciifolding, which folds non-ASCII characters in tokens to equivalent ASCII characters (thus eliminating accents), and lowercase, which forces tokens into lower case.
  • in mappings.properties, we assign our custom analyzer to the display_name field, since this is where we might have diacritics and case variations.
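
As a quick sanity check, Elasticsearch's _analyze API can show how our custom analyzer tokenises a given display name (exact output may vary slightly between Elasticsearch versions):

GET /synapse_user_directory_search/_analyze

{
  "analyzer": "std_asciifolding",
  "text": "Léonard Foo"
}

With the filters above, this should produce the tokens leonard and foo, which is what later allows a query for "leonard" (or "LEONARD") to match the original display name.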

Document insertion and update

Once the index is created, we can start adding documents to it:

PUT /synapse_user_directory_search/_doc/@foo:bar.baz

{
    "user_id": "@foo:bar.baz",
    "display_name": "Léo Foo",
    "avatar_url": null
}

Note that the request for updating an existing document is identical to the one above. When a document is updated, its _version property is automatically incremented.
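
If we only want to touch a single field (for instance when a user changes their display name but keeps their avatar), Elasticsearch also offers a partial update endpoint, which avoids resending the whole profile. A sketch, reusing the example document above with a hypothetical new display name:

POST /synapse_user_directory_search/_update/@foo:bar.baz

{
    "doc": {
        "display_name": "Léo F."
    }
}

This also increments the document's _version property, just like a full PUT does.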

Search

Now we can search for users:

GET /synapse_user_directory_search/_search

{
    "query": {
        "match_phrase_prefix": {
            "display_name": {
                "query": "leo"
            }
        }
    }
}

(yes, this is a GET request with a body)

We use a match_phrase_prefix query to ensure matching starts at the beginning of a sequence of tokens, rather than in the middle of a token.

Results are then provided in the following format:

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "synapse_user_search_5",
                "_id": "@foo:bar.baz",
                "_score": 0.2876821,
                "_source": {
                    "user_id": "@foo:bar.baz",
                    "display_name": "Léo Foo",
                    "avatar_url": null
                }
            }
        ]
    }
}

Each hit carries a score in its _score property, which is used to sort results.

With the query previously mentioned, the same score will be attributed to every result. We will probably want to use a more elaborate query, such as a boolean compound query, which would allow attributing a higher score to exact matches (see an example here). We will also probably want to tweak the query so that it also matches on MXIDs.
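
As a rough sketch of what such a compound query could look like (the boost value is arbitrary and would need tuning, and this assumes user_id is left to Elasticsearch's default dynamic mapping, which exposes a user_id.keyword sub-field; it also assumes the search term has already been turned into an MXID prefix such as @leo):

GET /synapse_user_directory_search/_search

{
    "query": {
        "bool": {
            "should": [
                { "match": { "display_name": { "query": "leo", "boost": 2 } } },
                { "match_phrase_prefix": { "display_name": { "query": "leo" } } },
                { "prefix": { "user_id.keyword": { "value": "@leo" } } }
            ]
        }
    }
}

Here an exact match on the display name scores higher than a mere prefix match, and the prefix clause on user_id.keyword lets the same request also match MXIDs.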

Integration

First off, we will probably want to make this integration optional. While there are valid use cases for requiring better user search results than the ones provided by PostgreSQL's full text search support, PostgreSQL is also good enough for most servers catering to a community that mostly uses latin-based languages. Requiring those servers to run Elasticsearch for user search would be an unnecessary burden, both in terms of resources and, if the Elasticsearch cluster is self-hosted, in maintenance.

Technically, integrating Elasticsearch into Synapse would mean writing our own interaction layer. Elastic does provide an official Python module, which even has async support; however, this async support uses aiohttp for transport. I don't know whether aiohttp is even compatible with Twisted, and I assume we would want to use Twisted agents to perform requests. There have been efforts in the past to write an Elasticsearch module for Twisted-based applications (txes2), but it looks severely out of date and unmaintained (and is still incompatible with Python 3).

Writing our own integration for Elasticsearch should not be very complex, however, since as demonstrated above we would only need to perform a couple of types of HTTP requests (creating/updating documents, and searching).

Migration

How to migrate user search data from PostgreSQL to Elasticsearch and back is an area which needs further research. It will likely need a script run manually by the server admin, similar to the SQLite -> PostgreSQL migration script we currently have. There might be a way to make this migration incremental to ease the pain on servers with a very large number of users, but I'm not entirely sure.
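
Whatever shape the migration script takes, it would presumably read profiles out of the user_directory table in batches and push each batch to Elasticsearch with the _bulk API, which accepts many index operations in a single request (the profiles below are hypothetical, mirroring the earlier example):

POST /_bulk

{ "index": { "_index": "synapse_user_directory_search", "_id": "@foo:bar.baz" } }
{ "user_id": "@foo:bar.baz", "display_name": "Léo Foo", "avatar_url": null }
{ "index": { "_index": "synapse_user_directory_search", "_id": "@alice:bar.baz" } }
{ "user_id": "@alice:bar.baz", "display_name": "Alice", "avatar_url": null }

Batching like this is also what could make the migration incremental or resumable, since each batch is an independent request.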

Conclusion

Supporting Elasticsearch in Synapse looks like a pretty big amount of work, but I think it's also work that is worth putting in to enable more communities around the world to adopt Matrix. In my manual testing, most issues with user search that are caused by PostgreSQL's full text search engine seem to be resolved with Elasticsearch, apart from one edge case which I believe to be acceptable.

It is also worth considering that, once in place, we might also want to use Elasticsearch to handle message search, which has similar issues to user search.

To be clear, I am not claiming this work should be the team's highest priority by opening this issue - I mostly wanted to compile and share the findings from spending a limited amount of time researching options to improve user search.

babolivier added the A-User-Directory, S-Minor, T-Enhancement and O-Occasional labels on Dec 2, 2022
DMRobertson (Contributor) commented:

I'd be interested to know how Lucene/ES compares to the other approaches that Sean describes in #13655 (comment)

babolivier (Contributor, Author) commented:

> I'd be interested to know how Lucene/ES compares to the other approaches that Sean describes in #13655 (comment)

I've also explored one of those - ICU - in #14464. It improves support for non-latin languages to some extent, in that word boundaries are correctly detected, though as far as I can tell it does not fix the other issues, such as diacritics sensitivity or case sensitivity in non-latin languages.

Since I don't think ES support will happen in the very near future, I'm going to try to land #14464, as it's still a noticeable improvement over what we currently have.
