This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Reduce replication overhead #8283

Closed
clokep opened this issue Sep 8, 2020 · 8 comments
Labels
A-Performance (Performance, both client-facing and admin-facing), A-Workers (Problems related to running Synapse in Worker Mode (or replication)), T-Task (Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks)

Comments

@clokep
Member

clokep commented Sep 8, 2020

From some profiling of matrix.org, it seems that the overhead of replication (using Redis) is ~20–25% CPU per process. This figure was gathered by profiling (via py-spy) a worker that had no traffic going to it.

We should investigate what's going on here!

Spun out of #7670

@clokep clokep self-assigned this Sep 8, 2020
@clokep
Member Author

clokep commented Sep 8, 2020

The main entry points being discussed here are:

  • portions of GenericWorkerReplicationHandler._process_and_notify
    • Generally this is telling the notifier about new events.
    • It looks like a noticeable amount of time is spent here iterating the rows to pull out particular fields.
  • portions of ReplicationDataHandler.on_rdata (sketched below, after this list), which corresponds to:
    • Calling process_replication_rows on the store (this recursively calls the same method on quite a few of the stores, which makes the stack a bit odd to follow).
    • Processing replication of typing updates (I think).
    • Telling the notifier about new events.
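To make that dispatch path easier to follow, here is a heavily simplified sketch of its shape. The method names mirror the ones mentioned above (on_rdata, process_replication_rows, on_new_event), but the class structure and bodies are illustrative only, not Synapse's actual implementation:

```python
# Illustrative sketch of the replication dispatch path discussed above.
# This is NOT Synapse's actual code; it only shows the rough shape of the
# work that shows up in the profiles.


class ReplicationDataHandler:
    def __init__(self, store, typing_handler, notifier):
        self.store = store
        self.typing_handler = typing_handler
        self.notifier = notifier

    async def on_rdata(self, stream_name, instance_name, token, rows):
        # 1. Let the store update its in-memory caches for the new rows. In
        #    Synapse this fans out across many store classes (each overriding
        #    process_replication_rows and calling super()), which is why the
        #    stack is awkward to follow in py-spy.
        self.store.process_replication_rows(stream_name, instance_name, token, rows)

        # 2. Stream-specific handling, e.g. typing notifications.
        if stream_name == "typing":
            self.typing_handler.process_replication_rows(token, rows)

        # 3. Wake up anything waiting on the notifier; building these
        #    per-batch lists of row fields is part of the cost noted above.
        if stream_name == "events":
            self.notifier.on_new_event(
                "room_key", token, rooms=[row.room_id for row in rows]
            )
```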

Overall, a decent amount of time is spent in Notifier.on_new_event.

Unfortunately, the above callbacks generally spawn background processes, which makes it hard to see the amount of processing in each.

This was from the profile at profile.svg.zip (and similar related profiles), captured with: `py-spy record -o profile.svg --pid $PID --gil --native --duration 60`.

@clokep
Member Author

clokep commented Sep 8, 2020

@erikjohnston Curious if you have any thoughts on this?

A few avenues that might be worth investigating:

  • Can we completely avoid Notifier.on_new_event in some situations? (For example, if no callbacks are registered? Should those callbacks be registered on all workers, as they are now?)
  • We could likely save some time by lazily evaluating the inputs to on_new_event, e.g. using things like map(lambda row: row.room_id, rows) instead of [row.room_id for row in rows]. (A rough illustration of both ideas follows below.)
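For concreteness, here is a rough illustration of both ideas. The attribute names (room_to_listeners, etc.) and the call site at the bottom are hypothetical, invented for the example rather than taken from Synapse's real Notifier:

```python
# Hypothetical illustration of the two avenues above; attribute and method
# names are invented for the example and do not match Synapse's Notifier.


class Notifier:
    def __init__(self):
        self.room_to_listeners = {}  # room_id -> set of waiting streams
        self.user_to_listeners = {}  # user_id -> set of waiting streams

    def on_new_event(self, stream_key, new_token, users=(), rooms=()):
        # Avenue 1: bail out early if nothing on this worker is listening,
        # so we never pay for iterating the rows at all.
        if not self.room_to_listeners and not self.user_to_listeners:
            return

        for room_id in rooms:
            for listener in self.room_to_listeners.get(room_id, ()):
                listener.notify(stream_key, new_token)


# Avenue 2: pass the row fields lazily instead of materialising a list up
# front, so they are only computed if on_new_event actually iterates them:
#
#     notifier.on_new_event(
#         "room_key", token, rooms=map(lambda row: row.room_id, rows)
#     )
```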

@clokep
Member Author

clokep commented Sep 8, 2020

> We could likely save some time by lazily evaluating the inputs to on_new_event, e.g. using things like map(lambda row: row.room_id, rows) instead of [row.room_id for row in rows].

I did some quick benchmarking of this and it seems that my assumption here is wrong: the list comprehension is faster than using map, although both are so fast it likely doesn't matter. 😄

Benchmark script:

```python
import timeit

# number of runs over the data for each repeat
N_RUNS = 20

# number of times to repeat the runs. We'll take the fastest run.
N_REPEATS = 5


NO_DATA = []
SM_DATA = range(10)
LG_DATA = range(10000)


def func(it):
    # Consume the iterable and sum it, forcing full evaluation of `it`.
    result = 0
    for d in it:
        result += d
    return result


def lister(data):
    # Eager: build a full list via a comprehension, then consume it.
    func([d for d in data])


def mapper(data):
    # Lazy: wrap the data in a map() iterator, then consume it.
    func(map(lambda d: d, data))


for f in [lister, mapper]:
    print('Running {} benchmarks...'.format(f.__name__))

    for d in [NO_DATA, SM_DATA, LG_DATA]:
        times = timeit.repeat(lambda: f(d), number=N_RUNS, repeat=N_REPEATS)
        best = min(times)

        print('      For %d items: %i loops, best of %i: %f sec per loop' % (
            len(d), N_RUNS, len(times), best / N_RUNS,
        ))
```

Benchmark results:

```
Running lister benchmarks...
      For 0 items: 20 loops, best of 5: 0.000000 sec per loop
      For 10 items: 20 loops, best of 5: 0.000001 sec per loop
      For 10000 items: 20 loops, best of 5: 0.000672 sec per loop
Running mapper benchmarks...
      For 0 items: 20 loops, best of 5: 0.000000 sec per loop
      For 10 items: 20 loops, best of 5: 0.000001 sec per loop
      For 10000 items: 20 loops, best of 5: 0.000849 sec per loop
```

@clokep clokep removed their assignment Sep 17, 2020
@clokep clokep added A-Performance Performance, both client-facing and admin-facing A-Workers Problems related to running Synapse in Worker Mode (or replication) labels Sep 17, 2020
@richvdh
Member

richvdh commented May 13, 2021

I wonder how much of this is still relevant. Much of the replication handling code has been refactored in the last few months.

/cc @erikjohnston

@clokep
Member Author

clokep commented May 13, 2021

> I wonder how much of this is still relevant. Much of the replication handling code has been refactored in the last few months.

This was based on looking at py-spy traces where we were spending a lot of time in replication. I don't think any of the recent changes have affected the behavior I saw above (it was pretty baked into our architecture), but would love to be wrong! 🎉

I believe I captured the above by profiling the dummy worker.

@erikjohnston
Member

I think we shuffled things around in this area a bit, but I imagine it's still an issue. The test worker is idling at ~25% CPU.

@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label May 20, 2021
@richvdh
Member

richvdh commented Jun 8, 2022

Possibly we can just close this in favour of #12461?

@clokep
Member Author

clokep commented Jun 8, 2022

That seems reasonable to me -- the other issue is more actionable.

@clokep clokep closed this as completed Jun 8, 2022