This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Reduce replication overhead #8283

Closed
clokep opened this issue Sep 8, 2020 · 8 comments
Labels
A-Performance (Performance, both client-facing and admin-facing), A-Workers (Problems related to running Synapse in Worker Mode (or replication)), T-Task (Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks)

Comments

@clokep
Member

clokep commented Sep 8, 2020

From some profiling of matrix.org, it seems that the overhead of replication (using Redis) is ~20–25% CPU per process. This figure was gathered by profiling (via py-spy) a worker that had no traffic going to it.

We should investigate what's going on here!

Spun out of #7670

@clokep clokep self-assigned this Sep 8, 2020
@clokep
Member Author

clokep commented Sep 8, 2020

The main entry points being discussed here are:

  • portions of GenericWorkerReplicationHandler._process_and_notify
    • Generally this is telling the notifier about new events.
    • It looks like a noticeable amount of time is spent here iterating the rows to pull out particular fields.
  • portions of ReplicationDataHandler.on_rdata (sketched below, after this list), which corresponds to:
    • Calling process_replication_rows on the store (this recursively calls the same method on quite a few of the stores, which makes the stack a bit odd to follow).
    • Processing replication of typing updates (I think).
    • Telling the notifier about new events.
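To make that dispatch path easier to follow, here is a heavily simplified sketch of its shape. The method names mirror the ones mentioned above (on_rdata, process_replication_rows, on_new_event), but the class structure and bodies are illustrative only, not Synapse's actual implementation:

```python
# Illustrative sketch of the replication dispatch path discussed above.
# This is NOT Synapse's actual code; it only shows the rough shape of the
# work that shows up in the profiles.


class ReplicationDataHandler:
    def __init__(self, store, typing_handler, notifier):
        self.store = store
        self.typing_handler = typing_handler
        self.notifier = notifier

    async def on_rdata(self, stream_name, instance_name, token, rows):
        # 1. Let the store update its in-memory caches for the new rows. In
        #    Synapse this fans out across many store classes (each overriding
        #    process_replication_rows and calling super()), which is why the
        #    stack is awkward to follow in py-spy.
        self.store.process_replication_rows(stream_name, instance_name, token, rows)

        # 2. Stream-specific handling, e.g. typing notifications.
        if stream_name == "typing":
            self.typing_handler.process_replication_rows(token, rows)

        # 3. Wake up anything waiting on the notifier; building these
        #    per-batch lists of row fields is part of the cost noted above.
        if stream_name == "events":
            self.notifier.on_new_event(
                "room_key", token, rooms=[row.room_id for row in rows]
            )
```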

Overall, a decent amount of time is spent in Notifier.on_new_event.

Unfortunately, the above callbacks generally spawn background processes, which makes it hard to see the amount of processing in each.

This was from the profile at profile.svg.zip (and similar related profiles), captured with: `py-spy record -o profile.svg --pid $PID --gil --native --duration 60`.

@clokep
Member Author

clokep commented Sep 8, 2020

@erikjohnston Curious if you have any thoughts on this?

A few avenues that might be worth investigating:

  • Can we completely avoid Notifier.on_new_event in some situations? (For example, if no callbacks are registered? Should those callbacks be registered on all workers, as they are now?)
  • We could likely save some time by lazily evaluating the inputs to on_new_event, e.g. using things like map(lambda row: row.room_id, rows) instead of [row.room_id for row in rows]. (A rough illustration of both ideas follows below.)
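For concreteness, here is a rough illustration of both ideas. The attribute names (room_to_listeners, etc.) and the call site at the bottom are hypothetical, invented for the example rather than taken from Synapse's real Notifier:

```python
# Hypothetical illustration of the two avenues above; attribute and method
# names are invented for the example and do not match Synapse's Notifier.


class Notifier:
    def __init__(self):
        self.room_to_listeners = {}  # room_id -> set of waiting streams
        self.user_to_listeners = {}  # user_id -> set of waiting streams

    def on_new_event(self, stream_key, new_token, users=(), rooms=()):
        # Avenue 1: bail out early if nothing on this worker is listening,
        # so we never pay for iterating the rows at all.
        if not self.room_to_listeners and not self.user_to_listeners:
            return

        for room_id in rooms:
            for listener in self.room_to_listeners.get(room_id, ()):
                listener.notify(stream_key, new_token)


# Avenue 2: pass the row fields lazily instead of materialising a list up
# front, so they are only computed if on_new_event actually iterates them:
#
#     notifier.on_new_event(
#         "room_key", token, rooms=map(lambda row: row.room_id, rows)
#     )
```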

@clokep
Member Author

clokep commented Sep 8, 2020

> We could likely save some time by lazily evaluating the inputs to on_new_event, e.g. using things like map(lambda row: row.room_id, rows) instead of [row.room_id for row in rows].

I did some quick benchmarking of this and it seems that my assumption here is wrong: the list comprehension is faster than using map, although both are so fast it likely doesn't matter. 😄

Benchmark script:

```python
import timeit

# number of runs over the data for each repeat
N_RUNS = 20

# number of times to repeat the runs. We'll take the fastest run.
N_REPEATS = 5


NO_DATA = []
SM_DATA = range(10)
LG_DATA = range(10000)


def func(it):
    # Consume the iterable and sum it, forcing full evaluation of `it`.
    result = 0
    for d in it:
        result += d
    return result


def lister(data):
    # Eager: build a full list via a comprehension, then consume it.
    func([d for d in data])


def mapper(data):
    # Lazy: wrap the data in a map() iterator, then consume it.
    func(map(lambda d: d, data))


for f in [lister, mapper]:
    print('Running {} benchmarks...'.format(f.__name__))

    for d in [NO_DATA, SM_DATA, LG_DATA]:
        times = timeit.repeat(lambda: f(d), number=N_RUNS, repeat=N_REPEATS)
        best = min(times)

        print('      For %d items: %i loops, best of %i: %f sec per loop' % (
            len(d), N_RUNS, len(times), best / N_RUNS,
        ))
```

Benchmark results:

```
Running lister benchmarks...
      For 0 items: 20 loops, best of 5: 0.000000 sec per loop
      For 10 items: 20 loops, best of 5: 0.000001 sec per loop
      For 10000 items: 20 loops, best of 5: 0.000672 sec per loop
Running mapper benchmarks...
      For 0 items: 20 loops, best of 5: 0.000000 sec per loop
      For 10 items: 20 loops, best of 5: 0.000001 sec per loop
      For 10000 items: 20 loops, best of 5: 0.000849 sec per loop
```

@clokep clokep removed their assignment Sep 17, 2020
@clokep clokep added A-Performance Performance, both client-facing and admin-facing A-Workers Problems related to running Synapse in Worker Mode (or replication) labels Sep 17, 2020
@richvdh
Member

richvdh commented May 13, 2021

I wonder how much of this is still relevant. Much of the replication handling code has been refactored in the last few months.

/cc @erikjohnston

@clokep
Member Author

clokep commented May 13, 2021

> I wonder how much of this is still relevant. Much of the replication handling code has been refactored in the last few months.

This was based on looking at py-spy traces where we were spending a lot of time in replication. I don't think any of the recent changes have affected the behavior I saw above (it was pretty baked into our architecture), but would love to be wrong! 🎉

I believe I captured the above by profiling the dummy worker.

@erikjohnston
Member

I think we shuffled things around in this area a bit, but I imagine it's still an issue. The test worker is idling at ~25% CPU.

@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label May 20, 2021
@richvdh
Member

richvdh commented Jun 8, 2022

Possibly we can just close this in favour of #12461?

@clokep
Member Author

clokep commented Jun 8, 2022

That seems reasonable to me -- the other issue is more actionable.

@clokep clokep closed this as completed Jun 8, 2022