Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize incrementing values in DirectFileStore adapter #280

Merged
merged 1 commit into from
Mar 6, 2023

Conversation

splattael
Copy link
Contributor

@splattael splattael commented Feb 28, 2023

  • Introduce FileMappedDict#increment_value
  • Check for internal_storage only once (Process.pid is slow on Linux)
  • Lookup file position only once

Benchmarks

TLDR; Speed up counter and histogram by ~6-7%.

Warming up --------------------------------------
    counter perf-inc    22.380k i/100ms
Calculating -------------------------------------
    counter perf-inc    224.104k (± 0.8%) i/s -      1.141M in   5.093434s

Comparison:
    counter perf-inc:   224103.9 i/s
        counter main:   212032.1 i/s - 1.06x  slower

Warming up --------------------------------------
      gauge perf-inc    29.079k i/100ms
Calculating -------------------------------------
      gauge perf-inc    289.211k (± 1.8%) i/s -      1.454M in   5.029133s

Comparison:
          gauge main:   292329.3 i/s
      gauge perf-inc:   289210.9 i/s - same-ish: difference falls within error

Warming up --------------------------------------
  histogram perf-inc     8.522k i/100ms
Calculating -------------------------------------
  histogram perf-inc     84.796k (± 2.8%) i/s -    426.100k in   5.029627s

Comparison:
  histogram perf-inc:    84795.9 i/s
      histogram main:    80946.5 i/s - same-ish: difference falls within error
# frozen_string_literal: true

require 'prometheus/client'
require 'prometheus/client/formats/text.rb'
require 'prometheus/client/data_stores/direct_file_store.rb'
require 'pp'
require 'benchmark/ips'

FileUtils.rm_rf "/tmp/prometheus_benchmark"
FileUtils.mkdir_p "/tmp/prometheus_benchmark"

Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(
    dir: "/tmp/prometheus_benchmark"
  )

# pp Prometheus::Client.config

prometheus = Prometheus::Client.registry

counter = Prometheus::Client::Counter.new(:mycounter,
  docstring: 'Example counter',
  labels: [:foo])

gauge = Prometheus::Client::Gauge.new(:mygauge,
  docstring: 'Example gauge',
  labels: [:foo],
  store_settings: { aggregation: :sum })

histogram = Prometheus::Client::Histogram.new(:myhistogram,
  docstring: 'Example histogram',
  labels: [:foo],
  buckets: [0, 1, 2])

prometheus.register(counter)
prometheus.register(gauge)
prometheus.register(histogram)

labels = {'foo': 'bar'}

branch = `git rev-parse --abbrev-ref HEAD`.chomp

Benchmark.ips do |x|
  x.report("counter #{branch}") do
    counter.increment(by: 1, labels: labels)
  end

  x.compare!
  x.save! 'counter'
end

Benchmark.ips do |x|
  x.report("gauge #{branch}") do
    gauge.set(1, labels: labels)
  end

  x.compare!
  x.save! 'gauge'
end

Benchmark.ips do |x|
  x.report("histogram #{branch}") do
    histogram.observe(1, labels: labels)
  end

  x.compare!
  x.save! 'histogram'
end

* Introduce FileMappedDict#increment_value
* Check for internal_storage only once (Process.pid is slow on Linux)
* Lookup file position only once

Signed-off-by: Peter Leitzen <peter@leitzen.de>
Copy link
Member

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@dmagliola
Copy link
Collaborator

I'm not opposed to this change, but:

  • I'm not sure it actually makes a measurable performance difference, two out of 3 benchmarks report no change. Note that in my experience, testing performance of the disk store is very hard to get meaningful results. We used to use a cloudvm to get realistic / repeatable results, testing locally was all over the place.
  • sounds like what's you're trying to avoid is the double call to internal_store. You could avoid that by assigning the result of internal_store to a local var, then you don't need the new method in FileMappedDict.
  • avoiding the double lookup in the positions hash should be below measurability. Especially when compared to writing to the file...

@dmagliola
Copy link
Collaborator

dmagliola commented Feb 28, 2023

Keeping in mind my comment that this may or may not be a greatly realistic test, I'm getting this on my Macbook M2 Air:

Warming up --------------------------------------
        counter main    26.202k i/100ms
Calculating -------------------------------------
        counter main    261.357k (± 0.4%) i/s -      1.310M in   5.012766s

Comparison:
    counter perf-inc:   265807.5 i/s
        counter main:   261356.7 i/s - 1.02x  slower

Warming up --------------------------------------
          gauge main    37.286k i/100ms
Calculating -------------------------------------
          gauge main    336.417k (±10.3%) i/s -      1.678M in   5.040241s

Comparison:
          gauge main:   336417.0 i/s
      gauge perf-inc:   320784.3 i/s - same-ish: difference falls within error

Warming up --------------------------------------
      histogram main     9.862k i/100ms
Calculating -------------------------------------
      histogram main     98.633k (± 1.2%) i/s -    493.100k in   5.000117s

Comparison:
  histogram perf-inc:   101441.5 i/s
      histogram main:    98633.1 i/s - 1.03x  slower

So, 3% faster on histograms, 2% faster on counter, same on gauges.
Second run: 2% on counter, same on gauges and hist
On a third run, I got a freakishly high result for gauges (23%) 😳

I'm not sure how much to trust this benchmark... :/

Doing:

              st = internal_store
              value = st.read_value(key)
              st.write_value(key, value + by.to_f)

(so, sidestepping the double call to pid)

I get more or less the same result (so, either "same", or within 1 to 4% improvement to performance)

I'm going to try to see if I remember how my "heavy" benchmarking code worked, and see what results I get, but again, keeping in mind that testing on a Macbook is probably not super representative.

In the meantime, i'd recommend doing just the change to sidestep the double call to internal_store, which adds almost zero complexity, and doesn't hurt, really :)

@dmagliola
Copy link
Collaborator

Ok, I've changed my mind :)

I've run the datastores benchmarks, on a somewhat random m6i.2xlarge on EC2 that I have at hand, and they seem to be consistently faster. About 6% indeed.

Doing the caching of the variable as I was suggesting is also faster than baseline, but only like 2% (ish)

These seem pretty repeatable, and they consistently give me those results.

So I'm convinced, we should merge this.

@Sinjo, any objections to this?

@dmagliola
Copy link
Collaborator

Also, sorry, what I should've started with.

Thank you for looking into this and for the PR @splattael! :D

@splattael
Copy link
Contributor Author

splattael commented Mar 1, 2023

@dmagliola Thanks a ton for reviewing and merging 🙇

You are right regarding benchmarks - they were flaky for me locally too (on Linux) but in the end they gave me overall speed improvement after running several times. This improvement makes sense (not only because of side-stepping Process.pid) but also having less calls.

BTW, I am planning to reduce the amount of getpid (via Process.pid) syscalls by setting the PID on startup and override Kernel.fork (or Process._fork in Ruby 3.1+). See https://bugs.ruby-lang.org/issues/5446 and https://bugs.ruby-lang.org/issues/17795 for more context.

👋 @SuperQ ❤️

@splattael
Copy link
Contributor Author

In the meantime, i'd recommend doing just the change to sidestep the double call to internal_store, which adds almost zero complexity, and doesn't hurt, really :)

I had this very refactoring ready in another branch but saw less speed improvements. 😅

I should have provided more context before, sorry about 🙇

@dmagliola
Copy link
Collaborator

No worries, and thanks for the PR!
I'm going to wait on @Sinjo to confirm he's cool with this, but this should go out in the next release :)

I am planning to reduce the amount of getpid (via Process.pid) syscalls by setting the PID on startup and override Kernel.fork (or Process._fork in Ruby 3.1+).

I may be misreading the docs here, but is this a method intended to be monkeypatched? 😅

That makes me a bit uncomfortable, but it seems like it's designed precisely to do that...
It's unfortunate the underscore in the name makes it hard to google for examples, all I can find is the test case where it gets introduced...

So I guess we could do that prepend in the constructor of DirectFileStore.

Would that look like:

def _fork
  super
  DirectFileStore.process_pid = Process.pid 
end

Am I understanding that correctly?

@splattael
Copy link
Contributor Author

@dmagliola Yes, is the way I understood it as well 👍

It seems that Rails, for example, already does exactly this for Ruby 3.1+. See https://github.com/rails/rails/blob/v7.0.2/activesupport/lib/active_support/fork_tracker.rb#L50

@dmagliola
Copy link
Collaborator

dmagliola commented Mar 5, 2023

Ah, that's amazing! You found exactly the kind of example I was hoping to get! 🙌
That sounds like a sensible thing to do, and i'm looking forward to the PR!

NOTE: I would do "half" of what Rails is doing. I'd prepend if _fork is defined, but if not (if Ruby < 3.1), i'd leave it alone and continue checking for Process.pid on every increment. So you get the ~4% perf boost if you're on a modern Ruby version only.

The logic here being that _fork is intended for you to do this, but the other 3 aren't. I'm open minded about this. The argument "look, Rails does it and here's exactly how to do it safely" is pretty compelling, I just think we should limit how much we pollute the global namespace with "surprising" stuff.

Copy link
Member

@Sinjo Sinjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow reply on my side. This all looks great!

I agree with @dmagliola about the _fork extension point - if it's there we should use it, and if it's not then people should upgrade their Ruby version if they want a performance boost. I'd rather not be the reason somebody ends up with weird behaviour in their application.

Open to changing my mind on that if we can convince ourselves really thoroughly that the options for older Rubies are safe.

@dmagliola dmagliola merged commit 2022a41 into prometheus:main Mar 6, 2023
@dmagliola
Copy link
Collaborator

Thank you @splattael for the perf improvement, and looking forward to future PRs! 🙌

@splattael
Copy link
Contributor Author

@dmagliola @Sinjo I agree! We should use _fork automatically if available 👍

I wonder, though, if we could provide a way for users to opt-in to use cached process ids (via 🐒 patching Process.fork)? 🤔

So:

  • For Ruby 3.1+ - cached process id on by default via _fork hook
  • For <Ruby 3.1 - off by default but opt-in allowed if setup code is called explicitly

WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants