Batching of `lint` and `fmt` invokes #14186
Conversation
force-pushed from 7531a57 to e28c35b
# To stably partition the arguments into ranges of at least `size_min`, we sort them, and
# create a new batch sequentially once we have the minimum number of entries, _and_ we encounter
# an item hash prefixed with a threshold of zeros.
I'm definitely missing what the leading zeros check is about. You explain trying to accommodate adding items disturbing batches minimally, but I don't understand the mechanism of how this helps. Is it tied to characteristics of the Rust hash function used? Maybe a unit test of this function that shows how adding an item to an N bucket chain only results in ~1 bucket changing contents?
+1 to unit tests of this function.
I can sort of see what's going on here, but some comments explaining it would be really helpful. Especially justifying the selection of `zero_prefix_threshold`.
Expanded the comment and added a test.
Neat, thanks.
"--batch-size", | ||
advanced=True, | ||
type=int, | ||
default=128, |
We technically shouldn't change this default after we release this. Thoughts on whether it's worth us trying to benchmark what the optimal number is? I imagine that's hard to arrive at, not least because it depends on your machine's specs.
Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔
Could be fun to benchmark though 😉
I did a bit of benchmarking to settle on `128` here on an integration branch with #9964 included: `128` was best by ~1%. Additional benchmarking and adjustment after both this and #9964 have landed will be good, since they interplay strongly with one another: I'll include more numbers over there.
> Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔
Yea, there are potentially a lot of dimensions here. But I think that from a complexity perspective, we're not going to want to expose per-tool batch size knobs without some quantitative justification (post landing).
OK this one had me stumped until I tried it. My assumption was that the batches would fill all the threads and so in-tool parallelization would only result in over-allocating your resources. What I see though is that depending on the number of threads available, number of files in the target-list, and batch size, there are many points in time you're running fewer batches than available threads.
Of course, (and I hope to show this via data) using the poor-man's in-tool parallelization is likely not ideal as it isn't dynamic and would result in over-allocation of resources in the "bursts" where there are more-than-thread-count of rules.
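For a rough sense of the effect described above, a tiny illustration (the numbers are made up for the example, not measurements):

```python
# Illustrative numbers only: with fewer batches than threads, some threads sit idle
# unless the tool itself can parallelize internally.
files, batch_size, threads = 1000, 128, 16
batches = -(-files // batch_size)          # ceil(1000 / 128) == 8 batches
idle_threads = max(0, threads - batches)   # 16 - 8 == 8 threads with no batch to run
print(batches, idle_threads)               # 8 8
```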
The Python looks good to me! Tomorrow, time-permitting I can give you some timings on our monorepo (and compare them to the poor-man's in-tool parallelism as well).
Additionally, I'd be willing to spend a few cycles giving a data point or two on batch size.
It might be worth measuring the cache characteristics of this change over a few changes. Conceivably this change could help both performance and cache size (as adding new files would only invalidate less-than-N buckets).
And speaking of measuring over a few changes, this has got my wheels spinning on possible bucketing techniques. I'm curious to see how `partition_sequentially` works (shown via unit tests, ideally 😉).
@@ -39,7 +40,7 @@ def register_options(cls, register):

 class TffmtRequest(FmtRequest):
-    pass
+    field_set_type = TerraformFieldSet
(Not specific to this PR)
I find it somewhat noteworthy that as Pants evolves its way of doing things, little idiosyncrasies come out of the woodwork (like this one). In #14182 I noticed `GofmtRequest` didn't inherit from `LintRequest`.
It seems to me potentially hazardous that code can "seemingly work" in one dimension or another, but then a change brings to light that it was incorrectly configured. I wonder when third-party plugins become more standard how we can be proactive about avoiding these possible idiosyncrasies.
This particular change is because `mypy` can't check some usages of `ClassVar`, unfortunately. This is an abstract `ClassVar` that it failed to check was implemented for this subclass of `FmtRequest`.
> In #14182 I noticed `GofmtRequest` didn't inherit from `LintRequest`.
That is because `@union` doesn't actually require that a type extend any particular interface, although @kaos has prototyped changing that: #12577. There is rough consensus that we should change unions, but not yet on exactly how.
"--batch-size", | ||
advanced=True, | ||
type=int, | ||
default=128, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> Not just machine specs, but each tool will likely exhibit different characteristics affected by batch size 🤔
> Could be fun to benchmark though 😉
Thanks! As mentioned in the comments, #9964 will further affect this. My own benchmarking has shown that combining the two is a little bit faster than either alone.
batch.append(item)
if (
    len(batch) >= size_min
    and native_engine.hash_prefix_zero_bits(item_key) >= zero_prefix_threshold
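For readers trying to follow the mechanism, here is a minimal, self-contained sketch of the scheme under discussion. It is not the actual `partition_sequentially` implementation: it uses a stable stand-in for `native_engine.hash_prefix_zero_bits`, picks the threshold as roughly `log2(size_min)` (the exact constant in the PR may differ), and omits the `size_max` handling.

```python
import hashlib
import math
from collections.abc import Iterable


def _zero_prefix_bits(key: str) -> int:
    # Stand-in for native_engine.hash_prefix_zero_bits: leading zero bits of a stable
    # digest of the key (builtin hash() is salted per process, so it isn't used here).
    digest = hashlib.blake2b(key.encode()).digest()
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def partition_sketch(items: Iterable[str], size_min: int) -> list[list[str]]:
    # Sort for stability, then close a batch once it has at least `size_min` entries
    # AND the current item's hash starts with enough zero bits. Because "is a boundary"
    # depends only on each item's own hash, most batches survive small input changes.
    zero_prefix_threshold = math.ceil(math.log2(size_min))  # ~one boundary per size_min items
    batches: list[list[str]] = []
    batch: list[str] = []
    for item in sorted(items):
        batch.append(item)
        if len(batch) >= size_min and _zero_prefix_bits(item) >= zero_prefix_threshold:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches
```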
I don't know what the overhead of calling through to Rust is, but are we sure that doing it a lot in a tight loop like this is worth it?
It should be very, very low... similar to calling into any builtin Python function. The rust function takes a reference to the Python string, so it shouldn't even be copied at the boundary.
>>> timeit.timeit(lambda: native_engine.hash_prefix_zero_bits("example"), number=1000000)
0.19578884500000981
>>> timeit.timeit(lambda: hash("example"), number=1000000)
0.14594443799998658
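Back-of-the-envelope from those timings (just arithmetic on the numbers above):

```python
# Both calls are sub-microsecond; the Rust call costs roughly 50 ns more per item.
rust_ns = 0.19578884500000981 / 1_000_000 * 1e9      # ~196 ns per call
builtin_ns = 0.14594443799998658 / 1_000_000 * 1e9   # ~146 ns per call
print(f"{rust_ns:.0f} ns vs {builtin_ns:.0f} ns")    # 196 ns vs 146 ns
```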
batch: list[_T] = []

def emit_batch() -> list[_T]:
    assert batch
Not sure we idiomatically use `assert` in non-test code?
There are a few other non-test instances of `assert` in `util`... but this seems like a safe case: validating that a private function is called correctly.
I'm with @stuhood: using asserts in normal code is a good way to assert assumptions, document what might otherwise not necessarily be obvious, and of course to coerce mypy.
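A tiny example of the pattern (hypothetical helper, not code from this PR):

```python
from __future__ import annotations


def _describe(batch_size: int | None) -> str:
    # The assert documents the assumption (the option has a default, so it is never
    # None by the time we get here) and narrows `int | None` to `int` for mypy.
    assert batch_size is not None
    return f"batches of at least {batch_size} items"
```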
force-pushed from b83c36c to d14468e
Applied all review feedback: please take another look.
# probability of a hash prefixed with Z zero bits is 1/2^Z, and so to break after N items on
# average, we look for `Z == log2(N)` zero bits.
#
# Breaking on these deterministic boundaries means that adding any single item will affect
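Concrete numbers for the comment above (illustrative; the exact constant used in the PR may differ slightly):

```python
import math

target = 128                          # desired average batch size
zero_bits = int(math.log2(target))    # 7
probability = 1 / 2 ** zero_bits      # 1/128: chance that a given key's hash is a boundary
print(zero_bits, probability)         # 7 0.0078125
```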
Thanks. Ok, but this assumes ... the same CLI specs with file edits / adds in between? In other words, this whole scheme does ~nothing for `./pants lint bob::` followed by `./pants lint ::` - those will likely have completely different buckets - no hits at all. If that's right and all this fanciness is to support repeatedly running the same Pants command - that's great, but it seems useful to call out somewhere. I'm not sure where. Here is not the place, since this would be about the larger picture and the call-after-call inputs to this function from a higher layer. If I've got this wrong and this always works - that's awesome; I'm still mystified though how that works. If I've got it right, it would be great to get a comment that spells out that we're optimizing for a particular use pattern.
> Thanks. Ok, but this assumes ... the same CLI specs with file edits / adds in between? In other words, this whole scheme does ~nothing for `./pants lint bob::` followed by `./pants lint ::` - those will likely have completely different buckets - no hits at all.
This is mostly about optimizing the re-running of a single command over time, yea. But because the inputs are sorted before hashing, `bob::` should fall entirely into sequential buckets within the entire set of `::`, and some of those might be able to hit in the larger set.

Unfortunately, it's hard to make promises about that in the presence of the `min`/`max` thresholds. If we were breaking purely along bucket boundaries (without additionally setting min/max sizes), we could make better guarantees (because the bucket boundaries within `bob::` would be the same when the rest of `::` was added).

Thanks for raising this point though... it does actually seem like a desirable enough use case to try and strengthen this further.
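A self-contained illustration of the point about sorted subsets (using a stand-in hash, and ignoring the `min`/`max` thresholds that can still shift edges in the real implementation; paths are made up):

```python
import hashlib


def _zero_prefix_bits(key: str) -> int:
    # Stand-in for native_engine.hash_prefix_zero_bits.
    digest = hashlib.blake2b(key.encode()).digest()
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


THRESHOLD = 4
full = sorted(f"src/{d}/file{i}.py" for d in ("alice", "bob") for i in range(100))
subset = [f for f in full if f.startswith("src/bob/")]

# A key is a potential batch boundary based only on its own hash, so the subset's
# boundary keys are exactly the full set's boundary keys restricted to the subset.
boundaries_full = {k for k in full if _zero_prefix_bits(k) >= THRESHOLD}
boundaries_subset = {k for k in subset if _zero_prefix_bits(k) >= THRESHOLD}
assert boundaries_subset == boundaries_full & set(subset)
```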
I'm not sure if strengthening this further, or even using the current trickery, belongs in this PR. IIUC batching is purely a win over the status quo today, even if the batches are simply formed from fixed size buckets over the sorted input and nothing more. If that's true, then this PR is only muddied by the fiddly bits here and maybe adding the fiddly bits en-masse could be done as a follow up that gives magic speed ups for free.
The reason I think that doing it here might be useful is that it would adjust the meaning of the `--batch-size` flag (from "minimum size" to "target size")... it should be a quick change, I think. It is actually desirable for smaller batches to improve cache hit rates (which is why the `per-file-caching` flag existed in the first place).
@Eric-Arellano brought up a good point above about the fact that changing the batch size or implementation could have negative repercussions (due to the cache being invalidated). However, as @jsirois points out, any batching is a win.
IMO the solution with #9964 is the "baseline" and any improvement over that is gravy. If Pants finds better solutions/numbers over time, I'm OK with a one-time cache bust which results in speedier runs going forward.
Because of the amount of variance here with tools/machine/repo I think the best approach is probably a data-driven one. You might try building a performance test suite to test various numbers/algorithms and asking the community (and/or Toolchain's customers) to run it overnight (or during a dead time) on a machine or two. I know I'd happily volunteer some CPU cycles to help find the optimal config.
> The reason I think that doing it here might be useful is that it would adjust the meaning of the `--batch-size` flag (from "minimum size" to "target size")

Imo it's fine to change the behavior up till 2.10.0rc0. This seems worth landing as-is.
...it was not, in fact, a "quick change", because removing the min threshold and attempting to use only the hash to choose boundaries requires increasing the threshold to the point where our sweet spot of bucket size (~128) is too likely to never encounter a good boundary.
Since the code as-is already snaps to boundaries, I feel fairly confident that we'll already get some matches in the `bob::` vs `::` case, but can defer looking at that to a followup.
I couldn't help myself. #14210.
"--batch-size", | ||
advanced=True, | ||
type=int, | ||
default=128, |
Does this need to be a power of two?
No.
I want to check my mental model with this and #9964. We avoid too much parallelism thanks to the engine:
return set(tuple(p) for p in partition_sequentially(items, key=str, size_min=size_min))

# We start with base items containing every other element from a sorted sequence.
all_items = sorted((f"item{i}" for i in range(0, 64)))
Please do not burn a tree for this, but FYI you already had a generator expression and the extra `()` is superfluous:

- all_items = sorted((f"item{i}" for i in range(0, 64)))
+ all_items = sorted(f"item{i}" for i in range(0, 64))
@Eric-Arellano I think there is a flake8 plugin that detects those... maybe we should add it to this repo.
https://pypi.org/project/flake8-comprehensions/
That'd be great!
ok. will work on this. PRs coming soon.
…ility. (#14210)

As a followup to #14186, this change improves the stability (and thus cache hit rates) of batching by removing the minimum bucket size. It also fixes an issue in the tests, and expands the range that they test.

As mentioned in the expanded comments: capping bucket sizes (in either the `min` or the `max` direction) can cause streaks of bucket changes: when a bucket hits a `min`/`max` threshold and ignores a boundary, it increases the chance that the next bucket will trip a threshold as well. Although it would be most stable to remove the `max` threshold entirely, it is necessary to resolve the correctness issue of #13462. But we _can_ remove the `min` threshold, and so this change does that.

[ci skip-rust] [ci skip-build-wheels]
I didn't see this sooner. Does
Possibly. But the big difference with
…14184) When tools support internal concurrency and cannot be partitioned (either because they don't support it, such as in the case of a PEX resolve, or because of overhead to partitioning as fine-grained as desired), Pants' own concurrency currently makes it ~impossible for them to set their concurrency settings correctly.

As sketched in #9964, this change adjusts Pants' local runner to dynamically choose concurrency values per process based on the current concurrency.

1. When acquiring a slot on the `bounded::CommandRunner`, a process takes as much concurrency as it a) is capable of, as configured by a new `Process.concurrency_available` field, b) deserves for the purposes of fairness (i.e. half, for two processes). This results in some amount of over-commit.
2. Periodically, a balancing task runs and preempts/re-schedules processes which have been running for less than a very short threshold (`200ms` currently) and which are the largest contributors to over/under-commit. This fixes some over/under-commit, but not all of it, because if a process becomes over/under-committed after it has been running a while (because other processes started or finished), we will not preempt it.

Combined with #14186, this change results in an additional 2% speedup for `lint` and `fmt`. But it should also have a positive impact on PEX processes, which were the original motivation for #9964.

Fixes #9964.

[ci skip-build-wheels]
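To make the first point concrete, a hedged sketch of what declaring `concurrency_available` on a process might look like. The import path and the surrounding arguments are assumptions based on the description above, not code from the PR:

```python
from pants.engine.process import Process  # import path assumed

files = ("src/a.py", "src/b.py", "src/c.py", "src/d.py")
process = Process(
    argv=("my-linter", *files),
    description=f"Lint {len(files)} files with my-linter",
    # Declare how much concurrency this process *could* use (here: one core per file).
    # Per the description above, the bounded runner grants some share of this based on
    # what else is running, and may preempt/re-schedule within the first ~200ms to rebalance.
    concurrency_available=len(files),
)
```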
This adds generic support for `lint` implementations that do not deal with targets. That allows us to merge `validate` into `lint`, which is much cleaner.

## CLI specs

As before with the `validate` goal, it's not very intuitive how to get Pants to run on files not owned by targets, which you want for `validate`. `::` only matches files owned by targets, whereas `**` matches _all_ files regardless of targets. So, users of `regex-lint` should typically use `./pants lint '**'` rather than `./pants lint ::`, which is not intuitive.

https://docs.google.com/document/d/1WWQM-X6kHoSCKwItqf61NiKFWNSlpnTC5QNu3ul9RDk/edit#heading=h.1h4j0d5mazhu proposes changing `::` to match all files, so you can simply use `./pants lint ::`. I don't think we need to block on this proposal? This is still forward progress, and also `validate`/`regex-lint` is not used very much fwict.

## Batching

We don't yet batch per #14186, although it would be trivial for us to hook up. I'm only waiting to do it till we can better reason about if it makes sense to apply here too.

## The `fmt` goal

Note that we need more design for `fmt` before we can apply this same change there. fmt is tricky because we run each formatter for a certain language sequentially so that they don't overwrite each other; but we run distinct languages in parallel. We would need some way to know which "language" target-less files are for.

## "Inferred targets"

A related technology would be inferred targets, where you don't need a BUILD file but we still have a target: #14074. This is a complementary technology. The main difference here is that we can operate on files that will _never_ have an owning target, such as a BUILD file itself.

[ci skip-rust] [ci skip-build-wheels]
As described in #13462, there are correctness concerns around not breaking large batches of files into smaller batches in `lint` and `fmt`. But there are other reasons to batch, including improving the performance of linters which don't support internal parallelism (by breaking them into multiple processes which can be parallelized).

This change adds a function to sequentially partition a list of items into stable batches, and then uses it to create batches by default in `lint` and `fmt`. Sequential partitioning was chosen rather than bucketing by hash, because it was easier to reason about in the presence of minimum and maximum bucket sizes.

Additionally, this implementation is at the level of the `lint` and `fmt` goals themselves (rather than within individual `lint`/`fmt` `@rule` sets, as originally suggested on the ticket) because that reduces the effort of implementing a linter or formatter, and will likely ease doing further declarative partitioning in those goals (by `Field` values, for example).

`./pants --no-pantsd --no-local-cache --no-remote-cache-read fmt lint ::` runs about ~4% faster than on main.

Fixes #13462.

[ci skip-build-wheels]