Performance degradation after creating many classes with the same name #826
I suspect this is heavily related to #575, but my analysis suggests the issue showed up in 19.2.0 (2019-10-01), and that issue was created before the release went live.

I made a simple test harness to demonstrate the issue and ran it over attrs versions 18.2, 19.1, 19.2, 20.1, 21.1, and 21.2, collecting the outputs. They suggest that something that went along with the 19.2 release is causing a performance leak. We noticed the issue because our CPU metrics started to spike when we upgraded attrs from version 18.2 directly to 21.2.
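A minimal sketch of such a harness, assuming it simply times repeated per-call attrs class creation with timeit (the bench_attrs name and the loop counts are illustrative):

```python
import timeit

import attr


def bench_attrs():
    # Re-create an identically named attrs class on every call, the
    # pattern whose cost grows as colliding generated "filenames" pile up.
    @attr.s
    class Response:
        value = attr.ib(default=None)

    return Response()


# On affected versions each batch runs slower than the last, because every
# creation scans past all previously cached filenames for this name.
for batch in range(5):
    print(batch, timeit.timeit(bench_attrs, number=1000))
```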
Looks like this change caused it (found using git bisect).
That’s highly unfortunate, and we probably should finally make benchmarks part of our workflow. If someone is really good at profiling, I’m all eyes. That said, I don’t think #575 is related to your benchmark (it talks only about creation of classes, not instantiation), and I also think it was more of a general observation about a growing code base, not a regression at a specific point in time.
It does look like this section here would take progressively longer with similarly named classes: eda9f2d#diff-ca2708dc9bf685ad676dfaa594561d8fc5c18ae39a7a5bc517049ef3fce09541R1002
OMG @euresti thank you so much – that was fast!
Am I squinting correctly that this is caused by an edge case where a lot of classes with the same fully qualified name are created? Is that really the case in your prod, or are we possibly chasing a red herring here?
But is this issue actually real? It only happens in timeit because we're creating a class inside a method in a big loop. I don't think this is super common behavior. If you do this:
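Presumably something along these lines, with the class defined once at module scope instead of inside the loop (the snippet itself is a sketch):

```python
import attr


@attr.s
class Response:
    value = attr.ib(default=None)


def do_it():
    # The class object is created once at import time; each call merely
    # instantiates it, so no new generated "filenames" are produced.
    return Response()
```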
The problem is gone. Or, if you really want to create a different class every time, you can use make_class and give each one a different name:
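A sketch of the make_class suggestion, giving every generated class a distinct name (the naming scheme is illustrative):

```python
import attr


def make_response_class(i):
    # Distinct names mean the generated "filenames" never collide, so
    # the collision-probing loop always succeeds on the first try.
    return attr.make_class("Response{0}".format(i), ["value"])


classes = [make_response_class(i) for i in range(3)]
```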
It was really the case in our production environment until today, when I finally tracked down the cause of the issue! 😅
It seems like it would be easy enough to support this, even though I agree that it is almost always avoidable. If instead of a count a UUID were used, the first generated name would always be unique.
Possible, though you might have to deal with multithreading issues then.
It seems like a SafeUUID (https://docs.python.org/3/library/uuid.html#uuid.SafeUUID) would fit the bill then?
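A sketch of the UUID idea (a hypothetical helper, not the attrs implementation): the name is unique on the first attempt, so the probing loop disappears, and UUID.is_safe (uuid.SafeUUID) reports whether the value was generated in a multiprocessing-safe way.

```python
import uuid


def _uuid_filename(cls, func_name):
    # A fresh uuid4 makes the filename unique without consulting
    # linecache; the trade-off is that it changes on every run.
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    return "<attrs generated {0} {1}.{2}-{3}>".format(
        func_name, cls.__module__, cls_name, uuid.uuid4()
    )
```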
I forgot to mention: the other requirement is that future runs generate the same "filename". This is to support tools that look at the stack trace.
I was wondering if that was going to be a requirement... That does complicate things a little. I will think on it.
So, this isn't pretty... but I think something along this line would work. Other than keeping an internal cache of unique names per class, which would bloat with the more common case of many uniquely named classes, I'm not sure there is a better option.
Perhaps the easier solution is to just warn against creating attr.s classes like this in the first place.
That is essentially the code that broke our system. It was written in 2017, and in all that time every engineer who looked at it must have thought it looked fine enough to leave it.
I mean, if the requirement is only that future runs generate the same filename, and we're not guarding against multiple versions of attrs installed at the same time all fighting for the same spots in the line cache, why not use a counter?

```python
import linecache

_unique_filename_counts = {}


def _generate_unique_filename(cls, func_name):
    """
    Create a "filename" suitable for a function being generated.
    """
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    key = (func_name, cls.__module__, cls_name)
    # Keep a running tally per qualified name instead of re-scanning
    # linecache for a free slot on every class creation.
    count = _unique_filename_counts.get(key, 0) + 1
    _unique_filename_counts[key] = count
    extra = "-{0}".format(count) if count > 1 else ""
    unique_filename = "<attrs generated {0} {1}.{2}{3}>".format(*key, extra)
    cache_line = (1, None, (str(count),), unique_filename)
    linecache.cache[unique_filename] = cache_line
    return unique_filename
```

The current implementation appears to be akin to iterating over every item in a list to calculate its length, instead of just keeping a tally every time the list changes. Creating a new class in every function call (especially one on a hot enough path to expose this perf problem) doesn't particularly seem wise, but it also seems entirely unexpected for these two snippets

```python
def do_it():
    class Response:
        pass

    return Response()
```

```python
import attr


def do_it_attr():
    @attr.s
    class Response:
        pass

    return Response()
```

... to differ in time by orders of magnitude.
Unless I'm missing something, this isn't safe if 2 threads are creating classes, though.
Just to sum up a bit: we're currently using an optimistic approach to handling concurrency, which works great for common cases. It leads to an O(n) performance degradation, though, if many identical classes are created. That's a very rare edge case as far as I can see. I'm willing to merge a solution that removes that edge-case degradation, as long as:
Would you be willing to share some info on why you do what you do, so we can understand your use case better?
I originally commented this from my other GitHub account, but for clarity's sake I'm reposting from my original account, sorry!
I shared a simplified version of what was happening in our codebase above; I'll copy it down here for easier readability:
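Presumably the same shape as the do_it_attr snippet earlier in the thread; a sketch with illustrative names (handle_request, payload):

```python
import attr


def handle_request(payload):
    # A fresh attrs class is defined on every call, so each request pays
    # for class creation plus the linecache filename-collision scan.
    @attr.s
    class Response:
        payload = attr.ib()

    return Response(payload)
```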
Whether writing code like this is good or bad is, I think, a different debate. The real question is: is it reasonable for a consumer of this library to assume that this is an okay thing to do? And I think it is reasonable to assume that. I have no idea what the motivation behind the unique filename is, or which requirements on it are nice-to-haves vs. must-haves, but it seems to me that thread safety is really the only must-have? Earlier it was mentioned...
But the current implementation doesn't do that once threads are involved, so it seems like a soft requirement. So perhaps an easy solution is to take @theY4Kman's solution but also drop in the thread name.
I believe this would behave almost identically to the current implementation while being thread safe and not devouring the CPU.
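A sketch of that combination, assuming the thread name is simply folded into the counter key (illustrative, not the actual attrs code):

```python
import linecache
import threading

_unique_filename_counts = {}


def _generate_unique_filename(cls, func_name):
    """Create a "filename" for a generated function, keyed per thread."""
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    # With the creating thread's name in the key, two threads never
    # read-modify-write the same counter entry.
    key = (func_name, cls.__module__, cls_name, threading.current_thread().name)
    count = _unique_filename_counts.get(key, 0) + 1
    _unique_filename_counts[key] = count
    extra = "-{0}".format(count) if count > 1 else ""
    unique_filename = "<attrs generated {0} {1}.{2} {3}{4}>".format(*key, extra)
    linecache.cache[unique_filename] = (1, None, (str(count),), unique_filename)
    return unique_filename
```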
Thread names aren't guaranteed to be unique, right? We could maybe use… In any case, I guess this issue was motivated by efficiency concerns, but creating a new class for every function invocation is very inefficient by itself (you're evaling a couple of functions at least, __init__ and __repr__, not to mention creating a class object), so maybe just stop doing that? 😇
I definitely agree that it is not the ideal way to write Python code, but is it so egregious that it should cripple the CPU? Once I identified the issue it was trivial to fix, but it took a lot of monitoring tooling to figure out that attrs was the culprit. Not that it matters a terrible amount, but the front page of https://www.attrs.org/en/stable/overview.html does promise "performance" and "no surprises". And as @theY4Kman mentioned, neither of the two snippets above looks like a CPU-devouring code block.
Right, if we want to catch this, we could detect the repeated re-creation and blow up?
Blowing up is extreme, but raising a helpful warning seems appropriate.
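A sketch of such a warning, reusing the per-name tally idea from above (the helper name, threshold, and message are hypothetical):

```python
import warnings

_creation_counts = {}
_WARN_THRESHOLD = 100  # hypothetical cutoff


def _note_class_creation(module, qualname):
    # Tally how often each qualified name is re-created and nudge the
    # user once the count looks like a class-per-call pattern.
    key = (module, qualname)
    count = _creation_counts.get(key, 0) + 1
    _creation_counts[key] = count
    if count == _WARN_THRESHOLD:
        warnings.warn(
            "class {0}.{1} has been created {2} times; consider defining "
            "it once at module scope or using make_class with distinct "
            "names".format(module, qualname, count),
            RuntimeWarning,
            stacklevel=2,
        )
```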
Could we just hash the source code and use that? If you have a thousand identical classes, it would stand to reason they would share the same hash, and therefore the same filename.
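A sketch of the hashing idea (a hypothetical helper; assumes the generated source is available at naming time):

```python
import hashlib


def _hashed_filename(cls, func_name, source):
    # Identical generated source yields the identical digest, so a
    # thousand identical classes share one linecache entry, and the
    # name is stable across runs because the hash is deterministic.
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]
    return "<attrs generated {0} {1}.{2} {3}>".format(
        func_name, cls.__module__, cls_name, digest
    )
```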
Sorry for disappearing for a few days... life calls. Would it be unreasonable to ask what the use cases are for requiring that future runs generate the same "filename"?
It's entirely reasonable to assume the odd collision in the names of classes. For example, off the top of my head I can tell you my work code base has two completely different attrs classes that share a name.

That said, I might have an idea of how to improve the situation.
But are they in the same module with the same class name? Just having two classes that share a name in different modules won't trigger this, since the module is part of the generated filename.
I am pretty sure the only way to trigger this code path, outside of a local scope, is to do something like:
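Presumably something like redefining an identically named class at module scope (sketch):

```python
import attr


@attr.s
class C:
    x = attr.ib()


# Same module, same qualified name: this second definition collides with
# the first generated "filename" and has to probe for a free slot.
@attr.s
class C:  # noqa: F811 (deliberate redefinition)
    x = attr.ib()
```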
Or reloading the module.
Fixed by #828.