Performance degradation after creating many classes with the same name #826
I suspect this is heavily related to #575, but my analysis suggests the issue showed up in 19.2.0 (2019-10-01), and that issue was created before the release went live.

I made a simple test harness to demonstrate the issue and ran it over attrs versions 18.2, 19.1, 19.2, 20.1, 21.1, and 21.2, collecting the outputs. They suggest that something that went along with the 19.2 release is causing a performance leak. We noticed the issue because our CPU metrics started to spike when we upgraded attrs from version 18.2 directly to 21.2.
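A minimal sketch of such a harness, assuming it simply times repeated per-call attrs class creation with timeit (the bench_attrs name and the loop counts are illustrative):

```python
import timeit

import attr


def bench_attrs():
    # Re-create an identically named attrs class on every call, the
    # pattern whose cost grows as colliding generated "filenames" pile up.
    @attr.s
    class Response:
        value = attr.ib(default=None)

    return Response()


# On affected versions each batch runs slower than the last, because every
# creation scans past all previously cached filenames for this name.
for batch in range(5):
    print(batch, timeit.timeit(bench_attrs, number=1000))
```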
Looks like this change caused it (found using git bisect).
That’s highly unfortunate, and we probably should finally make benchmarks part of our workflow. If someone is really good at profiling, I’m all eyes. That said, I don’t think #575 is related to your benchmark (it talks only about creation of classes, not instantiation), and I also think it was more of a general observation about a growing code base, not a regression at a specific point in time.
It does look like this section here would take progressively longer with similarly named classes: eda9f2d#diff-ca2708dc9bf685ad676dfaa594561d8fc5c18ae39a7a5bc517049ef3fce09541R1002
OMG @euresti thank you so much – that was fast!
Am I squinting correctly that this is caused by an edge case where a lot of classes with the same fully qualified name are created? Is that really the case in your prod, or are we possibly chasing a red herring here?
But is this issue actually real? It only happens in timeit because we're creating a class inside a method in a big loop. I don't think this is super common behavior. If you do this:
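Presumably something along these lines, with the class defined once at module scope instead of inside the loop (the snippet itself is a sketch):

```python
import attr


@attr.s
class Response:
    value = attr.ib(default=None)


def do_it():
    # The class object is created once at import time; each call merely
    # instantiates it, so no new generated "filenames" are produced.
    return Response()
```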
The problem is gone. Or, if you really want to create a different class every time, you can use make_class and give each one a different name:
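A sketch of the make_class suggestion, giving every generated class a distinct name (the naming scheme is illustrative):

```python
import attr


def make_response_class(i):
    # Distinct names mean the generated "filenames" never collide, so
    # the collision-probing loop always succeeds on the first try.
    return attr.make_class("Response{0}".format(i), ["value"])


classes = [make_response_class(i) for i in range(3)]
```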
It was really the case in our production environment until today, when I finally tracked down the cause of the issue! 😅
It seems like it would be easy enough to support this, even though I agree that it is almost always avoidable. If instead of a count a UUID were used, the first generated name would always be unique.
Possible, though you might have to deal with multithreading issues then.
It seems like a SafeUUID (https://docs.python.org/3/library/uuid.html#uuid.SafeUUID) would fit the bill then?
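A sketch of the UUID idea (a hypothetical helper, not the attrs implementation): the name is unique on the first attempt, so the probing loop disappears, and UUID.is_safe (uuid.SafeUUID) reports whether the value was generated in a multiprocessing-safe way.

```python
import uuid


def _uuid_filename(cls, func_name):
    # A fresh uuid4 makes the filename unique without consulting
    # linecache; the trade-off is that it changes on every run.
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    return "<attrs generated {0} {1}.{2}-{3}>".format(
        func_name, cls.__module__, cls_name, uuid.uuid4()
    )
```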
I forgot to mention: the other requirement is that future runs generate the same "filename". This is to support tools that look at the stack trace.
I was wondering if that was going to be a requirement... That does complicate things a little. I will think on it.
So, this isn't pretty... but I think something along this line would work. Other than keeping an internal cache of unique names per class, which would bloat with the more common case of many uniquely named classes, I'm not sure there is a better option.
Perhaps the easier solution is to just warn against creating attr.s classes like this in the first place.
That is essentially the code that broke our system. It was written in 2017, and in all that time every engineer who looked at it must have thought it looked fine enough to leave it.
I mean, if the requirement is only that future runs generate the same filename, and we're not guarding against multiple versions of attrs installed at the same time all fighting for the same spots in the line cache, why not use a counter?

```python
import linecache

_unique_filename_counts = {}


def _generate_unique_filename(cls, func_name):
    """
    Create a "filename" suitable for a function being generated.
    """
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    key = (func_name, cls.__module__, cls_name)
    # Keep a running tally per qualified name instead of re-scanning
    # linecache for a free slot on every class creation.
    count = _unique_filename_counts.get(key, 0) + 1
    _unique_filename_counts[key] = count
    extra = "-{0}".format(count) if count > 1 else ""
    unique_filename = "<attrs generated {0} {1}.{2}{3}>".format(*key, extra)
    cache_line = (1, None, (str(count),), unique_filename)
    linecache.cache[unique_filename] = cache_line
    return unique_filename
```

The current implementation appears to be akin to iterating over every item in a list to calculate its length, instead of just keeping a tally every time the list changes. Creating a new class in every function call (especially one on a hot enough path to expose this perf problem) doesn't particularly seem wise, but it also seems entirely unexpected for these two snippets

```python
def do_it():
    class Response:
        pass

    return Response()
```

```python
import attr


def do_it_attr():
    @attr.s
    class Response:
        pass

    return Response()
```

... to differ in time by orders of magnitude.
Unless I'm missing something, this isn't safe if 2 threads are creating classes, though.
Just to sum up a bit: we're currently using an optimistic approach to handling concurrency, which works great for common cases. It leads to an O(n) performance degradation, though, if many identical classes are created. That's a very rare edge case as far as I can see. I'm willing to merge a solution that removes that edge-case degradation, as long as:
Would you be willing to share some info on why you do what you do, so we can understand your use case better?
I originally commented this from my other GitHub account, but for clarity's sake I'm reposting from my original account, sorry!
I shared a simplified version of what was happening in our codebase above; I'll copy it down here for easier readability:
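Presumably the same shape as the do_it_attr snippet earlier in the thread; a sketch with illustrative names (handle_request, payload):

```python
import attr


def handle_request(payload):
    # A fresh attrs class is defined on every call, so each request pays
    # for class creation plus the linecache filename-collision scan.
    @attr.s
    class Response:
        payload = attr.ib()

    return Response(payload)
```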
Whether writing code like this is good or bad is, I think, a different debate. The real question is: is it reasonable for a consumer of this library to assume that this is an okay thing to do? And I think it is reasonable to assume that. I have no idea what the motivation behind the unique filename is, or which requirements on it are nice-to-haves vs. must-haves, but it seems to me that thread safety is really the only must-have? Earlier it was mentioned...
But the current implementation doesn't do that once threads are involved, so it seems like a soft requirement. So perhaps an easy solution is to take @theY4Kman's solution but also drop in the thread name.
I believe this would behave almost identically to the current implementation while being thread safe and not devouring the CPU.
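A sketch of that combination, assuming the thread name is simply folded into the counter key (illustrative, not the actual attrs code):

```python
import linecache
import threading

_unique_filename_counts = {}


def _generate_unique_filename(cls, func_name):
    """Create a "filename" for a generated function, keyed per thread."""
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    # With the creating thread's name in the key, two threads never
    # read-modify-write the same counter entry.
    key = (func_name, cls.__module__, cls_name, threading.current_thread().name)
    count = _unique_filename_counts.get(key, 0) + 1
    _unique_filename_counts[key] = count
    extra = "-{0}".format(count) if count > 1 else ""
    unique_filename = "<attrs generated {0} {1}.{2} {3}{4}>".format(*key, extra)
    linecache.cache[unique_filename] = (1, None, (str(count),), unique_filename)
    return unique_filename
```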
Thread names aren't guaranteed to be unique, right? We could maybe use… In any case, I guess this issue was motivated by efficiency concerns, but creating a new class for every function invocation is very inefficient by itself (you're evaling a couple of functions at least, __init__ and __repr__, not to mention creating a class object), so maybe just stop doing that? 😇
I definitely agree that it is not the ideal way to write Python code, but is it so egregious that it should cripple the CPU? Once I identified the issue it was trivial to fix, but it took a lot of monitoring tooling to figure out that attrs was the culprit. Not that it matters a terrible amount, but the front page of https://www.attrs.org/en/stable/overview.html does promise "performance" and "no surprises". And as @theY4Kman mentioned, neither of the two snippets above looks like a CPU-devouring code block.
Right, if we want to catch this, we could detect the repeated re-creation and blow up?
Blowing up is extreme, but raising a helpful warning seems appropriate.
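A sketch of such a warning, reusing the per-name tally idea from above (the helper name, threshold, and message are hypothetical):

```python
import warnings

_creation_counts = {}
_WARN_THRESHOLD = 100  # hypothetical cutoff


def _note_class_creation(module, qualname):
    # Tally how often each qualified name is re-created and nudge the
    # user once the count looks like a class-per-call pattern.
    key = (module, qualname)
    count = _creation_counts.get(key, 0) + 1
    _creation_counts[key] = count
    if count == _WARN_THRESHOLD:
        warnings.warn(
            "class {0}.{1} has been created {2} times; consider defining "
            "it once at module scope or using make_class with distinct "
            "names".format(module, qualname, count),
            RuntimeWarning,
            stacklevel=2,
        )
```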
Could we just hash the source code and use that? If you have a thousand identical classes, it would stand to reason they would share the same hash, and therefore the same filename.
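A sketch of the hashing idea (a hypothetical helper; assumes the generated source is available at naming time):

```python
import hashlib


def _hashed_filename(cls, func_name, source):
    # Identical generated source yields the identical digest, so a
    # thousand identical classes share one linecache entry, and the
    # name is stable across runs because the hash is deterministic.
    cls_name = getattr(cls, "__qualname__", cls.__name__)
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]
    return "<attrs generated {0} {1}.{2} {3}>".format(
        func_name, cls.__module__, cls_name, digest
    )
```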
Sorry for disappearing for a few days... life calls. Would it be unreasonable to ask what the use cases are for requiring that future runs generate the same "filename"?
It's entirely reasonable to assume the odd collision in the names of classes. For example, off the top of my head I can tell you my work code base has two completely different attrs classes that share a name.

That said, I might have an idea of how to improve the situation.
But are they in the same module with the same class name? Just having two classes that share a name in different modules won't trigger this, since the module is part of the generated filename.
I am pretty sure the only way to trigger this code path, outside of a local scope, is to do something like:
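Presumably something like redefining an identically named class at module scope (sketch):

```python
import attr


@attr.s
class C:
    x = attr.ib()


# Same module, same qualified name: this second definition collides with
# the first generated "filename" and has to probe for a free slot.
@attr.s
class C:  # noqa: F811 (deliberate redefinition)
    x = attr.ib()
```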
Or reloading the module.
Fixed by #828.