Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-107089: Improve Shelf.clear method performance #107090

Merged
merged 4 commits into from
Jul 29, 2023

Conversation

jtcave
Copy link
Contributor

@jtcave jtcave commented Jul 23, 2023

Shelf inherits its clear method from MutableMapping, but the implementation is poorly suited for that class. The clear mix-in is implemented by calling popitem in a loop. Each call to popitem constructs a new iterator over the shelf, which is a generator that does byte-to-str conversion on the keys in the dbm object. Unfortunately, dbm objects are non-iterable, and their keys method simply returns a list of all keys. Since each popitem call performs a full scan of the key space, the clear method ends up having O(n²) performance (assuming the delete and read operations amortize to O(1)).

By having the shelf just iterate over the list of dbm keys once, this is avoided.

@cpython-cla-bot
Copy link

cpython-cla-bot bot commented Jul 23, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@eendebakpt
Copy link
Contributor

Thanks for the clear description of the problem. The solution looks ok, but I am a bit concerned it is not fully backwards compatible with classes derived from Shelf. A safer option would be to move the clear method to the DbfilenameShelf.

@corona10
Copy link
Member

See: #107122

Lib/shelve.py Outdated Show resolved Hide resolved
@erlend-aasland erlend-aasland removed their request for review July 23, 2023 15:19
@corona10
Copy link
Member

Thanks for the clear description of the problem. The solution looks ok, but I am a bit concerned it is not fully backwards compatible with classes derived from Shelf. A safer option would be to move the clear method to the DbfilenameShelf.

+1

@jtcave
Copy link
Contributor Author

jtcave commented Jul 24, 2023

It appears that the ndbm.clear method, while substantially faster than the MutableMapping implementation, is still an order of magnitude slower than the fallback code.

Revised test script:

import os
import shelve
import tempfile
import timeit

def clearshelf(db):
    keys = list(db.keys())
    for k in keys:
        del db[k]

def test_clear(use_clear_method):
    with tempfile.TemporaryDirectory() as tempdir:
        filename = os.path.join(tempdir, "test-shelf")
        with shelve.open(filename) as db:
            items = {str(x):x for x in range(10000)}
            db.update(items)
            if use_clear_method:
                db.clear()
            else:
                clearshelf(db)
            assert len(db) == 0

test_with_clear = lambda: test_clear(True)
test_without_clear = lambda: test_clear(False)

if __name__ == "__main__":
    TRIALS = 5
    method_time = timeit.timeit(test_with_clear, number=TRIALS, globals=globals()) / TRIALS
    print(f"method:\t{method_time:.3}")
    func_time = timeit.timeit(test_without_clear, number=TRIALS, globals=globals()) / TRIALS
    print(f"func:\t{func_time:.3}")

Running on my branch:

james@iris cpython-build % ./python.exe ~/shelve-clear-test.py 
method:	0.596
func:	0.0272

Edited Jul 26: Finally got to test the other dbms, using Rocky Linux/aarch64 9.2.

With gdbm:

[james@trina cpython-build]$ ./python ~/shelve-clear-test.py 
method: 0.0775
func:   0.0305

By forcing dbm.dumb:

method: 13.5
func:   13.4

The times are a lot closer on gdbm and practically the same on dbm.dumb. Since most shelve users will be using these backends (ndbm is only really relevant on macOS), I don't think this is worth worrying about.

@jtcave jtcave requested a review from corona10 July 26, 2023 15:46
@corona10
Copy link
Member

@jtcave Would you like to rebase the PR?

The clear method used to be implemented by inheriting a mix-in from the
MutableMapping ABC. It was a poor fit for shelves, and a better
implementation is now in place
Because pythongh-107089 is peculiar to implementation details of dbm objects,
it would be less disruptive to implement it in the DbfilenameShelf
class, which is used for calls to shelve.open. Since it is known that
the backing object is specifically one of the dbm objects, its clear
method (see pythongh-107122) can be used with no fallback code.
Copy link
Member

@corona10 corona10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@corona10 corona10 merged commit 810d5d8 into python:main Jul 29, 2023
19 checks passed
@jtcave jtcave deleted the shelve-clear branch August 2, 2023 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants