Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup __add_triple_context. #1271

Merged
merged 3 commits into from
Mar 9, 2021
Merged

Speedup __add_triple_context. #1271

merged 3 commits into from
Mar 9, 2021

Conversation

rchateauneu
Copy link
Contributor

@rchateauneu rchateauneu commented Mar 3, 2021

Related to #1261.

(cherry picked from commit 85e1cfe)

This eliminates between two and five lookups in the container self.__tripleContexts (depending on the execution) by storing its reference in the variable triple_context = self.__tripleContexts[triple], and reusing it.

(cherry picked from commit 85e1cfe)
@coveralls
Copy link

Coverage Status

Coverage increased (+0.07%) to 75.48% when pulling 8086068 on rchateauneu:memory_only into 9f0f296 on RDFLib:master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage increased (+0.07%) to 75.48% when pulling 8086068 on rchateauneu:memory_only into 9f0f296 on RDFLib:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.07%) to 75.48% when pulling 8086068 on rchateauneu:memory_only into 9f0f296 on RDFLib:master.

@coveralls
Copy link

coveralls commented Mar 3, 2021

Coverage Status

Coverage increased (+0.08%) to 75.492% when pulling 53164b2 on rchateauneu:memory_only into 9f0f296 on RDFLib:master.

Copy link
Member

@nicholascar nicholascar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing all tests and seems to make sense to me

if self.__tripleContexts[triple] == self.__defaultContexts:
del self.__tripleContexts[triple]
if triple_context == self.__defaultContexts:
del triple_context
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this delete the context from self.__tripleContexts? It appears to me, that del triple_context just does nothing.

Copy link
Contributor Author

@rchateauneu rchateauneu Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it strongly seems you are right.

>>> x={"a","b"}
>>> y=x
>>> del y``
>>> x``
set(['a', 'b'])``
>>> y``
Traceback (most recent call last):``
File "<stdin>", line 1, in <module>``
NameError: name 'y' is not defined``

It should probably be:

del self.__tripleContexts[triple]

The nasty thing is that it does not appear in the results.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the fix in eb5e025

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right, thanks @white-gecko

@white-gecko
Copy link
Member

For the wordnet file (cf. #1261) I could see an improvement of 2% for the total time and for some file of mine with just 807682 triples it was 4% faster (the lower table). My setup was not dedicated to performing the test but ran daily things like video conference next to the tests, this explains the high variance. The measurements were executed alternating (once improved, then master, then again improved, …) to provide equal chances.

branch user system cpu total
improved 244,75s 7,96s 99,00 % 04:14,95
improved 263,58s 3,91s 99,00 % 04:28,22
improved 240,33s 3,85s 99,00 % 04:04,50
improved 226,38s 3,04s 99,00 % 03:49,66
improved 224,06s 2,69s 99,00 % 03:46,95
improved 220,00s 2,74s 99,00 % 03:43,05
improved 213,12s 2,37s 99,00 % 03:35,66
         
master 266,15s 3,06s 99,00 % 04:29,43
master 291,91s 3,39s 99,00 % 04:55,52
master 241,72s 2,91s 99,00 % 04:04,85
master 222,25s 2,52s 99,00 % 03:45,02
master 208,31s 2,28s 99,00 % 03:30,64
master 225,60s 2,14s 99,00 % 03:47,79
master 218,31s 5,93s 99,00 % 03:44,31
branch user system cpu total
improved 60,59s 0,96s 99,00 % 01:01,60
improved 60,04s 0,88s 99,00 % 01:00,93
master 64,03s 0,68s 99,00 % 01:04,73
master 62,26s 0,65s 99,00 % 01:03,06

So if the code is corrected, still this is missing: #1271 (comment), it can be merged.

@white-gecko
Copy link
Member

The tests fail now, I can reproduce this locally, tests pass for 8086068 but fail for eb5e025. Looking at the code eb5e025 should be correct, but … @ashleysommer can you explain us, what is happening?

@rchateauneu
Copy link
Contributor Author

The tests fail now, I can reproduce this locally, tests pass for 8086068 but fail for eb5e025. Looking at the code eb5e025 should be correct, but … @ashleysommer can you explain us, what is happening?

It looks promising performance-wise but if you do not mind, I will restart from scratch to understand where this failures come from. What do you think ? Could we put this PR in draft mode until it is properly tested and understood ?

@white-gecko white-gecko marked this pull request as draft March 8, 2021 10:52
@white-gecko
Copy link
Member

I've converted it to draft mode, as requested. But as it seems the test pass now. Raising KeyError instead of IndexError might be part of understanding the situation. So you agree @rchateauneu ?
@nicholascar does your approval still hold? @ashleysommer do you want to take a look?

@rchateauneu
Copy link
Contributor Author

KeyError instead of IndexError

Yes. This could not work.

@ashleysommer
Copy link
Contributor

Hi, I'm looking into this now.
On the surface this looks like a great change. And after fixing the KeyError mixup, looks like tests are passing normally.

I did have in mind to try this kind of optimization when I wrote the new MemoryStore implementation, but I didn't get around to it.
Thanks @rchateauneu this is great work.

@ashleysommer
Copy link
Contributor

KeyError instead of IndexError

Lately I've been using LookupError it catches both KeyError and IndexError, and means you don't have to think about or guess what kind of structure you're accessing.

@rchateauneu
Copy link
Contributor Author

I have done more tests to check the memory behaviour using tracemalloc. The test script, using english-wordnet-2020.ttl, is attached:
import.py.txt

The result are identical, with about three runs each. Surprisingly, the same code might give slightly different number if the machine has a different load.

NEW VERSION SHA-1: 53164b2168e950364ee704c1ace9bb6cfd7817f5 (branch memory_only)

#1: rdflib/store.py:216: 866369.5 KiB
    self.dispatcher.dispatch(TripleAddedEvent(triple=triple, context=context))
#2: rdflib/term.py:236: 702543.7 KiB
    rt = str.__new__(cls, value)
#3: stores/memory.py:257: 395230.7 KiB
    p = sp[subject] = {}
282 other: 1763466.8 KiB
Total allocated size: 3727610.6 KiB

ORIGINAL SHA-1: dca850e9f649689b73c685148f7f01d7448bf4d8 (branch master)

#1: rdflib/store.py:216: 866369.5 KiB
    self.dispatcher.dispatch(TripleAddedEvent(triple=triple, context=context))
#2: rdflib/term.py:236: 702543.7 KiB
    rt = str.__new__(cls, value)
#3: stores/memory.py:257: 395230.7 KiB
    p = sp[subject] = {}
282 other: 1763466.7 KiB
Total allocated size: 3727610.6 KiB

@white-gecko white-gecko marked this pull request as ready for review March 9, 2021 18:34
@white-gecko white-gecko merged commit 83a6fda into RDFLib:master Mar 9, 2021
@white-gecko
Copy link
Member

white-gecko commented Mar 9, 2021

Now we are done :-) Thank you.

@white-gecko white-gecko added this to the rdflib 6.0.0 milestone Mar 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants