Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup of Turtle parse. Mostly micro-optimisations. #1266

Closed
wants to merge 11 commits into from
Closed

Speedup of Turtle parse. Mostly micro-optimisations. #1266

wants to merge 11 commits into from

Conversation

rchateauneu
Copy link
Contributor

@rchateauneu rchateauneu commented Feb 28, 2021

Contributes to fixing #1261

Proposed Changes

  • Minor changes to the code to optimize execution time

@coveralls
Copy link

coveralls commented Feb 28, 2021

Coverage Status

Coverage increased (+0.1%) to 75.526% when pulling 2af7553 on rchateauneu:master into 9f0f296 on RDFLib:master.

@@ -1570,8 +1584,8 @@ def strconst(self, argstr, i, delim):
if ch == delim1:
j = i
continue
elif ch in ('"', "'") and ch != delim1:
ustr = ustr + ch
elif ch in "\"'" and ch != delim1:
Copy link
Contributor

@FlorianLudwig FlorianLudwig Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old version (check if char in tuple) vs new version (check if char in str) is faster and more readable imho.

in {'"', "'"} should be even faster ;)

Copy link
Member

@white-gecko white-gecko Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FlorianLudwig it is in {'"', "'"} or should it be in ['"', "'"]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi
I pushed ['"', "'"] to my repository in "master" branch.
Do you want a PR ?

Copy link
Contributor

@FlorianLudwig FlorianLudwig Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@white-gecko

>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd'}")
0.031824612990021706
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd']")
0.03463121401728131
>>> timeit.timeit("'a' in 'abcd'")
0.0507955830253195

The difference is not big, so using a set or a list does not make a big difference.

But I did mean { and not [

Copy link
Contributor Author

@rchateauneu rchateauneu Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am surprised that a string search is not faster as I imagined that it would be implemented with strchr(). Here is what I have on Python 3.6, Windows 7, the trend is confirmed:

>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd'}")
0.03046370870097803
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd']")
0.028151531781134054
>>> timeit.timeit("'a' in 'abcd'")
0.031605845198640736
>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'}")
0.03088098463007327
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']")
0.027731048504392675
>>> timeit.timeit("'a' in 'abcdefghijklmnopqrstuvwxyz'")
0.033609475274715805

There are in notation3.py , many similar tests, like:

if c in "0123456789-+.":
if c not in _notNameChars:

Maybe we could do a similar change ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@white-gecko

I did the change in my fork of this single-line:
rchateauneu/rdflib@9652eca...5c6e279

It is ready for a PR if you wish so. It is a very simple change so if sets are preferred, I can do it too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consequently change such constructs to lists and then to sets. (lists very much slower for me, sets slightly faster then strings)
I did this comparison, as just one timing to get the first element is not representative. If this is not:
https://gist.github.com/white-gecko/1a787e1911c479b3a13abc3e7103e7f7

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason for the differences we see might be do to the way timeit is used. While the in check might be faster with set, the construction of the set is slower than for a string.

As far as i remember (!): In the real code, python does optimize the creation when executed as part of a module - but not when executing a string with timeit since it's not a module.

So I added another example go @white-gecko gist - maybe @rchateauneu can test this one as well :)

Btw with those micro optimizations it's best to have nothing else running to get proper results...

Copy link
Contributor Author

@rchateauneu rchateauneu Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, one aspect to consider - which is very theoretical indeed, but still - is that Cython, when working on bytes, and with specific hints, might sometimes be able to handle Python chars and strings, just like C-language strings.
To be completely honest, I do not have a clear answer in the general case, especially here where these strings/lists/sets are short.

I did a little test with dis, which gives a hint about the bytecode (which is not necessarily the actual execution time):

import dis

def func1(a):
    return x in "abc"
    
def func2(a):
    return x in ["a", "b", "c"]
    
def func3(a):
    return x in {"a", "b", "c"}
    
for func in [func1, func2, func3]:
    bytecode = dis.Bytecode(func)
    print(func.__name__)
    for instr in bytecode:
        print("    ", instr.opname)

The result is:

func1
     LOAD_GLOBAL
     LOAD_CONST
     COMPARE_OP
     RETURN_VALUE
func2
     LOAD_GLOBAL
     LOAD_CONST
     COMPARE_OP
     RETURN_VALUE
func3
     LOAD_GLOBAL
     LOAD_CONST
     COMPARE_OP
     RETURN_VALUE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall - this is a really small wins we are talking about here. The big wins come from optimizations of the algorithm. And @rchateauneu found awesome wins there (e.g. replacing a big loop with .find :))

While I started this discussion and it is quite interesting - I would merge any version. It doesn't actually matter in the big picture.

Readability is something to consider though.

@FlorianLudwig
Copy link
Contributor

Thank you for splitting up the merge request!

This one looks good to me

@white-gecko
Copy link
Member

on master

time python import.py wordnet.ttl ttl
python import.py wordnet.ttl ttl  208,93s user 2,40s system 99% cpu 3:31,53 total

with these fixes

$ time python import.py wordnet.ttl ttl
python import.py wordnet.ttl ttl  166,64s user 2,99s system 99% cpu 2:50,03 total

So around 20% faster in this case 👍 This is a very good result.

What concerns me, is that although the tests are passing some newly introduced lined are not covered.
Maybe somebody is willing to contribute test cases so we can be sure, that there is no degradation of the correctness.

https://coveralls.io/builds/37530950/source?filename=rdflib%2Fplugins%2Fparsers%2Fnotation3.py

@white-gecko
Copy link
Member

@rchateauneu why did you merge the cython setup into this pull-request? It would be better, if we can deal with these micro-optimizations and the cython setup in #1269 separately. I even think we can merge this pull-request faster, if you can reset this pull-request to eefaa37 and maybe cherry-pick 3597d55 to it, if it brings significant improvement. The optimizations to the store should be part of an additional pull-request.

@rchateauneu
Copy link
Contributor Author

rchateauneu commented Mar 3, 2021

@white-gecko
My mistake for adding Cython.

I created a new branch and sent a PR for:
https://github.com/rchateauneu/rdflib/tree/speedup_notation3_only

... with only these two commits:
cherry picked from commit 9652eca
cherry picked from commit 3597d55

As suggested, I also created a new branch (and PR) for the change in memory.py:
https://github.com/RDFLib/rdflib/compare/master...rchateauneu:memory_only?expand=1

It contains a single commit:
cherry picked from commit 85e1cfe

@white-gecko
Copy link
Member

See: #1272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants