-
Notifications
You must be signed in to change notification settings - Fork 555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup of Turtle parse. Mostly micro-optimisations. #1266
Conversation
rdflib/plugins/parsers/notation3.py
Outdated
@@ -1570,8 +1584,8 @@ def strconst(self, argstr, i, delim): | |||
if ch == delim1: | |||
j = i | |||
continue | |||
elif ch in ('"', "'") and ch != delim1: | |||
ustr = ustr + ch | |||
elif ch in "\"'" and ch != delim1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old version (check if char in tuple) vs new version (check if char in str) is faster and more readable imho.
in {'"', "'"}
should be even faster ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FlorianLudwig it is in {'"', "'"}
or should it be in ['"', "'"]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi
I pushed ['"', "'"] to my repository in "master" branch.
Do you want a PR ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd'}")
0.031824612990021706
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd']")
0.03463121401728131
>>> timeit.timeit("'a' in 'abcd'")
0.0507955830253195
The difference is not big, so using a set or a list does not make a big difference.
But I did mean {
and not [
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am surprised that a string search is not faster as I imagined that it would be implemented with strchr(). Here is what I have on Python 3.6, Windows 7, the trend is confirmed:
>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd'}")
0.03046370870097803
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd']")
0.028151531781134054
>>> timeit.timeit("'a' in 'abcd'")
0.031605845198640736
>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'}")
0.03088098463007327
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']")
0.027731048504392675
>>> timeit.timeit("'a' in 'abcdefghijklmnopqrstuvwxyz'")
0.033609475274715805
There are in notation3.py , many similar tests, like:
if c in "0123456789-+.":
if c not in _notNameChars:
Maybe we could do a similar change ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the change in my fork of this single-line:
rchateauneu/rdflib@9652eca...5c6e279
It is ready for a PR if you wish so. It is a very simple change so if sets are preferred, I can do it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consequently change such constructs to lists and then to sets. (lists very much slower for me, sets slightly faster then strings)
I did this comparison, as just one timing to get the first element is not representative. If this is not:
https://gist.github.com/white-gecko/1a787e1911c479b3a13abc3e7103e7f7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason for the differences we see might be do to the way timeit is used. While the in
check might be faster with set
, the construction of the set
is slower than for a string.
As far as i remember (!): In the real code, python does optimize the creation when executed as part of a module - but not when executing a string with timeit since it's not a module.
So I added another example go @white-gecko gist - maybe @rchateauneu can test this one as well :)
Btw with those micro optimizations it's best to have nothing else running to get proper results...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, one aspect to consider - which is very theoretical indeed, but still - is that Cython, when working on bytes, and with specific hints, might sometimes be able to handle Python chars and strings, just like C-language strings.
To be completely honest, I do not have a clear answer in the general case, especially here where these strings/lists/sets are short.
I did a little test with dis, which gives a hint about the bytecode (which is not necessarily the actual execution time):
import dis
def func1(a):
return x in "abc"
def func2(a):
return x in ["a", "b", "c"]
def func3(a):
return x in {"a", "b", "c"}
for func in [func1, func2, func3]:
bytecode = dis.Bytecode(func)
print(func.__name__)
for instr in bytecode:
print(" ", instr.opname)
The result is:
func1
LOAD_GLOBAL
LOAD_CONST
COMPARE_OP
RETURN_VALUE
func2
LOAD_GLOBAL
LOAD_CONST
COMPARE_OP
RETURN_VALUE
func3
LOAD_GLOBAL
LOAD_CONST
COMPARE_OP
RETURN_VALUE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall - this is a really small wins we are talking about here. The big wins come from optimizations of the algorithm. And @rchateauneu found awesome wins there (e.g. replacing a big loop with .find
:))
While I started this discussion and it is quite interesting - I would merge any version. It doesn't actually matter in the big picture.
Readability is something to consider though.
Thank you for splitting up the merge request! This one looks good to me |
on master
with these fixes
So around 20% faster in this case 👍 This is a very good result. What concerns me, is that although the tests are passing some newly introduced lined are not covered. https://coveralls.io/builds/37530950/source?filename=rdflib%2Fplugins%2Fparsers%2Fnotation3.py |
@rchateauneu why did you merge the cython setup into this pull-request? It would be better, if we can deal with these micro-optimizations and the cython setup in #1269 separately. I even think we can merge this pull-request faster, if you can reset this pull-request to eefaa37 and maybe cherry-pick 3597d55 to it, if it brings significant improvement. The optimizations to the store should be part of an additional pull-request. |
@white-gecko I created a new branch and sent a PR for: ... with only these two commits: As suggested, I also created a new branch (and PR) for the change in memory.py: It contains a single commit: |
See: #1272 |
Contributes to fixing #1261
Proposed Changes