Speedup of Turtle parse. Mostly micro-optimisations. #1266

rchateauneu · 2021-02-28T21:52:34Z

Contributes to fixing #1261

Proposed Changes

Minor changes to the code to optimize execution time

coveralls · 2021-02-28T22:08:01Z

Coverage increased (+0.1%) to 75.526% when pulling 2af7553 on rchateauneu:master into 9f0f296 on RDFLib:master.

FlorianLudwig · 2021-03-01T08:32:12Z

rdflib/plugins/parsers/notation3.py

@@ -1570,8 +1584,8 @@ def strconst(self, argstr, i, delim):
            if ch == delim1:
                j = i
                continue
-            elif ch in ('"', "'") and ch != delim1:
-                ustr = ustr + ch
+            elif ch in "\"'" and ch != delim1:


The old version (check if char in tuple) vs new version (check if char in str) is faster and more readable imho.

in {'"', "'"} should be even faster ;)

@FlorianLudwig it is in {'"', "'"} or should it be in ['"', "'"]?

Hi
I pushed ['"', "'"] to my repository in "master" branch.
Do you want a PR ?

@white-gecko

>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd'}") 0.031824612990021706 >>> timeit.timeit("'a' in ['a', 'b', 'c', 'd']") 0.03463121401728131 >>> timeit.timeit("'a' in 'abcd'") 0.0507955830253195

The difference is not big, so using a set or a list does not make a big difference.

But I did mean { and not [

Yes, I am surprised that a string search is not faster as I imagined that it would be implemented with strchr(). Here is what I have on Python 3.6, Windows 7, the trend is confirmed:

>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd'}")
0.03046370870097803
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd']")
0.028151531781134054
>>> timeit.timeit("'a' in 'abcd'")
0.031605845198640736
>>> timeit.timeit("'a' in {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'}")
0.03088098463007327
>>> timeit.timeit("'a' in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']")
0.027731048504392675
>>> timeit.timeit("'a' in 'abcdefghijklmnopqrstuvwxyz'")
0.033609475274715805

There are in notation3.py , many similar tests, like:

if c in "0123456789-+.":
if c not in _notNameChars:

Maybe we could do a similar change ?

@white-gecko

I did the change in my fork of this single-line:
rchateauneu/rdflib@9652eca...5c6e279

It is ready for a PR if you wish so. It is a very simple change so if sets are preferred, I can do it too.

I did consequently change such constructs to lists and then to sets. (lists very much slower for me, sets slightly faster then strings)
I did this comparison, as just one timing to get the first element is not representative. If this is not:
https://gist.github.com/white-gecko/1a787e1911c479b3a13abc3e7103e7f7

One reason for the differences we see might be do to the way timeit is used. While the in check might be faster with set, the construction of the set is slower than for a string.

As far as i remember (!): In the real code, python does optimize the creation when executed as part of a module - but not when executing a string with timeit since it's not a module.

So I added another example go @white-gecko gist - maybe @rchateauneu can test this one as well :)

Btw with those micro optimizations it's best to have nothing else running to get proper results...

Also, one aspect to consider - which is very theoretical indeed, but still - is that Cython, when working on bytes, and with specific hints, might sometimes be able to handle Python chars and strings, just like C-language strings.
To be completely honest, I do not have a clear answer in the general case, especially here where these strings/lists/sets are short.

I did a little test with dis, which gives a hint about the bytecode (which is not necessarily the actual execution time):

import dis def func1(a): return x in "abc" def func2(a): return x in ["a", "b", "c"] def func3(a): return x in {"a", "b", "c"} for func in [func1, func2, func3]: bytecode = dis.Bytecode(func) print(func.__name__) for instr in bytecode: print(" ", instr.opname)

The result is:

func1 LOAD_GLOBAL LOAD_CONST COMPARE_OP RETURN_VALUE func2 LOAD_GLOBAL LOAD_CONST COMPARE_OP RETURN_VALUE func3 LOAD_GLOBAL LOAD_CONST COMPARE_OP RETURN_VALUE

overall - this is a really small wins we are talking about here. The big wins come from optimizations of the algorithm. And @rchateauneu found awesome wins there (e.g. replacing a big loop with .find :))

While I started this discussion and it is quite interesting - I would merge any version. It doesn't actually matter in the big picture.

Readability is something to consider though.

FlorianLudwig · 2021-03-01T08:34:20Z

Thank you for splitting up the merge request!

This one looks good to me

white-gecko · 2021-03-01T13:25:52Z

on master

time python import.py wordnet.ttl ttl
python import.py wordnet.ttl ttl  208,93s user 2,40s system 99% cpu 3:31,53 total

with these fixes

$ time python import.py wordnet.ttl ttl
python import.py wordnet.ttl ttl  166,64s user 2,99s system 99% cpu 2:50,03 total

So around 20% faster in this case 👍 This is a very good result.

What concerns me, is that although the tests are passing some newly introduced lined are not covered.
Maybe somebody is willing to contribute test cases so we can be sure, that there is no degradation of the correctness.

https://coveralls.io/builds/37530950/source?filename=rdflib%2Fplugins%2Fparsers%2Fnotation3.py

rdflib/plugins/parsers/notation3.py

…etup_cython

white-gecko · 2021-03-03T11:01:56Z

@rchateauneu why did you merge the cython setup into this pull-request? It would be better, if we can deal with these micro-optimizations and the cython setup in #1269 separately. I even think we can merge this pull-request faster, if you can reset this pull-request to eefaa37 and maybe cherry-pick 3597d55 to it, if it brings significant improvement. The optimizations to the store should be part of an additional pull-request.

rchateauneu · 2021-03-03T13:23:17Z

@white-gecko
My mistake for adding Cython.

I created a new branch and sent a PR for:
https://github.com/rchateauneu/rdflib/tree/speedup_notation3_only

... with only these two commits:
cherry picked from commit 9652eca
cherry picked from commit 3597d55

As suggested, I also created a new branch (and PR) for the change in memory.py:
https://github.com/RDFLib/rdflib/compare/master...rchateauneu:memory_only?expand=1

It contains a single commit:
cherry picked from commit 85e1cfe

white-gecko · 2021-03-05T12:45:34Z

See: #1272

rchateauneu added 2 commits February 28, 2021 09:44

Cython in setup.

4b1ad06

Speedup.

9652eca

FlorianLudwig reviewed Mar 1, 2021

View reviewed changes

FlorianLudwig approved these changes Mar 1, 2021

View reviewed changes

Cleaner code.

5c6e279

white-gecko mentioned this pull request Mar 1, 2021

Speedup with Cython and Turtle parsing. #1263

Closed