This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Speedup NTriples parser #1297
Comments
Is there a specific test data file that we could focus on as a benchmark?
Sure, you can use orkg.nt from https://github.com/AKSW/orkg-dump. You can also take the WordNet file https://en-word.net/static/english-wordnet-2020.ttl.gz and convert it to N-Triples with rapper, or use this one: http://dbpedia-mappings.tib.eu/databus-repo/kurzum/cleaned-data/geonames/2018.03.11/geonames_all.nt.bz2 (from https://databus.dbpedia.org/kurzum/cleaned-data/geonames/). On my system I got the following run times for rdflib on the ORKG file and, as a comparison, for the Raptor utilities (which are compiled C code ;-) )
There are possible speedups, for example using a single regular expression to parse each input triple or quad into its three or four components. However, before opening a pull request, I would like to confirm two points about opening and reading text files. (1) This function opens a file in binary mode:
Later on, this requires wrapping the resulting _io.BufferedReader with codecs.getreader("utf-8"). The conversion is needed because all regular expressions and further processing operate on Unicode strings, not bytes. In a simple test that reads an .nt file with Graph.parse(), codecs.readline() accounts for at least 10% of the total time. I have tested replacing "rb" with "r" and there is a visible speedup. (2) At the moment, W3CNTriplesParser.readline() does not use a plain readline(), apparently because of the line terminators "\r\n", "\n" and "\r". But the Python documentation (Python 3 at least) describes exactly the behaviour we want: https://docs.python.org/3/library/io.html
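To illustrate point (2): in Python 3 text mode with the default `newline=None`, the `io` module translates `"\r\n"` and `"\r"` to `"\n"` on input, so plain line iteration already copes with all three terminators. The sketch below is only a minimal demonstration under that assumption; the sample data and the helper name `read_lines_text_mode` are made up for illustration and are not part of rdflib.

```python
import io

# Hypothetical sample data: three N-Triples statements, each ending
# with a different line terminator ("\r", "\r\n", "\n").
RAW = (
    b'<http://example.org/s1> <http://example.org/p> "a" .\r'
    b'<http://example.org/s2> <http://example.org/p> "b" .\r\n'
    b'<http://example.org/s3> <http://example.org/p> "c" .\n'
)


def read_lines_text_mode(raw):
    """Decode bytes as UTF-8 text with universal newlines and split into lines.

    io.TextIOWrapper with newline=None behaves like open(path, "r",
    encoding="utf-8"): every "\r\n" and "\r" is translated to "\n".
    """
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None)
    # After translation, only "\n" can terminate a line.
    return [line.rstrip("\n") for line in stream]


if __name__ == "__main__":
    for line in read_lines_text_mode(RAW):
        print(repr(line))
```

With a real file, `open(path, "r", encoding="utf-8")` would take the place of the `TextIOWrapper` over `BytesIO`; the in-memory stream is used here only to keep the example self-contained.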
To conclude, I intend to make these two changes, on top of using fewer regular expressions:
The impact might be that some tests break, and there might be Python 2 incompatibilities. But I think the code would be more natural and faster. Please note that these two changes can be made separately, in different branches. What is your opinion on these changes? Many thanks in advance.
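The single-regular-expression idea mentioned above could look roughly like the following. This is a simplified sketch, not rdflib's parser and not the full W3C N-Triples grammar: the pattern `TRIPLE_RE` and the function `parse_line` are hypothetical names, IRIs are matched loosely, and escape sequences are captured but not decoded.

```python
import re

# One pattern that splits an N-Triples line into subject, predicate and object.
# Assumption: a loose approximation of the grammar, for illustration only.
TRIPLE_RE = re.compile(
    r"""
    \s*
    (?P<subject> <[^<>\s]*> | _:\S+ )        # IRI or blank node
    \s+
    (?P<predicate> <[^<>\s]*> )              # IRI
    \s+
    (?P<object>
        <[^<>\s]*>                           # IRI
      | _:\S+                                # blank node
      | "(?:[^"\\]|\\.)*"                    # literal, backslash escapes allowed
        (?: @[A-Za-z]+(?:-[A-Za-z0-9]+)*     # optional language tag
          | \^\^<[^<>\s]*>                   # or optional datatype IRI
        )?
    )
    \s* \. \s*
    (?: \# .* )?                             # optional trailing comment
    $
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return (subject, predicate, object) as raw strings, or None if no match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    return match.group("subject"), match.group("predicate"), match.group("object")


if __name__ == "__main__":
    print(parse_line('<http://example.org/s> <http://example.org/p> "hi"@en .\n'))
```

Blank lines and comment-only lines simply return `None` here; a real parser would need to distinguish those from malformed input and would add a fourth group for quads.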
That sounds very good. The change towards a plain open() should also be related to #1222. Would you recommend using StringIO, or should it be something different?
Your suggested change simplifies the code a lot and, notably, removes many conversions. Does it pass the tests, and do you have an idea of the performance speedup?
After the successful speedup of the Turtle parser (#1266), we should see how we can speed up the N-Triples parser as well.