Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup NTriples parser #1297

Closed
white-gecko opened this issue Apr 28, 2021 · 6 comments
Closed

Speedup NTriples parser #1297

white-gecko opened this issue Apr 28, 2021 · 6 comments

Comments

@white-gecko
Copy link
Member

After a successful speedup of the Turtle parser (#1266), we should see how we can speedup the the n-triples parser as well.

@rchateauneu
Copy link
Contributor

Is there a specific test data file on which we could focus, as a benchmark, please ?

@white-gecko
Copy link
Member Author

Sure you can use the orkg.nt from: https://github.com/AKSW/orkg-dump and you can use the wordnet file https://en-word.net/static/english-wordnet-2020.ttl.gz and convert it with rapper to n-triples. You can also use this one: http://dbpedia-mappings.tib.eu/databus-repo/kurzum/cleaned-data/geonames/2018.03.11/geonames_all.nt.bz2 (from https://databus.dbpedia.org/kurzum/cleaned-data/geonames/)

I came to the following run times on my system with the rdflib and the orkg file and as comparison with the raptor utils (which are compiled c code ;-) )

$ time python import.py orkg.nt nt
python import.py orkg.nt nt  22,37s user 0,46s system 99% cpu 22,870 total
$ time python import.py orkg.nt ttl
python import.py orkg.nt ttl  33,10s user 0,42s system 99% cpu 33,524 total
$ time rapper -i ntriples -o ntriples orkg.nt > /dev/null
rapper: Parsing URI file:///tmp/play/orkg.nt with parser ntriples
rapper: Serializing with serializer ntriples
rapper: Parsing returned 380647 triples
rapper -i ntriples -o ntriples orkg.nt > /dev/null  1,35s user 0,04s system 99% cpu 1,390 total
$ time rapper -i ntriples -o turtle orkg.nt > /dev/null
rapper: Parsing URI file:///tmp/play/orkg.nt with parser ntriples
rapper: Serializing with serializer turtle
rapper: Parsing returned 380647 triples
rapper -i ntriples -o turtle orkg.nt > /dev/null  3,80s user 0,05s system 99% cpu 3,848 total

@rchateauneu
Copy link
Contributor

rchateauneu commented May 2, 2021

There are possible speedups, for example by using a single regular expression to parse the input triples or quads into their three and four individual components respectively.

However, before a pull-request, I would like to confirm two points about opening and reading text files.

(1) This function opens a file in binary mode:

def _create_input_source_from_location(file, format, input_source, location):
...
    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "rb")

Later on, this implies a conversion from _io.BufferedReader with codecs.getreader("utf-8"). This conversion is needed because all regular expressions and further processing are on Unicode, not bytes. In a simple test which reads a nt files with Graph.parse(), codecs.readline() costs at least 10% of the whole time. I have tested replacing "rb" by "r" and there is a visible speedup.

(2) At the moment, W3CNTriplesParser.readline() does not use a plain readline() apparently because of terminator "\n\r", "\n", "r". But Python documentation (Py3 at least) says something which is what we want : https://docs.python.org/3/library/io.html

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

To conclude, my intention is these two changes, on top of using less regular expressions:

  • In _create_input_source_from_location(), open text files in text mode "r" instead of "rb" (A little tweak is needed because of BOM, see test/test_n3.py::TestN3Case::testIssue156 ).
  • Use a simple file readline(), eliminating the need to internally store the file buffer content, instead of several reads() and an internal buffer to store the content.

The impact might be that some tests could be broken, and there might be Python 2 incompatibilities. But the code is more natural and faster I think. Please note that these two changes can be done separately, in other branches.

What is please your opinion about these changes ?

Many thanks in advance.

@white-gecko
Copy link
Member Author

That sounds very good. The change towards plain open should also be related to #1222 . Would you recommend using StringIO or should it be something different?

@rchateauneu
Copy link
Contributor

Your suggested change simplifies a lot the code and notably removes a lot of conversions. Does it run the tests and do you have an idea of the performance speedup ?
Because it does not change the overall structure of the code, it would be great to merge it soon, if possible.

@rchateauneu
Copy link
Contributor

Indeed, there might be interferences with #1222 and also #1276

However, there are some optimizations in these independent PRs , ready for merging:

@ghost ghost locked and limited conversation to collaborators Dec 25, 2021
@ghost ghost converted this issue into discussion #1558 Dec 25, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants