Speedup NTriples parser #1297

white-gecko · 2021-04-28T09:32:57Z

After a successful speedup of the Turtle parser (#1266), we should see how we can speedup the the n-triples parser as well.

rchateauneu · 2021-04-29T08:35:57Z

Is there a specific test data file on which we could focus, as a benchmark, please ?

white-gecko · 2021-04-30T13:03:24Z

Sure you can use the orkg.nt from: https://github.com/AKSW/orkg-dump and you can use the wordnet file https://en-word.net/static/english-wordnet-2020.ttl.gz and convert it with rapper to n-triples. You can also use this one: http://dbpedia-mappings.tib.eu/databus-repo/kurzum/cleaned-data/geonames/2018.03.11/geonames_all.nt.bz2 (from https://databus.dbpedia.org/kurzum/cleaned-data/geonames/)

I came to the following run times on my system with the rdflib and the orkg file and as comparison with the raptor utils (which are compiled c code ;-) )

$ time python import.py orkg.nt nt
python import.py orkg.nt nt  22,37s user 0,46s system 99% cpu 22,870 total
$ time python import.py orkg.nt ttl
python import.py orkg.nt ttl  33,10s user 0,42s system 99% cpu 33,524 total
$ time rapper -i ntriples -o ntriples orkg.nt > /dev/null
rapper: Parsing URI file:///tmp/play/orkg.nt with parser ntriples
rapper: Serializing with serializer ntriples
rapper: Parsing returned 380647 triples
rapper -i ntriples -o ntriples orkg.nt > /dev/null  1,35s user 0,04s system 99% cpu 1,390 total
$ time rapper -i ntriples -o turtle orkg.nt > /dev/null
rapper: Parsing URI file:///tmp/play/orkg.nt with parser ntriples
rapper: Serializing with serializer turtle
rapper: Parsing returned 380647 triples
rapper -i ntriples -o turtle orkg.nt > /dev/null  3,80s user 0,05s system 99% cpu 3,848 total

rchateauneu · 2021-05-02T06:45:57Z

There are possible speedups, for example by using a single regular expression to parse the input triples or quads into their three and four individual components respectively.

However, before a pull-request, I would like to confirm two points about opening and reading text files.

(1) This function opens a file in binary mode:

def _create_input_source_from_location(file, format, input_source, location):
...
    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "rb")

Later on, this implies a conversion from _io.BufferedReader with codecs.getreader("utf-8"). This conversion is needed because all regular expressions and further processing are on Unicode, not bytes. In a simple test which reads a nt files with Graph.parse(), codecs.readline() costs at least 10% of the whole time. I have tested replacing "rb" by "r" and there is a visible speedup.

(2) At the moment, W3CNTriplesParser.readline() does not use a plain readline() apparently because of terminator "\n\r", "\n", "r". But Python documentation (Py3 at least) says something which is what we want : https://docs.python.org/3/library/io.html

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

To conclude, my intention is these two changes, on top of using less regular expressions:

In _create_input_source_from_location(), open text files in text mode "r" instead of "rb" (A little tweak is needed because of BOM, see test/test_n3.py::TestN3Case::testIssue156 ).
Use a simple file readline(), eliminating the need to internally store the file buffer content, instead of several reads() and an internal buffer to store the content.

The impact might be that some tests could be broken, and there might be Python 2 incompatibilities. But the code is more natural and faster I think. Please note that these two changes can be done separately, in other branches.

What is please your opinion about these changes ?

Many thanks in advance.

white-gecko · 2021-05-06T17:11:15Z

That sounds very good. The change towards plain open should also be related to #1222 . Would you recommend using StringIO or should it be something different?

rchateauneu · 2021-05-06T18:22:10Z

Your suggested change simplifies a lot the code and notably removes a lot of conversions. Does it run the tests and do you have an idea of the performance speedup ?
Because it does not change the overall structure of the code, it would be great to merge it soon, if possible.

rchateauneu · 2021-05-21T07:29:21Z

Indeed, there might be interferences with #1222 and also #1276

However, there are some optimizations in these independent PRs , ready for merging:

ghost locked and limited conversation to collaborators Dec 25, 2021

ghost converted this issue into discussion #1558 Dec 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

Speedup NTriples parser #1297

Speedup NTriples parser #1297

white-gecko commented Apr 28, 2021

rchateauneu commented Apr 29, 2021

white-gecko commented Apr 30, 2021

rchateauneu commented May 2, 2021 •

edited

Loading

white-gecko commented May 6, 2021

rchateauneu commented May 6, 2021

rchateauneu commented May 21, 2021

This issue was moved to a discussion.

This issue was moved to a discussion.

Speedup NTriples parser #1297

Speedup NTriples parser #1297

Comments

white-gecko commented Apr 28, 2021

rchateauneu commented Apr 29, 2021

white-gecko commented Apr 30, 2021

rchateauneu commented May 2, 2021 • edited Loading

white-gecko commented May 6, 2021

rchateauneu commented May 6, 2021

rchateauneu commented May 21, 2021

This issue was moved to a discussion.

rchateauneu commented May 2, 2021 •

edited

Loading