This issue was moved to a discussion.

why the loading of files is so slow? #1261

Closed
m1ci opened this issue Feb 25, 2021 · 3 comments

m1ci commented Feb 25, 2021

Hi,
I am loading a 165 MB file and the loading process takes ages. I would expect it to load in a matter of seconds.
Am I doing something wrong, or is this normal?
Here is my code, which is very simple:

import rdflib
g = rdflib.Graph()
g.load('data.ttl', format="turtle");

white-gecko commented Feb 25, 2021

The file is from: https://en-word.net/static/english-wordnet-2020.ttl.gz

In my setup it took 4 minutes

time python import.py
python import.py  235,80s user 3,08s system 99% cpu 4:00,02 total

I think our turtle parser is not very good.

To make some testing easier I've changed the code to:

import sys
import rdflib
g = rdflib.Graph()
g.parse(sys.argv[1], format=sys.argv[2])
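
For measuring only the parse call from inside Python (so that interpreter start-up and `import rdflib` are not counted), a small wall-clock helper can be wrapped around the call. This is a minimal sketch, not part of rdflib; the name `time_call` and its `repeat` parameter are made up for illustration.

```python
import time


def time_call(fn, *args, repeat=3, **kwargs):
    """Call fn `repeat` times; return (last result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        # Keep the fastest run, which is least disturbed by other processes.
        best = min(best, time.perf_counter() - start)
    return result, best


if __name__ == "__main__":
    # Stand-in workload; replace with the parse call you want to measure.
    total, seconds = time_call(sum, range(1_000_000))
    print(f"result={total} best={seconds:.4f}s")
```

With rdflib installed, the workload would presumably be something like `time_call(lambda: rdflib.Graph().parse(path, format=fmt), repeat=1)`; for a parse that takes minutes, a single repetition is the practical choice.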

Also I converted the file with rapper (raptor2-utils) to n-triples

$ time rapper -i turtle wordnet.ttl > wordnet.nt
rapper: Parsing URI file:///tmp/milan/wordnet.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Parsing returned 3103323 triples
rapper -i turtle wordnet.ttl > wordnet.nt  10,66s user 0,98s system 99% cpu 11,653 total
$ time rapper -i turtle -o turtle wordnet.ttl > wordnet_rapper.ttl
rapper: Parsing URI file:///tmp/milan/wordnet.ttl with parser turtle
rapper: Serializing with serializer turtle
rapper: Parsing returned 3103323 triples
rapper -i turtle -o turtle wordnet.ttl > wordnet_rapper.ttl  32,84s user 0,84s system 99% cpu 33,712 total

(The last conversion is just from turtle to turtle, to see if there is any difference between the original serialization and the rapper serialization)

And compared parsing of the n-triples file with the turtle parser as well as with the n-triples parser

$ time python import.py wordnet.nt nt
python import.py wordnet.nt nt  140,03s user 4,35s system 97% cpu 2:28,76 total
$ time python import.py wordnet.nt ttl
python import.py wordnet.nt ttl  320,25s user 3,22s system 99% cpu 5:24,02 total
$ time python import.py wordnet_rapper.ttl ttl 
python import.py wordnet_rapper.ttl ttl  212,88s user 2,66s system 99% cpu 3:35,73 total
$ time python import.py wordnet.ttl ttl       
python import.py wordnet.ttl ttl  231,11s user 2,83s system 99% cpu 3:54,22 total

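So we can see a lot of variation in the run times depending on the parser and the input serialization: the n-triples parser is the fastest, and the turtle parser takes more than twice as long on the same n-triples file.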
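
To put the timings above side by side, the user-CPU seconds can be turned into a throughput and a slowdown relative to rapper. This is only arithmetic on the numbers already posted (decimal commas converted to points); note that the rapper figure includes serializing to n-triples, so the comparison is rough. The helper name `summarize` is made up for illustration.

```python
# User-CPU seconds copied from the `time` output above.
TRIPLES = 3_103_323  # triple count reported by rapper

runs = {
    "rapper turtle -> ntriples": 10.66,
    "rdflib nt parser, wordnet.nt": 140.03,
    "rdflib turtle parser, wordnet.nt": 320.25,
    "rdflib turtle parser, wordnet_rapper.ttl": 212.88,
    "rdflib turtle parser, wordnet.ttl": 231.11,
}


def summarize(runs, triples, baseline):
    """Map each run to (triples per second, slowdown relative to the baseline run)."""
    base = runs[baseline]
    return {name: (triples / secs, secs / base) for name, secs in runs.items()}


summary = summarize(runs, TRIPLES, "rapper turtle -> ntriples")
for name, (rate, slowdown) in summary.items():
    print(f"{name:42s} {rate:10.0f} triples/s  {slowdown:5.1f}x")
```

On these numbers the rdflib runs come out at roughly 13 to 30 times the rapper time.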

Maybe somebody wants to do this comparison with Cython (#1250).

white-gecko commented

After doing some profiling, I suspect the uri_ref2 and skipSpace methods in notation3.py are the bottleneck:

def uri_ref2(self, argstr, i, res):

def skipSpace(self, argstr, i):
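
The profiling step can be reproduced with the standard library's `cProfile` and `pstats`. The sketch below is an assumption about how one might do it, not the commands actually used in this thread; `profile_call`, `skip_space_demo` and `count_tokens_demo` are hypothetical names, and the two demo functions are only a stand-in for a character-by-character scanner, not rdflib code.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, limit=10, **kwargs):
    """Run fn under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buf.getvalue()


def skip_space_demo(text, i):
    """Hypothetical stand-in: advance i past whitespace, one character at a time."""
    while i < len(text) and text[i].isspace():
        i += 1
    return i


def count_tokens_demo(text):
    """Hypothetical stand-in: count whitespace-separated tokens."""
    i, count = 0, 0
    while True:
        i = skip_space_demo(text, i)
        if i >= len(text):
            return count
        while i < len(text) and not text[i].isspace():
            i += 1
        count += 1


if __name__ == "__main__":
    tokens, report = profile_call(count_tokens_demo, "a b  c " * 1000)
    print(tokens)
    print(report)
```

With rdflib installed, the call to profile would presumably be `profile_call(rdflib.Graph().parse, "wordnet.ttl", format="turtle")`; functions that are called millions of times with small per-call cost, as `skip_space_demo` is here, tend to stand out in the call-count and cumulative-time columns.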

rchateauneu commented Feb 26, 2021

@white-gecko

Yes absolutely. Please see #1262

I have committed changes to notation3.py in my branch; I can send a pull request if you want.
To https://github.com/rchateauneu/rdflib.git
4289f0e..860f985 master -> master

ghost locked and limited conversation to collaborators Dec 25, 2021
ghost converted this issue into discussion #1544 Dec 25, 2021
