Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleIOHandler prints xsd datatype as an abreviated string, which is subtly off #54

Closed
JervenBolleman opened this issue May 2, 2024 · 10 comments

Comments

@JervenBolleman
Copy link

Dear BioPax developers,

On line 669 of the SimpleIOHandler the datatype print is slightly off.

Using a tool like rapper or riot to convert the output of the rdf/xml to n-triples, we see the difference (example from rhea-db)

<http://biopax.rhea-db.org/level3/57967_stoichiometry_right_3249> 
  <http://www.biopax.org/release/biopax-level3.owl#stoichiometricCoefficient> 
     "1.0"^^<xsd:float> .

was printed but we expect

<http://biopax.rhea-db.org/level3/57967_stoichiometry_right_3249>
  <http://www.biopax.org/release/biopax-level3.owl#stoichiometricCoefficient> 
    "1.0"^^<http://www.w3.org/2001/XMLSchema#float> .

Quite a lot of tools auto-correct this difference down stream. But if one uses a tool that is spec rdf/xml compliant this gives a difference when doing SPARQL values comparison. Changing from a numeric comparison to a string comparison.

We will open a PR with this fix soon.

Regards,
Jerven

@IgorRodchenkov
Copy link
Member

IgorRodchenkov commented May 3, 2024 via email

@JervenBolleman
Copy link
Author

I think it is a bug. The other tools are compliant in how they interpret the string, this also includes RDFlib and RDF4j. Why it does not show up as a bug for most people is because the the value "xsd:float" is also an IRI. Just not the IRI that one expects "http://www.w3.org/2001/XMLSchema#float". This is because rdf/xml uses xml infoset for iri expansion and that only works on elements and not on attributes (other than the xml:base setting which does not apply here).

Now this only matters if someone actually uses RDF tools like SPARQL. Which not many people might, but we do.

So currently the full iri of the datatype is just <xsd:float> not <http://www.w3.org/2001/XMLSchema#float>, this is an unknown non standard datatype which means sparql uses string comparison for ordering and worse. When writing a sparql query things don't match which on first sight should.

e.g. when writing a sparql query like this

PREFIX xsd:<http://www.w3.org/2001/XMLSchema#float>
PREFIX biopax3:<http://www.biopax.org/release/biopax-level3.owl#>
SELECT * 
WHERE {
 ?x biopax3:stoichiometricCoefficient "1.0"^^xsd:float .
}

fails to return with the current biopax file.
The below would work

PREFIX biopax3:<http://www.biopax.org/release/biopax-level3.owl#>
SELECT * 
WHERE {
 ?x biopax3:stoichiometricCoefficient "1.0"^^<xsd:float> .
}

but that is because

PREFIX xsd:<http://www.w3.org/2001/XMLSchema#float>
SELECT ?same
WHERE { BIND((<xdf:float> = xsd:float) AS ?same) }

I saw the comments to the pull request about this being super long and ugly, Which I understand and feel. In that case I think we can use an XML entity.

<?xml version="1.0"?>

<!DOCTYPE rdf [
<!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">
]>

...
 <bp:comment rdf:datatype="&xsd;string"
...

And because we live in the world of RDF 1.1 we can actually delete all the xsd:string attributes.

Of course you don't need to accept either option nor the pull request, I just hate opening issues in an important but underfunded project without trying to make life as easy as possible for the kind people maintaining it.

For us it is trivial to "patch" the files before publishing. Because all the other work in new updates in Paxtools are totally worth that small hassle. I just want to avoid someone else wondering about why their query doesn't work when it all looks right.

JervenBolleman added a commit to JervenBolleman/Paxtools that referenced this issue May 5, 2024
…ntities to make long IRI's for xsd datatype references without introducing a lot of bytes
@JervenBolleman
Copy link
Author

Wanted to add that you can see the effect using comunica click the execute button. And see two datatypes. One for the "" as given and one extracted from the hcyc.owl test file.

@IgorRodchenkov
Copy link
Member

IgorRodchenkov commented May 9, 2024

Oh, I get it - admit that I (re-)introduced this regression bug in 2023-07; it seems made it to Paxtools release v5.3.0 and v5.2.1 too (older v5.1.0 seems fine...)

I like the first PR #55 more (small fix), with comments.

@JervenBolleman
Copy link
Author

@IgorRodchenkov thank you! While not that important you might also want to pick up the part of the second patch I proposed regarding skipping doctype and entity declarations. It will give you more flexibility in the future.

@IgorRodchenkov
Copy link
Member

Thanks for reporting, @JervenBolleman!

I am likely actually going to make that change to skip printing rdf:datatype attributes for "string" type/range BioPAX OWL properties. Just trying to do some more research and confirm if it's safe (it definitely makes files smaller!) But I am stll not so sure...

May I ask you which dataset were you working with? Were you just converting from existing BioPAX RDF/XML datafile to n-triples or were you building own BioPAX model with Paxtools (6.0.0-SNAPSHOT?), outputting to RDF/XML, then converting?

IgorRodchenkov added a commit that referenced this issue May 12, 2024
…to v3 (due to weird jsonld conversion issues when using jena 4 or 5).
@IgorRodchenkov
Copy link
Member

OMG... It seems that Paxtools IO should not write to the output OWL (RDF/XML) files any of those rdf:datatype attributes at all for the BioPAX model property values (e.g. bp:name, bp:db, bp:id etc. primitive props) because they are well defined in the BioPAX L3 specification (which we import into every such file and also use as namespace prefix, usually "bp").
See BioPAX specs: http://www.biopax.org/release/biopax-level3.owl# and e.g. like <rdfs:range rdf:resource="&xsd;string"/> definitions there for OWL props...

@JervenBolleman
Copy link
Author

@IgorRodchenkov the specification/owl import says what the range of things are. However, the SPARQL and RDF layers don't care about that. Even OWL with datatype reasoning might not do what you expect. I think it as likely that an OWL reasoner would say that the pax file is inconsistent with the definition as it would be to coeerce a values datatype to a different one. In any case before you implement that may I suggest testing it out first, with a wide variety of tools.

@JervenBolleman
Copy link
Author

We are updating the code that is used to generate the biopax level2 of Rhea and noticed the change in output.

@IgorRodchenkov
Copy link
Member

IgorRodchenkov commented May 13, 2024

@JervenBolleman please do not use BioPAX Level2 use Level3 if you can (Paxtools can in fact autoconvert L2 to L3).
I made an experimantal demo file w/o any rdf:datatype in there; could you test if it's fine, works with you?
demo-pathway.zip

PS: as to "testing it out first, with a wide variety of tools...", IMHO, unlikely we could ever satisfy all the tools that might also have bugs... As long as we have valid OWL model (written as RDF/XML), we should be good - others can transform/postfix/mixin the data for their needs. Unfortunately, we also had to write lots of code to semantically validate and "clean" existing BioPAX data that we integrate/merge into Pathway Commons model (problems include wrong use of biopax properties, bad/nonsense/misspelled identifier collection names (in xref.db), unification xrefs attached to wrong type of individuals/objects, etc...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants