Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect turtle serialization of decimal values #1043

Closed
saggu opened this issue May 9, 2020 · 13 comments
Closed

Incorrect turtle serialization of decimal values #1043

saggu opened this issue May 9, 2020 · 13 comments

Comments

@saggu
Copy link

saggu commented May 9, 2020

I am using rdflib to serialize to ttl files, rdflib = 5.0.0

Here is a sample output,

wd:Q29722949 a wikibase:Item ;
    p:P2020013 wds:Q29722949-9ef7c6be-2c5e-4c64-8e44-a3af6725596e ;
    p:P2020014 wds:Q29722949-5570794f-5918-45b8-936f-9a2e24a6a61d ;
    p:P2020015 wds:Q29722949-62e5de90-74ab-47bd-a8b5-13323986c20b ;
    p:P2020016 wds:Q29722949-daa7b2f2-3646-4e78-bdb3-eb01c745d502 ;
    p:P2020017 wds:Q29722949-0e7cb2f1-a13b-47a0-b6a9-c24908c0edcf ;
    wdtn:P2020013 wdv:Quantityc1c0c0c0 ;
    wdtn:P2020014 wdv:Quantityc0c0c0c0 ;
    wdtn:P2020015 wdv:Quantityc0-000000216c0c0c0 ;
    wdtn:P2020016 wdv:Quantityc0c0c0c0 ;
    wdtn:P2020017 wdv:Quantityc0-0000000000000005c0c0c0 ;
    wdt:P2020013 1.0 ;
    wdt:P2020014 0.0 ;
    wdt:P2020015 2.16E-7 ;
    wdt:P2020016 0.0 ;
    wdt:P2020017 5E-16.0 .

notice that the line wdt:P2020015 2.16E-7 is correct while, wdt:P2020017 5E-16.0 is incorrect as it has a .0 at the end, which is invalid.

Please take a look, this seems to be happening randomly

@saggu
Copy link
Author

saggu commented May 11, 2020

I did some digging and this is not random, the https://github.com/RDFLib/rdflib/blob/master/rdflib/term.py#L1243 is the culprit here.
I used type DECIMAL and since there is no . in the literal 5E-16, the code is appending a .0 in the end

@tgbugs
Copy link
Contributor

tgbugs commented May 11, 2020

If the datatype on 5E-16 is actually xsd:decimal then it should not be serialized in scientific notation anyway according to the spec. If it is a double or a float, then both 5E-16.0 and 5E-16 are correct according to my reading of the spec.

@saggu
Copy link
Author

saggu commented May 11, 2020

Here is what the link you posted says about xsd:double

Double-precision floating point values may be written as an optionally
signed mantissa with an optional decimal point, the letter "e" or "E", 
and an optionally signed integer exponent. The exponent matches the 
regular expression "[+-]?[0-9]+" and the mantissa one of these regular 
expressions: "[+-]?[0-9]+\.[0-9]+", "[+-]?\.[0-9]+" or "[+-]?[0-9]".

The exponent matches the regular expression "[+-]?[0-9]+"
, so it seems 5E-16.0 is not valid.

Anyway, we are talking about python, where it chooses to represent a number in scientific notation if it has more than 4 zeroes after the decimal. irrespective of whether we cast it as float or decimal or whatever.
Simple solution would be to check if there is an e|E in the case even when the type is xsd:decimal.

I can create a PR when I have time to spare

@saggu
Copy link
Author

saggu commented May 11, 2020

Or, just not try to produce a "pretty" output, will also work :)

@tgbugs
Copy link
Contributor

tgbugs commented May 11, 2020

Right. I see the issue now. I would say that the "don't make it pretty" option is probably the right approach, and we should make sure that python Decimal objects or things explicitly marked with xsd:decimal are not converted to floating point.

@nicholascar
Copy link
Member

Is anyone here going to work on this or is the a wontfix?

@tgbugs
Copy link
Contributor

tgbugs commented May 15, 2020

I do not expect to have bandwidth to work on it any time soon but it definitely needs to be fixed. Also I think is part of a general issue around serialization of literals.

@achaudhary997
Copy link
Contributor

So I was looking into this issue from the past couple of days.
The following test case shows the above serialization issue.

g = Graph()
g.bind('xsd', XSD)
g.bind('rdfs', RDFS)
n = Namespace("http://example.org/")
g.add((n.number, RDFS.label, Literal(0.00000004, datatype=XSD.decimal)))

Since the _castPythonToLiteral function checks the type of input value (which would be float in the above e.g. Therefore it does no conversion as per the _GenericPythonToXSDRules and simply stores it as (float, XSD.decimal).

If I change the last line to

g.add((n.number, RDFS.label, Literal(decimal.Decimal(0.00000004), datatype=XSD.decimal)))

The conversion works properly the number is indeed stored as Decimal but the representation is no longer pretty (has a large number of non-zero decimal points).

As of now, I have thought of the below two ways of fixing the serialization bug.

  1. to check if there is an e|E in the case even when the type is xsd:decimal (as suggested by saggu)
  2. to not try and make it pretty, and return the representation as it is.
  3. to add a check if the user wants to store float as a decimal and make the appropriate conversion in the code. (this will lead to the output having a large amount of non-zero decimal points due to the way decimals are stored).

I have also added a test case for this issue, and the first two solutions are ready to go (open to suggestions as to which one will be better to push). Want to know if there are some other suggestions or should I prefer the 3rd option over the first 2?

@saggu
Copy link
Author

saggu commented May 19, 2020

Right. I see the issue now. I would say that the "don't make it pretty" option is probably the right approach, and we should make sure that python Decimal objects or things explicitly marked with xsd:decimal are not converted to floating point.

this is what would be the best solution, according to @tgbugs .
@achaudhary997

@ashleysommer
Copy link
Contributor

ashleysommer commented May 27, 2020

I don't think theres really a correct answer for this.
Constructing a rdflib Literal with the python float (double) value 0.000004 and telling the Literal to use datatype XSD.Decimal is really asking for trouble. A python float and decimal are totally different things and serialize to strings differently. No amount of logic can resolve that discrepancy while behaving in a manner that all users would find predictable.

Personally I think it best to avoid this (and how I do it in my projects) is as @achaudhary997 wrote:
Literal(decimal.Decimal('0.00000004'), datatype=XSD.decimal)
That is; when you want a Literal with datatype=XSD.decimal, then populate it with a Decimal value, not a float.

As for this bug, I believe number option 1 above is the best way forward. Still try to format it as before, but if it has e or E in it, then don't append the .0.

@nicholascar
Copy link
Member

when you want a Literal with datatype=XSD.decimal, then populate it with a Decimal value, not a float.

Agree! Right back to the early designs of RDFlib the idea has always been that you should be aiming to do the right thing (you can always make up edge cases that break any software!) so use the appropriate Python Decimal() if you want RDF's xsd:decimal.

@achaudhary997
Copy link
Contributor

achaudhary997 commented May 27, 2020

As of now the pull request (not merged yet) just returns the representation on an as is basis.
Should I change it to check for e|E in the representation before making it pretty? Or let it be?

jesper-friis added a commit to emmo-repo/EMMOntoPy that referenced this issue Jan 3, 2021
rdflib prior to 5.0.0 serialises real literals syntactically wrong in turtle.
See RDFLib/rdflib#1043.
@nicholascar
Copy link
Member

@CraigMiloRogers are you saying that "1.9860001065575846E-7" is incorrect (looks fine to me) or are you perhaps just saying that there is inconsistency? We may not be able to deal with inconsistency, but we would hope that all different forms are, themselves, correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants