Replace ByteIO with StringIO #1222
I do like the simplifications. Is the plan to remove the |
In general, yes. But we have to see how to deal with UCHARs |
Also, this file could be opened in "rb" mode:
This avoids the conversion to str. Please note that when deserializing "orkg.nt" in "nt" mode, this conversion accounts for more than 12% of CPU time (plus memory fragmentation, of course). Profiling is attached, where this overhead is visible.
Because the existing regular expressions are "str" and not "bytes", it would probably be simpler to process everything in "str" mode by default. Conversions would take place only if the input is "bytes", or in any other rare case. I tested it, and it saves about 10 to 15%, as expected. Please note that it should work for all deserializations from files.
There is just one corner case to take into account: when the input file is prefixed by a BOM marker (this is tested). This can be fixed, for example, like this:
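The snippet originally posted with this comment is not preserved here. As a rough illustration only, BOM handling in "str by default" processing could look like the sketch below. The helper name `to_text_stream` is hypothetical and not part of rdflib; it assumes the input is a `str`, `bytes`, or a file-like object with a `read()` method.

```python
import codecs
import io


def to_text_stream(source, encoding="utf-8"):
    """Return a StringIO for str, bytes, or file-like input (hypothetical helper).

    Decoding happens only when the input is bytes; str input passes through.
    """
    if isinstance(source, (str, bytes)):
        data = source
    else:
        # Assumption: anything else is a file-like object opened in "r" or "rb" mode.
        data = source.read()
    if isinstance(data, bytes):
        data = data.decode(encoding)
    # A UTF-8 BOM decodes to U+FEFF; drop it so the parser never sees it.
    if data.startswith("\ufeff"):
        data = data[1:]
    return io.StringIO(data)


line = "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
with_bom = codecs.BOM_UTF8 + line.encode("utf-8")
from_bytes = to_text_stream(with_bom).read()
from_str = to_text_stream(line).read()
```

When the file is opened directly in text mode, passing `encoding="utf-8-sig"` to `open()` is another way to have Python drop a leading BOM.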
|
The tests currently fail. I think some things need to be adjusted before this can be merged. |
@white-gecko are you still interested in continuing with this PR? |
Yes. |
But I'm also very happy with help. |
I tried an approach (but stopped it because I did not want to break too many things) which was:
|
This may be related to #1418. I think using StringIO may be better in most cases, but this will basically break the interface, so it should be deferred to version 7. Maybe we can make everything work with StringIO only. |
@white-gecko can you please retest to see if recent PR merges have affected this PR? |
@white-gecko time moves on and we've had plenty of new PRs since my last comment! Do you want to return to this PR or should we just close it? |
Most of these errors are strings converted with |
Perhaps a few of the |
I think there is some need to rationalize our IO, and having just one of BytesIO and StringIO may make sense. I will look at this patch, maybe next week, to see what I can do. |
In the hope of saving you some time ... I had a go at this as a response to the "help wanted" label - I made some very limited progress. It rebases okay and as well as handful of |
Bring #1183 further. Use string instead of bytes for the serialization.
Some arguments:
open("file.ttl", encoding="latin-1")
which allows users to control the encoding in their code, if necessary.
rdflib/rdflib/plugins/serializers/turtle.py, line 253 in e1da955
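To illustrate the argument above, here is a minimal sketch of a serializer that writes only `str` and leaves the choice of encoding to whoever opens the stream. The function `write_ntriples` is a hypothetical stand-in, not rdflib's serializer API, and the triples are assumed to be already formatted as N-Triples terms.

```python
import io
from typing import Iterable, TextIO, Tuple

Triple = Tuple[str, str, str]


def write_ntriples(triples: Iterable[Triple], stream: TextIO) -> None:
    """Write pre-formatted terms as text lines (hypothetical helper, not rdflib API)."""
    for s, p, o in triples:
        stream.write(f"{s} {p} {o} .\n")


triples = [("<http://example.org/s>", "<http://example.org/p>", '"caf\u00e9"')]

# Serializing to a string needs no encoding at all.
buffer = io.StringIO()
write_ntriples(triples, buffer)
text = buffer.getvalue()

# The caller picks the encoding on the stream, as open("file.ttl", "w",
# encoding="latin-1") would; a TextIOWrapper over BytesIO stands in for the file.
raw = io.BytesIO()
wrapper = io.TextIOWrapper(raw, encoding="latin-1", newline="")
write_ntriples(triples, wrapper)
wrapper.flush()
data = raw.getvalue()
```

With this split, the serializer never needs an `encoding` argument of its own, since the text stream already carries that decision.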
Proposed Changes
encoding parameter from serialization methods.
The tests currently fail; some adjustment is still needed.