Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ByteIO with StringIO #1222

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

white-gecko
Copy link
Member

Bring #1183 further. Use string instead of bytes for the serialization.

Some arguments:

  • In many cases strings are created, which are then encoded to byte arrays. I can't imagine that byte arrays are faster in this case then strings.
  • files can be opened with the argument encoding open("file.ttl", encoding="latin-1") which allows users to control the encoding in their code, if necessary.
  • Turtle, TriG, N-Triples, N-Quads are all specified as utf-8 documents which is the default for python3 anyways
  • For N-Triples there is the backward compatibility to use ASCII symbols only.
  • Currently there is some inconsistency with the encoding handling in the code (e.g.
    stream.write("\n".encode("latin-1"))
    )
  • Less code

Proposed Changes

  • Use StringIO instead of ByteIO
  • remove the specification of the encoding parameter from serialization methods.

The test currently fail, there still needs to be some adjustment.

@FlorianLudwig
Copy link
Contributor

I do like the simplifications. Is the plan to remove the encoding= flag from the serializers as well?

@white-gecko
Copy link
Member Author

I do like the simplifications. Is the plan to remove the encoding= flag from the serializers as well?

In general yes. But we have to see, how to deal with UCHARs \uxxx if a developer does not want to use UTF-8. But maybe this case is not needed if we just have everything UTF-8.

@rchateauneu
Copy link
Contributor

Also, this file could be open in "rb" mode:

def _create_input_source_from_location(file, format, input_source, location):
...
    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "r")

It allows to suppress conversion to str. Please note that deserializing "orkg.nt" in "nt" mode uses more than 12% of CPU (plus memory fragmentation of course). Profiling is attached, where this overhead is visible:

prof

Because existing regular expressions are "str" and not "bytes", it would probably be simpler to process everything in "str" mode by default. Conversions would take place only if the input is "bytes" or any other rare case. I tested it, and it saves about 10 or 15%, as expected. Please note that it should work for all deserializations from files.

There is just one corner case to take into account, when the input file is prefixed by a BOM marker (This is tested). This can be fixed for example like that:

        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "r")
        # Beware of utf-8 BOM unicode "feff". If one is detected,
        if sys.platform == "win32":
            bom = file.read(3)
            # codecs.BOM_UTF8= b'\xef\xbb\xbf'
            if bom != "\xef\xbb\xbf":
                file.seek(0)
        else:
            bom = file.read(1)
            if ord(bom) != 0xfeff:
                file.seek(0)

@white-gecko
Copy link
Member Author

The tests currently fail, I think some things need to be adjusted before this can be merged.

@nicholascar
Copy link
Member

@white-gecko are you still interested in continuing with this PR?

@white-gecko
Copy link
Member Author

Yes.

@white-gecko
Copy link
Member Author

But I'm also very happy with help.

@rchateauneu
Copy link
Contributor

But I'm also very happy with help.

I tried an approach (but stopped it because I did not want to break too many things) which was:

  • I assumed that the most important bottleneck, is reading from a flat file to a graph.
  • Then I removed all conversions and adapters from the opening of the file to reading it, assuming "str" everywhere. Some adjustments are possible with "open()" parameters: terminators "\n" or "\r\n". Ensure that, for N3, lines are read in one go: This makes everything much simpler and faster, because there was no conversion nor processing between readline() and re.match().
  • A corner case is the byte order mark at the beginning of some Windows files: What I did (as far as I remember) was something like to open the file in bytes mode to check the presence of two first bytes, then reopen in text mode and seek(2) if needed.

@white-gecko white-gecko added the help wanted Extra attention is needed label Jul 17, 2021
@aucampia
Copy link
Member

This may be related #1418

I think using StringIO may be better in most cases, but this will basically break the interface so it should be deffered to version 7. Maybe we can make everything work with StringIO only.

@nicholascar
Copy link
Member

@white-gecko can you please retest to see if recent PR merges have affected this PR?

@nicholascar
Copy link
Member

@white-gecko time moves on and we've had plenty of new PRs since my last comment! Do you want to return to this PR or should we just close it?

@rchateauneu
Copy link
Contributor

Most of these errors are strings converted with .encode("latin-1") which probably just need to be removed.

@nicholascar
Copy link
Member

Perhaps a few of the .encode("latin-1") statements may have been removed due to other updates to the serializers. Perhaps also if @white-gecko isn't available then we can create a new PR with this content since it's not large. I'll give him a week!

@aucampia
Copy link
Member

I think there is some need to rationalize our IO, and just having one of BytesIO and StringIO may make sense, I will look at this patch maybe next week to see what I can do.

@ghost
Copy link

ghost commented Jan 17, 2022

I think there is some need to rationalize our IO, and just having one of BytesIO and StringIO may make sense, I will look at this patch maybe next week to see what I can do.

In the hope of saving you some time ...

I had a go at this as a response to the "help wanted" label - I made some very limited progress.

It rebases okay and as well as handful of encodes to be removed there are a few apparently-obvious type changes to make (from IO(bytes) to IO(str)). I started with test_finalnewline as it iterates over the serializers. However, I ran out of steam wrestling with Exceptions arising from XMLWriter. Got this far

@ghost ghost added the 7.0 Changes planned for version 7 label Jun 8, 2022
@aucampia aucampia added breaking change This involves or proposes breaking RDFLib's public API. core Relates to core functionality of RDFLib, i.e. `rdflib.{graph,store,term}` and removed 7.0 Changes planned for version 7 labels May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change This involves or proposes breaking RDFLib's public API. core Relates to core functionality of RDFLib, i.e. `rdflib.{graph,store,term}` help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants