Replace ByteIO with StringIO #1222

white-gecko · 2020-12-29T13:29:41Z

Bring #1183 further. Use string instead of bytes for the serialization.

Some arguments:

In many cases strings are created, which are then encoded to byte arrays. I can't imagine that byte arrays are faster in this case then strings.
files can be opened with the argument encoding open("file.ttl", encoding="latin-1") which allows users to control the encoding in their code, if necessary.
Turtle, TriG, N-Triples, N-Quads are all specified as utf-8 documents which is the default for python3 anyways
For N-Triples there is the backward compatibility to use ASCII symbols only.
Currently there is some inconsistency with the encoding handling in the code (e.g.

rdflib/rdflib/plugins/serializers/turtle.py

Line 253 in e1da955

stream.write("\n".encode("latin-1"))

)
Less code

Proposed Changes

Use StringIO instead of ByteIO
remove the specification of the encoding parameter from serialization methods.

The test currently fail, there still needs to be some adjustment.

FlorianLudwig · 2020-12-29T20:31:50Z

I do like the simplifications. Is the plan to remove the encoding= flag from the serializers as well?

white-gecko · 2021-02-05T07:47:41Z

I do like the simplifications. Is the plan to remove the encoding= flag from the serializers as well?

In general yes. But we have to see, how to deal with UCHARs \uxxx if a developer does not want to use UTF-8. But maybe this case is not needed if we just have everything UTF-8.

rchateauneu · 2021-05-06T17:51:10Z

Also, this file could be open in "rb" mode:

def _create_input_source_from_location(file, format, input_source, location):
...
    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "r")

It allows to suppress conversion to str. Please note that deserializing "orkg.nt" in "nt" mode uses more than 12% of CPU (plus memory fragmentation of course). Profiling is attached, where this overhead is visible:

Because existing regular expressions are "str" and not "bytes", it would probably be simpler to process everything in "str" mode by default. Conversions would take place only if the input is "bytes" or any other rare case. I tested it, and it saves about 10 or 15%, as expected. Please note that it should work for all deserializations from files.

There is just one corner case to take into account, when the input file is prefixed by a BOM marker (This is tested). This can be fixed for example like that:

        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "r")
        # Beware of utf-8 BOM unicode "feff". If one is detected,
        if sys.platform == "win32":
            bom = file.read(3)
            # codecs.BOM_UTF8= b'\xef\xbb\xbf'
            if bom != "\xef\xbb\xbf":
                file.seek(0)
        else:
            bom = file.read(1)
            if ord(bom) != 0xfeff:
                file.seek(0)

white-gecko · 2021-05-06T18:24:52Z

The tests currently fail, I think some things need to be adjusted before this can be merged.

nicholascar · 2021-07-02T13:44:28Z

@white-gecko are you still interested in continuing with this PR?

white-gecko · 2021-07-16T12:16:08Z

Yes.

white-gecko · 2021-07-16T15:30:21Z

But I'm also very happy with help.

rchateauneu · 2021-07-16T15:47:43Z

But I'm also very happy with help.

I tried an approach (but stopped it because I did not want to break too many things) which was:

I assumed that the most important bottleneck, is reading from a flat file to a graph.
Then I removed all conversions and adapters from the opening of the file to reading it, assuming "str" everywhere. Some adjustments are possible with "open()" parameters: terminators "\n" or "\r\n". Ensure that, for N3, lines are read in one go: This makes everything much simpler and faster, because there was no conversion nor processing between readline() and re.match().
A corner case is the byte order mark at the beginning of some Windows files: What I did (as far as I remember) was something like to open the file in bytes mode to check the presence of two first bytes, then reopen in text mode and seek(2) if needed.

aucampia · 2021-10-17T13:51:56Z

This may be related #1418

I think using StringIO may be better in most cases, but this will basically break the interface so it should be deffered to version 7. Maybe we can make everything work with StringIO only.

nicholascar · 2021-11-21T10:26:56Z

@white-gecko can you please retest to see if recent PR merges have affected this PR?

nicholascar · 2022-01-16T06:57:38Z

@white-gecko time moves on and we've had plenty of new PRs since my last comment! Do you want to return to this PR or should we just close it?

rchateauneu · 2022-01-17T06:02:23Z

Most of these errors are strings converted with .encode("latin-1") which probably just need to be removed.

nicholascar · 2022-01-17T06:11:41Z

Perhaps a few of the .encode("latin-1") statements may have been removed due to other updates to the serializers. Perhaps also if @white-gecko isn't available then we can create a new PR with this content since it's not large. I'll give him a week!

aucampia · 2022-01-17T07:47:51Z

I think there is some need to rationalize our IO, and just having one of BytesIO and StringIO may make sense, I will look at this patch maybe next week to see what I can do.

ghost · 2022-01-17T12:05:09Z

I think there is some need to rationalize our IO, and just having one of BytesIO and StringIO may make sense, I will look at this patch maybe next week to see what I can do.

In the hope of saving you some time ...

I had a go at this as a response to the "help wanted" label - I made some very limited progress.

It rebases okay and as well as handful of encodes to be removed there are a few apparently-obvious type changes to make (from IO(bytes) to IO(str)). I started with test_finalnewline as it iterates over the serializers. However, I ran out of steam wrestling with Exceptions arising from XMLWriter. Got this far

Replace ByteIO with StringIO

356ca6b

white-gecko mentioned this pull request Feb 5, 2021

Modernize ntriples serializer #1245

Closed

white-gecko mentioned this pull request Feb 19, 2021

handle encoding of base64Binary Literals #1258

Merged

white-gecko mentioned this pull request May 6, 2021

Speedup NTriples parser #1297

Closed

rchateauneu mentioned this pull request May 21, 2021

Speedup ntriples parser #1300

Closed

white-gecko added the help wanted Extra attention is needed label Jul 17, 2021

white-gecko mentioned this pull request Jul 23, 2021

NTSerializer only encodes in ASCII #1377

Closed

white-gecko mentioned this pull request Oct 2, 2021

A new Turtle serializer #1425

Merged

Merge branch 'master' into feature/StringIO

77115bd

ghost added the 7.0 Changes planned for version 7 label Jun 8, 2022

white-gecko added this to the 2022 October release (7.0.0) milestone Jun 20, 2022

aucampia added breaking change This involves or proposes breaking RDFLib's public API. core Relates to core functionality of RDFLib, i.e. `rdflib.{graph,store,term}` and removed 7.0 Changes planned for version 7 labels May 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace ByteIO with StringIO #1222

Replace ByteIO with StringIO #1222

white-gecko commented Dec 29, 2020

FlorianLudwig commented Dec 29, 2020

white-gecko commented Feb 5, 2021

rchateauneu commented May 6, 2021

white-gecko commented May 6, 2021

nicholascar commented Jul 2, 2021

white-gecko commented Jul 16, 2021

white-gecko commented Jul 16, 2021

rchateauneu commented Jul 16, 2021

aucampia commented Oct 17, 2021

nicholascar commented Nov 21, 2021

nicholascar commented Jan 16, 2022

rchateauneu commented Jan 17, 2022

nicholascar commented Jan 17, 2022

aucampia commented Jan 17, 2022

ghost commented Jan 17, 2022

Replace ByteIO with StringIO #1222

Are you sure you want to change the base?

Replace ByteIO with StringIO #1222

Conversation

white-gecko commented Dec 29, 2020

Proposed Changes

FlorianLudwig commented Dec 29, 2020

white-gecko commented Feb 5, 2021

rchateauneu commented May 6, 2021

white-gecko commented May 6, 2021

nicholascar commented Jul 2, 2021

white-gecko commented Jul 16, 2021

white-gecko commented Jul 16, 2021

rchateauneu commented Jul 16, 2021

aucampia commented Oct 17, 2021

nicholascar commented Nov 21, 2021

nicholascar commented Jan 16, 2022

rchateauneu commented Jan 17, 2022

nicholascar commented Jan 17, 2022

aucampia commented Jan 17, 2022

ghost commented Jan 17, 2022