Better string parser #17

tk3369 · 2018-01-06T18:24:37Z

The current parser is very simple:

If the file's encoding is US-ASCII or UTF-8, just use the String constructor.
If the byte array contains only 7-bit data, use the String constructor as well.
Otherwise, use the decode() function from StringEncodings.jl to do the conversion.
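
The three steps above can be sketched roughly as follows. This is an illustration only, not the package's actual code: `parse_string`, the `fallback` argument, and the `latin1` stand-in decoder are all hypothetical names. In practice `fallback` would be something like a call to decode() from StringEncodings.jl.

```julia
# Hypothetical sketch of the three-step parser described above.
# `fallback(bytes, encoding)` stands in for the real decoder call.
function parse_string(bytes::Vector{UInt8}, encoding::AbstractString, fallback::Function)
    # Step 1: US-ASCII and UTF-8 data needs no conversion.
    if encoding in ("US-ASCII", "UTF-8")
        return String(copy(bytes))  # copy, because String() takes ownership of the vector
    end
    # Step 2: pure 7-bit data is treated as ASCII, whatever the declared encoding.
    if all(b -> b < 0x80, bytes)
        return String(copy(bytes))
    end
    # Step 3: everything else goes through the decoder.
    return fallback(bytes, encoding)
end

# Stand-in Latin-1 decoder (assumption), so the sketch runs without extra packages.
latin1(bytes, _) = join(Char(b) for b in bytes)

ascii_result  = parse_string(UInt8[0x61, 0x62], "LATIN1", latin1)  # takes the 7-bit shortcut
latin1_result = parse_string(UInt8[0xe9], "LATIN1", latin1)        # falls through to the decoder
```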

There are two issues:

I'm uncertain whether non-Latin1 character sets may have 7-bit data that's incompatible with ASCII. If that's the case, then the decoded string may look scrambled.
The decode() function creates a new IOBuffer and closes it for every call. This is very inefficient for our purpose; instead, we can pre-create the buffer and StringDecoder object and reuse them for the entire file-read operation.
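
On the first issue, a small illustration of how the 7-bit shortcut could misfire. UTF-16 is used here only as a convenient example of an encoding that is not ASCII-compatible; whether such encodings actually show up in SAS files is not something this sketch establishes.

```julia
# "AB" encoded as big-endian UTF-16: every byte is below 0x80.
bytes = UInt8[0x00, 0x41, 0x00, 0x42]

# The 7-bit check passes...
is_7bit = all(b -> b < 0x80, bytes)

# ...but treating the bytes as ASCII keeps the NUL bytes in the result.
naive = String(copy(bytes))

println(is_7bit)
println(repr(naive))
```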

It may be worth waiting a little for the Strs.jl package. See the discussion at:
https://discourse.julialang.org/t/string-encodings-help/8188/8

tk3369 · 2018-01-13T20:48:10Z

Release v0.4.1 addresses this item partially.

This part is done:

The decode() function creates a new IOBuffer and closes it for every call. This is very inefficient for our purpose; instead, we can pre-create the buffer and StringDecoder object and reuse them for the entire file-read operation.

Interestingly, string parsing performed poorly in my initial testing with the extr.sas7bdat file. As it turns out, it was much slower because the string columns are padded with a lot of spaces (SAS7BDAT has fixed-width char columns).

In this release, space characters (0x20) are stripped from the byte array before it is passed to the decoder, so parsing performance improves dramatically in this test scenario -- approx. 15-20x. The downside of this approach is that 0x20 may actually be part of a multi-byte character that is not a space, e.g.

julia> decode([0x02, 0x20], "UTF-16")
"Ƞ"

I decided to keep this change, however, because we can only recognize some of the common encodings from SAS anyway, and they all seem to work fine with this convention. The performance benefit outweighs the odd cases of uncommon file encodings. Note that the ReadStat C library employs the same strategy as well.
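
For illustration, the space-stripping step might look something like the sketch below. It assumes that only trailing padding is removed, which the description above does not spell out, and `rstrip_spaces` is a hypothetical name rather than the function used in the release. It requires Julia 1.0 or later, where `findlast` returns `nothing` when no element matches.

```julia
# Hypothetical helper: drop trailing 0x20 padding bytes before decoding.
function rstrip_spaces(bytes::AbstractVector{UInt8})
    stop = findlast(b -> b != 0x20, bytes)
    # Empty input or nothing but padding: there is nothing left to decode.
    stop === nothing && return UInt8[]
    return bytes[firstindex(bytes):stop]
end

# A fixed-width column value: "abc" followed by five padding spaces.
padded  = vcat(Vector{UInt8}("abc"), fill(0x20, 5))
trimmed = rstrip_spaces(padded)

# The downside: the two-byte UTF-16 sequence from the example above loses its second byte.
broken = rstrip_spaces(UInt8[0x02, 0x20])
```

The second call shows why the convention is only safe for encodings in which 0x20 always means a space.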

tk3369 mentioned this issue Jan 6, 2018

Performance comparison with ReadStat.jl #10

Closed

tk3369 added enhancement and removed enhancement labels Jan 7, 2018

tk3369 self-assigned this Jan 10, 2018

tk3369 mentioned this issue Jan 13, 2018

performance pack #19

Merged

tk3369 closed this as completed Jan 23, 2018
