Better string parser #17

Closed
tk3369 opened this issue Jan 6, 2018 · 1 comment

@tk3369
Owner

tk3369 commented Jan 6, 2018

The current parser is very simple:

  1. if the file's encoding is US-ASCII or UTF-8, just use the String constructor
  2. if the byte array contains only 7-bit data, use the String constructor as well
  3. otherwise, use the decode() function from StringEncodings.jl to do the conversion
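
The three steps above can be sketched roughly as follows. This is only an illustration, not the package's actual code: `is_7bit`, `parse_string` and `latin1_decode` are hypothetical names, and the `fallback` argument stands in for `decode()` from StringEncodings.jl so that the sketch runs with Base Julia alone.

```julia
# Hypothetical helper: true when every byte fits in 7 bits.
is_7bit(bytes::AbstractVector{UInt8}) = all(b -> b < 0x80, bytes)

# Hypothetical sketch of the three-step parser described above.
# `fallback(bytes, encoding)` is a stand-in for StringEncodings.decode.
function parse_string(bytes::Vector{UInt8}, encoding::AbstractString, fallback)
    # Steps 1 and 2: the bytes are taken as-is.
    # copy() is used because String(::Vector{UInt8}) takes ownership of its argument.
    if encoding in ("US-ASCII", "UTF-8") || is_7bit(bytes)
        return String(copy(bytes))
    end
    # Step 3: everything else goes through the converter.
    return fallback(bytes, encoding)
end

# Stand-in converter for this example only: treats each byte as a Latin-1 code point.
latin1_decode(bytes, _encoding) = join(Char.(bytes))

s1 = parse_string(UInt8[0x61, 0x62], "WINDOWS-1252", latin1_decode)              # 7-bit shortcut
s2 = parse_string(UInt8[0x63, 0x61, 0x66, 0xe9], "ISO-8859-1", latin1_decode)    # goes to fallback
```

In this sketch `s1` never reaches the converter, while `s2` does because of the 0xe9 byte.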

There are two issues:

  1. I'm uncertain whether non-Latin1 character sets may have 7-bit data that's incompatible with ASCII. If that's the case, then the decoded string may look scrambled.
  2. The decode() function creates a new IOBuffer and closes it for every call. This is very inefficient for our purpose; instead, we can pre-create the buffer and StringDecoder object and reuse them for the entire file-read operation.
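
To make the first concern concrete, here is a small sketch using UTF-16 as the example encoding (my choice for illustration; it is not necessarily an encoding that shows up in SAS files). Every byte passes the 7-bit check, yet taking the bytes as ASCII gives a different string than decoding them properly. Little-endian byte order is assumed, and only Base Julia is used.

```julia
# "AB" encoded as UTF-16 little-endian: every byte is below 0x80.
bytes = UInt8[0x41, 0x00, 0x42, 0x00]
only_7bit = all(b -> b < 0x80, bytes)

# Shortcut path: take the bytes as-is. NUL characters end up embedded in the string.
naive = String(copy(bytes))

# Proper path: combine byte pairs into 16-bit code units (little-endian assumed),
# then let Base convert UTF-16 to a UTF-8 String.
units = [UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8) for i in 1:2:length(bytes)]
proper = transcode(String, units)
```

Here `naive` is `"A\0B\0"` while `proper` is `"AB"`, so the 7-bit shortcut is only safe for encodings that are ASCII-compatible.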

It may be worth waiting a little for the Strs.jl package. See the discussion at:
https://discourse.julialang.org/t/string-encodings-help/8188/8

@tk3369
Owner Author

tk3369 commented Jan 13, 2018

Release v0.4.1 addresses this item partially.

This part is done:

  1. The decode() function creates a new IOBuffer and closes it for every call. This is very inefficient for our purpose; instead, we can pre-create the buffer and StringDecoder object and reuse them for the entire file-read operation.

Interestingly, string parsing turned out to be very slow during my initial testing with the extr.sas7bdat file. The reason is that the string columns are padded with a lot of spaces (SAS7BDAT has fixed-width char columns).

In this release, space characters (0x20) are stripped from the byte array before it is passed to the decoder, and parsing performance improves dramatically in this test scenario: approx. 15-20x. The downside of this approach is that 0x20 may actually be part of a multi-byte character that does not represent a space, e.g.

julia> decode([0x02, 0x20], "UTF-16")
"Ƞ"

I decided to keep this change, however, because we can only recognize some of the common encodings from SAS anyway, and they all seem to work fine with this convention. The performance benefit outweighs the odd cases of uncommon file encodings. Note that the ReadStat C library employs the same strategy as well.
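
A minimal sketch of the stripping step and its downside, assuming that only trailing padding is removed (the release may do it differently). `strip_trailing_spaces` is a hypothetical name, not a function from the package, and only Base Julia is used.

```julia
# Hypothetical helper: drop trailing 0x20 bytes before handing data to the decoder.
function strip_trailing_spaces(bytes::Vector{UInt8})
    idx = findlast(!=(0x20), bytes)          # index of the last non-space byte, or nothing
    return idx === nothing ? UInt8[] : bytes[1:idx]
end

# Typical fixed-width field: "AB" padded with spaces.
field = strip_trailing_spaces(UInt8[0x41, 0x42, 0x20, 0x20])

# The odd case from above: in big-endian UTF-16 the pair 0x02 0x20 is one code unit.
code_unit = (UInt16(0x02) << 8) | 0x20
ch = Char(code_unit)                          # U+0220

# After stripping, only half of that code unit is left.
broken = strip_trailing_spaces(UInt8[0x02, 0x20])
```

`field` comes back as the two bytes for "AB", whereas `broken` holds a single byte that is no longer a complete UTF-16 code unit.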

@tk3369 tk3369 closed this as completed Jan 23, 2018