if the file's encoding is US-ASCII or UTF-8, then just use the String constructor
if the byte array contains only 7-bit data, then use String constructor
use decode() function from StringEncodings.jl to do the conversion
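The three steps above can be sketched roughly as follows. This is only an illustration: `bytes_to_string` is a hypothetical name, not the package's actual function, and the encoding names are assumed to arrive as plain strings such as "UTF-8" or "ISO-8859-1".

```julia
using StringEncodings  # provides decode(bytes, encoding)

# Hypothetical helper mirroring the three steps listed above.
function bytes_to_string(bytes::Vector{UInt8}, encoding::AbstractString)
    if encoding in ("US-ASCII", "UTF-8")
        # Step 1: the file encoding is already compatible with Julia strings.
        return String(copy(bytes))
    elseif all(b -> b < 0x80, bytes)
        # Step 2: only 7-bit data, so treat it as ASCII.
        return String(copy(bytes))
    else
        # Step 3: fall back to iconv via StringEncodings.jl.
        return decode(bytes, encoding)
    end
end
```

The `copy` calls are there because `String(::Vector{UInt8})` takes ownership of the vector it is given, which would otherwise empty the caller's buffer.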
There are two issues:
I'm uncertain whether non-Latin1 character sets may have 7-bit data that's incompatible with ASCII. If that's the case, then the decoded string may look scrambled.
The decode() function creates a new IOBuffer and closes it for every call. This is very inefficient for our purpose; instead, we can pre-create the buffer and StringDecoder object and reuse them for the entire file-read operation.
Interestingly, string parsing was very slow during my initial testing with the extr.sas7bdat file. As it turns out, it is much slower because the string columns are padded with a lot of spaces (SAS7BDAT has fixed-width char columns).
In this release, space characters (0x20) are stripped off from the byte array before passing over to the decoder and so the parsing performance dramatically improves in this test scenario -- approx 15-20x. The downside of this approach is that 0x20 may actually be part of a char encoding that does not represent space e.g.
julia> decode([0x02, 0x20], "UTF-16")
"Ƞ"
I decided to keep this change, however, because we can only recognize some of the common encodings from SAS anyway, and they all seem to work fine with this convention. The performance improvement outweighs the odd cases of uncommon file encodings. Note that the ReadStat C library employs the same strategy as well.
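To make the trade-off concrete, here is a minimal sketch of the stripping step. It assumes the padding is trailing only, and `strip_trailing_spaces` is a hypothetical name used for illustration, not the function in this package.

```julia
# Hypothetical helper: drop trailing 0x20 padding from a fixed-width field.
function strip_trailing_spaces(bytes::AbstractVector{UInt8})
    idx = findlast(b -> b != 0x20, bytes)
    # An empty or all-space field becomes an empty vector.
    return idx === nothing ? UInt8[] : bytes[firstindex(bytes):idx]
end

# A 16-byte fixed-width field holding "abc" shrinks to 3 bytes before decoding.
padded = Vector{UInt8}("abc" * " "^13)
trimmed = strip_trailing_spaces(padded)

# The caveat from above: in UTF-16 the 0x20 is half of a code unit,
# so stripping it leaves a single stray byte for the decoder.
broken = strip_trailing_spaces([0x02, 0x20])
```

The decoder then only sees the meaningful bytes, which is where the speedup comes from when columns are mostly padding.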
The current parser is very simple:
It may be worth waiting a little for the Strs.jl package. See the discussion at:
https://discourse.julialang.org/t/string-encodings-help/8188/8