-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
safer vector<->string conversions, fixing #24388 #25241
Conversation
base/strings/io.jl
Outdated
IOBuffer(str::String) = IOBuffer(Vector{UInt8}(str)) | ||
IOBuffer(s::SubString{String}) = IOBuffer(view(Vector{UInt8}(s.string), s.offset + 1 : s.offset + sizeof(s))) | ||
IOBuffer(str::String) = IOBuffer(unsafe_wrap(Vector{UInt8},str)) | ||
IOBuffer(s::SubString{String}) = IOBuffer(view(unsafe_wrap(Vector{UInt8},s.string), s.offset + 1 : s.offset + sizeof(s))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like it would make more sense to use a StringBytes
here
base/strings/io.jl
Outdated
@@ -373,7 +373,10 @@ function unescape_string(io, s::AbstractString) | |||
end | |||
end | |||
|
|||
macro b_str(s); :(Vector{UInt8}($(unescape_string(s)))); end | |||
macro b_str(s) | |||
v = unsafe_wrap(Vector{UInt8}, unescape_string(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't performance sensitive to run the macro, but this would probably generate better runtime performance if it was making an actual copy here.
base/strings/string.jl
Outdated
unsafe_wrap(::Type{Vector{UInt8}}, s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s) | ||
|
||
struct StringBytes <: AbstractVector{UInt8} | ||
s::String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this were parameterized as S <: AbstractString
, this would likely be helpful for implementing readuntil
(I know, somewhat random connection, but I had been strongly considering creating this type – whereas right now we typically opt for using a special-case to access bytes of String and convert everything else to a Char array)
base/strings/string.jl
Outdated
next(s::StringBytes, i) = (@_propagate_inbounds_meta; (s[i], i+1)) | ||
done(s::StringBytes, i) = (@_inline_meta; i == length(s)+1) | ||
|
||
copy(s::StringBytes) = copyto!(Vector{UInt8}(uninitialized, length(s)), s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we copying immutable data? Also, seems to return the wrong type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should probably be a method of Vector{UInt8}
. Though there's some precedent for returning a different type e.g. when copying views.
base/strings/string.jl
Outdated
copy(s::StringBytes) = copyto!(Vector{UInt8}(uninitialized, length(s)), s) | ||
|
||
unsafe_convert(::Type{Ptr{UInt8}}, s::StringBytes) = convert(Ptr{UInt8}, pointer(s.s)) | ||
unsafe_convert(::Type{Ptr{Int8}}, s::StringBytes) = convert(Ptr{Int8}, pointer(s.s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= unsafe_convert(Ptr{Int8}, s.s)
and drop the definitions of pointer
(why does this function still exist anyways?)
base/strings/util.jl
Outdated
hex2bytes(s::Union{String,AbstractVector{UInt8}}) = hex2bytes!(Vector{UInt8}(uninitialized, length(s) >> 1), s) | ||
|
||
_nbytes(s::String) = sizeof(s) | ||
_nbytes(s::AbstractVector{UInt8}) = length(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this also sizeof
? I thought we define that sizeof
means nbytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was being super paranoid, but you're probably right.
base/c.jl
Outdated
transcode(::Type{String}, src) = String(transcode(UInt8, src)) | ||
|
||
function transcode(::Type{UInt16}, src::Vector{UInt8}) | ||
function transcode(::Type{UInt16}, src::AbstractVector{UInt8}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function assumes that axes(src) === 1:length(src)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely it's not the only function like that... should I use Union{Vector{UInt8},StringBytes}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, let's keep this as specific as possible – we can always generalize as needed.
+1 for StringBytes being exported! For the deprecation, you could deprecate to |
ebb7d86
to
494705e
Compare
494705e
to
76b1330
Compare
@@ -188,8 +188,8 @@ julia> String(take!(io)) | |||
"Haho" | |||
``` | |||
""" | |||
IOBuffer(str::String) = IOBuffer(Vector{UInt8}(str)) | |||
IOBuffer(s::SubString{String}) = IOBuffer(view(Vector{UInt8}(s.string), s.offset + 1 : s.offset + sizeof(s))) | |||
IOBuffer(str::String) = IOBuffer(unsafe_wrap(Vector{UInt8}, str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also IOBuffer(s::StringBytes) = IOBuffer(s.s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a different question but related to the same line. Why unsafe_wrap
is used here - for efficiency? I am asking, because it seems that this should work with StringBytes
equally well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using StringBytes
and a few things broke that I didn't feel like dealing with. The type IOBuffer
actually implies that a Vector{UInt8}
will be used:
julia> IOBuffer
Base.GenericIOBuffer{Array{UInt8,1}}
so returning a GenericIOBuffer{StringBytes}
instead can cause surprises.
base/strings/string.jl
Outdated
Wrap a `String` (without copying) in an immutable vector-like object that accesses the bytes | ||
of the string's representation. | ||
""" | ||
struct StringBytes <: AbstractVector{UInt8} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<: DenseVector{UInt8}
, and include strides
methods?
76b1330
to
1cae074
Compare
@StefanKarpinski @bkamins Approve? |
Thanks! I understand the target functionality is:
I think it is very good and |
We might want to hold off on exporting |
If you want me to, I could make a PR where |
|
We're pretty good at eliding allocations these days, so I suspect it'd be fine, but I'd recommend verifying in the cases you're interested in. One thing in particular is that we can only elide allocations if both the allocation and all its uses are inlined into one function. |
If so, it seems like the name is wrong. Either it would be specifically a |
Yes, clearly the name |
I think we don't want string types to have to define I can rename this |
I think if the type could be called |
add `CodeUnits` and `codeunits` fixes #24388
1cae074
to
7bce3b1
Compare
OK, see how this grabs you. |
let v = unsafe_wrap(Vector{UInt8}, "abc") | ||
s = String(v) | ||
@test_throws BoundsError v[1] | ||
push!(v, UInt8('x')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. Why doesn't this line throw an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need to. When push!
is called new storage is allocated for the vector (also as a String
in case you want to make more strings from it).
@@ -60,7 +60,11 @@ This representation is often appropriate for passing strings to C. | |||
String(s::AbstractString) = print_to_string(s) | |||
String(s::Symbol) = unsafe_string(unsafe_convert(Ptr{UInt8}, s)) | |||
|
|||
(::Type{Vector{UInt8}})(s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s) | |||
unsafe_wrap(::Type{Vector{UInt8}}, s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have unsafe_wrap(::Type{Array}, s) = unsafe_wrap(Vector{UInt8}, s)
?
This looks good to me. I guess the wrapping of string object itself is fairly essential here, which inherently couples the |
What's the difference? This should work for anything that implements |
Just that |
I tried a couple things here. First, I decided that we will still want some way to wrap a String as a Vector in-place, so I made that a method of
unsafe_wrap
as suggested by @vtjnash. But then, there are many uses that just want to access the bytes of a string and don't do anything unsafe. To handle that I added aStringBytes
CodeUnits
AbstractVector type that exposes the bytes of aString
as an immutable UInt8 vector.My questions are (1) do we want
CodeUnits
and should it be exported, and (2) what should the existing functions be deprecated to? I feel the majority of uses will wantCodeUnits
, but that isn't an accurate deprecation since it doesn't return aVector{UInt8}
.