safer vector<->string conversions, fixing #24388 #25241

Merged
merged 3 commits into from
Jan 3, 2018

JeffBezanson
Member

@JeffBezanson JeffBezanson commented Dec 22, 2017

I tried a couple things here. First, I decided that we will still want some way to wrap a String as a Vector in-place, so I made that a method of unsafe_wrap as suggested by @vtjnash. But then, there are many uses that just want to access the bytes of a string and don't do anything unsafe. To handle that I added a CodeUnits (originally StringBytes) AbstractVector type that exposes the bytes of a String as an immutable UInt8 vector.

My questions are (1) do we want CodeUnits and should it be exported, and (2) what should the existing functions be deprecated to? I feel the majority of uses will want CodeUnits, but that isn't an accurate deprecation since it doesn't return a Vector{UInt8}.

@JeffBezanson JeffBezanson added the strings "Strings!" label Dec 22, 2017
-IOBuffer(str::String) = IOBuffer(Vector{UInt8}(str))
-IOBuffer(s::SubString{String}) = IOBuffer(view(Vector{UInt8}(s.string), s.offset + 1 : s.offset + sizeof(s)))
+IOBuffer(str::String) = IOBuffer(unsafe_wrap(Vector{UInt8},str))
+IOBuffer(s::SubString{String}) = IOBuffer(view(unsafe_wrap(Vector{UInt8},s.string), s.offset + 1 : s.offset + sizeof(s)))
Member

seems like it would make more sense to use a StringBytes here

@@ -373,7 +373,10 @@ function unescape_string(io, s::AbstractString)
end
end

-macro b_str(s); :(Vector{UInt8}($(unescape_string(s)))); end
+macro b_str(s)
+    v = unsafe_wrap(Vector{UInt8}, unescape_string(s))
Member

This isn't performance sensitive to run the macro, but this would probably generate better runtime performance if it was making an actual copy here.

unsafe_wrap(::Type{Vector{UInt8}}, s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s)

struct StringBytes <: AbstractVector{UInt8}
s::String
Member

if this were parameterized as S <: AbstractString, this would likely be helpful for implementing readuntil (I know, somewhat random connection, but I had been strongly considering creating this type – whereas right now we typically opt for using a special-case to access bytes of String and convert everything else to a Char array)

next(s::StringBytes, i) = (@_propagate_inbounds_meta; (s[i], i+1))
done(s::StringBytes, i) = (@_inline_meta; i == length(s)+1)

copy(s::StringBytes) = copyto!(Vector{UInt8}(uninitialized, length(s)), s)
Member

Why are we copying immutable data? Also, seems to return the wrong type.

Member Author

Yes, this should probably be a method of Vector{UInt8}. Though there's some precedent for returning a different type e.g. when copying views.

copy(s::StringBytes) = copyto!(Vector{UInt8}(uninitialized, length(s)), s)

unsafe_convert(::Type{Ptr{UInt8}}, s::StringBytes) = convert(Ptr{UInt8}, pointer(s.s))
unsafe_convert(::Type{Ptr{Int8}}, s::StringBytes) = convert(Ptr{Int8}, pointer(s.s))
Member

= unsafe_convert(Ptr{Int8}, s.s) and drop the definitions of pointer (why does this function still exist anyways?)

hex2bytes(s::Union{String,AbstractVector{UInt8}}) = hex2bytes!(Vector{UInt8}(uninitialized, length(s) >> 1), s)

_nbytes(s::String) = sizeof(s)
_nbytes(s::AbstractVector{UInt8}) = length(s)
Member

isn't this also sizeof? I thought we define that sizeof means nbytes

Member Author

I was being super paranoid, but you're probably right.

base/c.jl Outdated
transcode(::Type{String}, src) = String(transcode(UInt8, src))

-function transcode(::Type{UInt16}, src::Vector{UInt8})
+function transcode(::Type{UInt16}, src::AbstractVector{UInt8})
Member

this function assumes that axes(src) === 1:length(src)

Member Author

Surely it's not the only function like that... should I use Union{Vector{UInt8},StringBytes}?

Member

I think so, let's keep this as specific as possible – we can always generalize as needed.

@iamed2
Contributor

iamed2 commented Dec 22, 2017

+1 for StringBytes being exported!

For the deprecation, you could deprecate to unsafe_wrap but include a message in the deprecation suggesting StringBytes?

@@ -188,8 +188,8 @@ julia> String(take!(io))
"Haho"
```
"""
-IOBuffer(str::String) = IOBuffer(Vector{UInt8}(str))
-IOBuffer(s::SubString{String}) = IOBuffer(view(Vector{UInt8}(s.string), s.offset + 1 : s.offset + sizeof(s)))
+IOBuffer(str::String) = IOBuffer(unsafe_wrap(Vector{UInt8}, str))
Member

Probably also IOBuffer(s::StringBytes) = IOBuffer(s.s)

Member

I have a different question, but it relates to the same line. Why is unsafe_wrap used here: for efficiency? I am asking because it seems that this should work equally well with StringBytes.

Member Author

I tried using StringBytes and a few things broke that I didn't feel like dealing with. The type IOBuffer actually implies that a Vector{UInt8} will be used:

julia> IOBuffer
Base.GenericIOBuffer{Array{UInt8,1}}

so returning a GenericIOBuffer{StringBytes} instead can cause surprises.

Wrap a `String` (without copying) in an immutable vector-like object that accesses the bytes
of the string's representation.
"""
struct StringBytes <: AbstractVector{UInt8}
Member

@stevengj stevengj Dec 29, 2017

<: DenseVector{UInt8}, and include strides methods?

@JeffBezanson
Member Author

@StefanKarpinski @bkamins Approve?

@bkamins
Member

bkamins commented Dec 31, 2017

Thanks! I understand the target functionality is:

  • convert(::Type{Vector{UInt8}}, ::String) is removed;
  • Vector{UInt8}(::String) copies the data;
  • unsafe_wrap returns a Vector{UInt8} that gives access to the String data;
  • the StringBytes type gives a read-only version; the drawback is that it is only a DenseVector{UInt8}.

I think it is very good and StringBytes should be exported. The deprecation approach was clear to me.
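The three access paths summarized above can be sketched roughly as follows. This is a minimal illustration that assumes a released Julia 1.x, where the read-only wrapper is obtained through `codeunits` (the type this thread still calls StringBytes); it is not taken from the PR's own tests.

```julia
s = "abc"

# Copying conversion: mutating the copy leaves the string untouched.
v = Vector{UInt8}(s)
v[1] = UInt8('x')
@assert s == "abc"

# In-place wrapper: shares the string's memory, so it is only read here.
# Writing through it would mutate an immutable String, hence the `unsafe_` name.
w = unsafe_wrap(Vector{UInt8}, s)
@assert w[1] == 0x61

# Read-only view: indexing works, but no setindex! method is defined for it.
cu = codeunits(s)
@assert cu[1] == 0x61

# Attempting a write through the view raises an error (the exact exception
# type varies between Julia versions, so it is not checked here).
write_failed = try
    cu[1] = 0x00
    false
catch
    true
end
```

The copy is the safe default when mutation is needed; the view avoids the allocation when the bytes only need to be read.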

@StefanKarpinski
Member

We might want to hold off on exporting StringBytes. It feels a bit ad hoc and like it could in the future be a special case of something more general. E.g. we were discussing making codeunits return just such an object and writing length(codeunits(s)) instead of ncodeunits(s) and eltype(codeunits(s)) instead of what is currently codeunit(s), etc. We'd want to be able to return an immutable vector-like type for that with different kinds of code units in it.
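For illustration, the equivalences being discussed can be written out as below. This sketch assumes a released Julia 1.x, where `codeunits(s)` does return such a vector-like object; at the time of this comment it was only a proposal.

```julia
s = "héllo"          # 5 characters, 6 UTF-8 code units
units = codeunits(s)

# Length of the code unit view versus the dedicated query function.
@assert length(units) == ncodeunits(s)

# Element type of the view versus the one-argument form of codeunit.
@assert eltype(units) == codeunit(s)

# Indexing the view versus the two-argument form of codeunit.
@assert units[1] == codeunit(s, 1)
```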

@StefanKarpinski
Member

If you want me to, I could make a PR where codeunits(s) returns a StringBytes object instead.

@stevengj
Member

length(codeunits(s)) seems unnecessarily expensive if it requires one to create an intermediate StringBytes object. Or is that cheap nowadays?

@StefanKarpinski
Member

> Or is that cheap nowadays?

Yes, that's my concern as well. I'm not sure if it's cheap yet. cc @Keno, @vtjnash

@Keno
Member

Keno commented Dec 31, 2017

We're pretty good at eliding allocations these days, so I suspect it'd be fine, but I'd recommend verifying in the cases you're interested in. One thing in particular is that we can only elide allocations if both the allocation and all its uses are inlined into one function.

@vtjnash
Member

vtjnash commented Dec 31, 2017

> If you want me to, I could make a PR where codeunits(s) returns a StringBytes object instead.

If so, it seems like the name is wrong. Either it would be specifically a stringbytes(s) function (returning a byte iterator) or StringCodeUnits (representing iteration over code units).

@StefanKarpinski
Member

Yes, clearly the name StringBytes would not be appropriate for a more generic type providing access to the code units of a string.

@JeffBezanson
Member Author

I think we don't want string types to have to define codeunits, essentially because returning a container is more difficult than returning an integer. For example it's easy to have an inefficient implementation (e.g. that constructs a new Vector on every call), or we might see multiple redundant types all similar to StringBytes.

I can rename this CodeUnits and return it as a default implementation of codeunits. How's that?

@StefanKarpinski
Member

I think if the type could be called CodeUnits{UInt8} that would be good. Otherwise, we can just continue to call this StringBytes for now, and in the future CodeUnits{UInt8} or ImmutableVector{UInt8} could become a more general version of this, with StringBytes as an alias.

@JeffBezanson
Member Author

OK, see how this grabs you.

let v = unsafe_wrap(Vector{UInt8}, "abc")
s = String(v)
@test_throws BoundsError v[1]
push!(v, UInt8('x'))
Member

I'm confused. Why doesn't this line throw an error?

Member Author

It doesn't need to. When push! is called new storage is allocated for the vector (also as a String in case you want to make more strings from it).

@@ -60,7 +60,11 @@ This representation is often appropriate for passing strings to C.
String(s::AbstractString) = print_to_string(s)
String(s::Symbol) = unsafe_string(unsafe_convert(Ptr{UInt8}, s))

-(::Type{Vector{UInt8}})(s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s)
+unsafe_wrap(::Type{Vector{UInt8}}, s::String) = ccall(:jl_string_to_array, Ref{Vector{UInt8}}, (Any,), s)
Member

Should we have unsafe_wrap(::Type{Array}, s) = unsafe_wrap(Vector{UInt8}, s)?

@StefanKarpinski
Member

This looks good to me. I guess the wrapping of the string object itself is fairly essential here, which inherently couples the CodeUnits type to the strings that it wraps. I guess it's possible that we might at some point be able to return a generic immutable code unit vector object, but that seems fairly distant, so probably 2.0 material in any case.

@JeffBezanson
Member Author

> be able to return a generic immutable code unit vector object

What's the difference? This should work for anything that implements ncodeunits and codeunit.
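As a rough sketch of that point, a read-only wrapper needs nothing from the wrapped string beyond `ncodeunits` and `codeunit`. The type name `UnitView` below is hypothetical and chosen for illustration; it is not the definition merged in this PR.

```julia
# Hypothetical illustration type; the name `UnitView` is an assumption, not part of Base.
struct UnitView{T,S<:AbstractString} <: AbstractVector{T}
    s::S
    # codeunit(s) returns the code unit type of the string (UInt8 for String).
    UnitView(s::S) where {S<:AbstractString} = new{codeunit(s),S}(s)
end

# Only the read side of the array interface is defined, so the view is read-only.
Base.size(v::UnitView) = (ncodeunits(v.s),)
Base.IndexStyle(::Type{<:UnitView}) = IndexLinear()
Base.getindex(v::UnitView, i::Int) = codeunit(v.s, i)

v = UnitView("abc")
bytes = collect(v)   # materialize the view into an ordinary Vector
```

Because the element type is taken from `codeunit(s)`, the same wrapper would in principle also cover string types with wider code units.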

@StefanKarpinski
Member

StefanKarpinski commented Jan 2, 2018

> What's the difference?

Just that CodeUnits{UInt8,String} is string-specific whereas we could potentially just want an ImmutableVector{UInt8} type and return that. The ideal structuring here would be inverted: a String would be a wrapper around an ImmutableVector{UInt8} object, which is what we would return from codeunits(::String). Given that we're not going to have that for 1.0, I think this arrangement is good.

@JeffBezanson JeffBezanson changed the title from "RFC: safer vector<->string conversions, fixing #24388" to "safer vector<->string conversions, fixing #24388" Jan 3, 2018
@JeffBezanson JeffBezanson merged commit 2043060 into master Jan 3, 2018
@JeffBezanson JeffBezanson deleted the jb/vectorstring branch January 3, 2018 18:48
maleadt added a commit to JuliaGPU/CUDAdrv.jl that referenced this pull request Jan 4, 2018
Labels
strings "Strings!"
7 participants