Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversion of String to Vector{UInt8} #24388

Closed
bkamins opened this issue Oct 28, 2017 · 16 comments · Fixed by #30295
Closed

Conversion of String to Vector{UInt8} #24388

bkamins opened this issue Oct 28, 2017 · 16 comments · Fixed by #30295
Assignees
Labels
strings "Strings!"
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Oct 28, 2017

Consider the following code:

s = "123"
v = Vector{UInt8}(s)

Now changing v mutates s.
Since strings should be immutable I would it find it more natural to:

  • define reinterpret(UInt8, s::String) that would perform such in-place conversion (as this is what one would expect from reinterpret);
  • make convert(Vector{UInt8}, s::String) allocate a new vector (so that users do not accidentally mutate strings).

Additionally this would be consistent with the general behavior that Vector{UInt8}(s) for AbstractString other than String would create a copy.

A non-breaking alternative would be to add an information to the documentation that this conversion is in-place.

@kshyatt kshyatt added the strings "Strings!" label Oct 29, 2017
@StefanKarpinski
Copy link
Member

It does seem like observing the mutation of a string should need to involve something unsafe_: all of these are normal and apparently safe operations, which would seem to violate string immutability.

@bkamins
Copy link
Member Author

bkamins commented Oct 30, 2017

If we want to be on a safe side then String(::Vector{UInt8}) also should make a copy by default. The good thing is that it is documented that it does not do it now (as opposed to Vector{UInt8}(::String)), but changing this would be consistent with the above line of thinking.

Regarding the names - there are probably people with more experience with naming conventions in Base than me. Maybe we can use unsafe_string_to_array and unsafe_array_to_string as the underlying C functions are jl_string_to_array and jl_array_to_string. However, I am open to suggestions.

In summary the proposal would be to:

  • change Vector{UInt8}(::String) so that it performs a copy;
  • change String(::Vector{UInt8}) so that it performs a copy;
  • define two new functions performing unsafe conversions between String and Vector{UInt8}; the proposed names are unsafe_string_to_array and unsafe_array_to_string.

If the above are agreed I can make an appropriate PR.

@vtjnash
Copy link
Member

vtjnash commented Oct 30, 2017

That does sound good. Another possible name suggestion would be to make this a method of unsafe_wrap

@malmaud
Copy link
Contributor

malmaud commented Nov 1, 2017

Maybe we should add this to triage, since this would be a potentially breaking change.

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Nov 1, 2017
@JeffBezanson
Copy link
Member

This is going to be a performance disaster. The main problem is that String(take!(buf)) is the idiom for converting an IOBuffer (used as a string builder) to a String in-place. reinterpret(String, take!(buf)) is too long and weird to take its place. Changing String( ) requires a solution to this.

@bkamins
Copy link
Member Author

bkamins commented Nov 1, 2017

Changing String( ) was less of a priority in my opinion as it is well documented - I have added it later to the discussion to have a complete picture.

From my point of view the key problem is to make Vector{UInt8}(s) conversion make a copy. And if only this part would be left, and reinterpret is considered not a very good name for conversion from String to Vector{UInt8}, actually I like adding the unsafe_wrap(::Type{Vector{UInt8}},::String) method as it would not introduce a new name into Base.

@StefanKarpinski
Copy link
Member

As long as some unsafe_ function needs to be called in order to observe string mutability, I think it's fine. The real trouble here is that it's possible to observe string mutation without calling any explicitly unsafe functions.

@JeffBezanson
Copy link
Member

I think the following would be safe and efficient:

  1. Make Vector{UInt8}(s) copy, as requested here. Internally, it should use a String for the new storage so you can go back to String without yet another copy.
  2. String(vec) can continue to share memory (if the vector is String-backed), but we also set the vector's size to 0 and remove its storage so that it's impossible to mutate the string via the vector from then on.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 1, 2017

Being able to "destroy" arrays like that would be generally useful for APIs where a new object "takes control" of the array's contents and anyone else mutating it after that is generally dangerous. And not just for arrays, if we had deterministic finalization a la #7721, being able to make any usage of the finalized object an error after finalization would be ideal.

@JeffBezanson
Copy link
Member

With this, we might want to export StringVector as well, since using it becomes safer.

@JeffBezanson
Copy link
Member

Accepted by triage. We can go with the plan in #24388 (comment). No need to export StringVector.

It would be good to know how many uses of Vector{UInt8}(str) there are. There are a fair number in Base. I think most cases just want to access the bytes in the string, which can be done more efficiently using codeunit. The problem is that changing this will silently give worse performance, so we might want to consider deprecating it at least temporarily.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Nov 2, 2017
@JeffBezanson JeffBezanson added this to the 1.0 milestone Nov 2, 2017
@nalimilan
Copy link
Member

This reminds me of a possible Hermitian! constructor which was evoked at #17367 (comment). The problem was somewhat similar: the constructor would have modified the input (slightly) to have it match the type's requirements. A discussion happened about what constructors are allowed to do to their inputs even if they don't have ! in their names.

If we make String(x::Vector{UInt8}) "destroy" its input, we should make it very clear in the manual that all constructors are allowed to do that. Personally, I'd rather use String! for that, since the vast majority of constructors will treat their inputs as const: to keep the language simple, better state that constructors do not mutate their inputs, and force the exceptions to use !. It's easy to replace String(take!(buf)) with String!(take!(buf)) in existing code to preserve performance, that's even something that Femtocleaner could propose by default.

@JeffBezanson
Copy link
Member

Another idea: take!(buf, String), matching read. That way take! can just be considered a destructive version of read.

@StefanKarpinski
Copy link
Member

read!!?

@JeffBezanson JeffBezanson self-assigned this Nov 21, 2017
JeffBezanson added a commit that referenced this issue Dec 30, 2017
@samoconnor
Copy link
Contributor

Current behaviour of Vector{UInt8}(s) is inconsistent between String and SubString, which is correct? Is the intended behaviour documented somewhere? (there is no mention of strings in ?Vector{UInt8}).

julia> VERSION
v"1.0.1"

julia> s, ss = "FOO", SubString("FOO")
("FOO", "FOO")

julia> v, vv = Vector{UInt8}(s), Vector{UInt8}(ss)
(UInt8[0x46, 0x4f, 0x4f], UInt8[0x46, 0x4f, 0x4f])

julia> v[2], vv[2] = 42, 42
(42, 42)

julia> v, vv
(UInt8[0x46, 0x2a, 0x4f], UInt8[0x46, 0x2a, 0x4f])

julia> s, ss
("F*O", "FOO")

unsafe_wrap behaves as expected:

julia> s, ss = "FOO", SubString("FOO")
("FOO", "FOO")

julia> v, vv = unsafe_wrap(Vector{UInt8}, pointer(s), ncodeunits(s)),
               unsafe_wrap(Vector{UInt8}, pointer(ss), ncodeunits(ss))
(UInt8[0x46, 0x4f, 0x4f], UInt8[0x46, 0x4f, 0x4f])

julia> v[2], vv[2] = 42, 42
(42, 42)

julia> s, ss
("F*O", "F*O")

@nalimilan
Copy link
Member

Woops, good catch!

AFAICT #25241 deprecated Vector{UInt8}(s), but after removing the deprecation in 1.0 the fallback AbstractString method is used. Unfortunately it's defined as unsafe_wrap(Vector{UInt8}, String(s)), which makes a copy for non-String arguments, but not for String arguments... Looks like we forgot to change the behavior in 1.0.

Since strings are supposed to be immutable, I'd say it's OK to change this to make a copy in 1.1. That would qualify as a "minor change".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants