-
Notifications
You must be signed in to change notification settings - Fork 3
Add unsafe package to perform zero-allocation conversion from string to byte slice #49
Conversation
unsafe/string.go
Outdated
if len(s) == 0 { | ||
return nil | ||
} | ||
sh := (*reflect.StringHeader)(unsafe.Pointer(&s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You actually have to do all casts (string
-> StringHeader
-> SliceHeader
-> []byte
) in the same expression, because currently after you cast s
to StringHeader
, a potential GC from the future version of Go could preempt the goroutine and move the pointer in s
, while sh
's uintptr will be left unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not actually come to fruition. There's a proposal to add two new types unsafe.String
and unsafe.Slice
to replace reflect.StringHeader
and reflect.SliceHeader
. The Data
field of the new types would be unsafe.Pointer
instead of uintptr
and unsafe.Pointer
types will be updated by GC from what I can gather on this thread . The proposal is marked Go2 though so I gather it's not very high on the list of priorities. With that being said, I also don't know if Go will have a compacting GC anytime soon.
unsafe/string.go
Outdated
// where we assign the data pointer of the string header to the data | ||
// pointer of the slice header to make sure the underlying bytes don't | ||
// get GC'ed before the assignment happens. | ||
return *(*[]byte)(unsafe.Pointer(&reflect.SliceHeader{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsafe package states: "In general, reflect.SliceHeader and reflect.StringHeader should be used only as *reflect.SliceHeader and *reflect.StringHeader pointing at actual slices or strings, never as plain structs. A program should not declare or allocate variables of these struct types." And they offer the following pattern as being a valid way to update a string's Data
field:
var s string
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s)) // case 1
hdr.Data = uintptr(unsafe.Pointer(p)) // case 6 (this case)
hdr.Len = n
Consequently, I think we may be better off doing something of the form:
var b []byte
byteHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
byteHeader.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
l := len(s)
byteHeader.Len = l
byteHeader.Cap = l
return b
This way we:
- Address Andrey's comment above by setting the
Data
field of the byte slice to be the same as theData
field of the string in one expression. - Ensure
s
is live after updating theData
fields by callinglen(s)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment still stays here, since there's a gap between reading the pointer from string header and writing a pointer to slice header, which potentially in the future can be used by the GC to preempt the goroutine. You literally need to read and write in the same expression, e.g.:
return *(*[]byte)(unsafe.Pointer(&reflect.SliceHeader{
Data: (*reflect.StringHeader)(unsafe.Pointer(&s)).Data,
Cap: len(s),
Len: len(s),
}))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I'd like to point out that since we dynamically create a slice here, its memory won't be tracked by the GC and if the "underlying" string gets collected at any point in time while the slice is still being used, then the slice will end up having a dangling reference leading to all kinds of weird behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fail to see how you're comment applies here, why would Data: (*reflect.StringHeader)(unsafe.Pointer(&s)).Data,
be okay but not byteHeader.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
? It appears equivalent to your example to me since afterwards we have a byte slice pointing to the memory.
I also don't see why the memory wouldn't be tracked by GC. When the GC looks at the root pointers on the stack it will see a byte slice pointing to the memory that used to be pointed to by the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean to say is that if the GC can interrupt the load and subsequent store in the expression byteHeader.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
then surely it could do the same when we create the struct and do Data: (*reflect.StringHeader)(unsafe.Pointer(&s)).Data,
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, my comment about dangling pointers applies to all examples and is valid even in the light of the documentation code snippet provided by Jerome since p
there (on line 3) is a valid pointer all along, unlike sh.Data
in our examples.
My comment about GC tracking memory after the string gets collected applies to the original code, not the proposed changes where we start with allocating a real []byte
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that there is a problem in the xxhash implementation, I offered it just to highlight the fact that it takes the approach of allocating a real byte slice (i.e. var b []byte
) and not allocating a byte slice through the use of a SliceHeader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree about the possibility of a dangling pointer, but it will only be an issue for a GC that moves memory and updates references. There is already a proposal to introduce a new unsafe representation for slices and strings whose Data
field would be an unsafe.Pointer
instead of an uintptr
. If that proposal lands before a compacting GC is introduced then we can update this package to use those new types and won't run into this issue even with a compacting GC. Needless to say, we'll need to be vigilant during new Go releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, actually, on second thought, we should be fine even if the GC begins moving references with this approach because what we are doing is using the reflect Header structs to update the internal unsafe.Pointer
s in the runtime representation of strings and slices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we'd have a dangling pointer. If GC relocates the string's underlying memory, it'll need to fix up the str field in the stringStruct object on the stack:
type stringStruct struct {
str unsafe.Pointer
len int
}
If that's the case, it means it'll automatically fix up the Data pointer in the string header because the string header pointer is literally just reinterpreting cast the stringStruct pointer and points to the exact same location on the stack where the stringStruct variable sits.
1 similar comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
cc @kobolog @robskillington @jeromefroe @cw9 @prateek
This PR adds the unsafe package to perform zero-allocation conversion from string to byte slices, which will significantly reduce GC when we need to encode string ids as byte slices.