Skip to content
This repository has been archived by the owner on Apr 8, 2019. It is now read-only.

Add unsafe package to perform zero-allocation conversion from string to byte slice #49

Merged
merged 2 commits into from
Mar 15, 2017

Conversation

xichen2020
Copy link
Contributor

cc @kobolog @robskillington @jeromefroe @cw9 @prateek

This PR adds the unsafe package to perform zero-allocation conversion from string to byte slices, which will significantly reduce GC when we need to encode string ids as byte slices.

@coveralls
Copy link

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+0.01%) to 85.033% when pulling c84f0c5 on xichen-unsafe-string-byteslice-conversion into 2adc966 on master.

unsafe/string.go Outdated
if len(s) == 0 {
return nil
}
sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You actually have to do all casts (string -> StringHeader -> SliceHeader -> []byte) in the same expression, because currently after you cast s to StringHeader, a potential GC from the future version of Go could preempt the goroutine and move the pointer in s, while sh's uintptr will be left unchanged.

Copy link
Contributor

@jeromefroe jeromefroe Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not actually come to fruition. There's a proposal to add two new types unsafe.String and unsafe.Slice to replace reflect.StringHeader and reflect.SliceHeader. The Data field of the new types would be unsafe.Pointer instead of uintptr and unsafe.Pointer types will be updated by GC from what I can gather on this thread . The proposal is marked Go2 though so I gather it's not very high on the list of priorities. With that being said, I also don't know if Go will have a compacting GC anytime soon.

unsafe/string.go Outdated
// where we assign the data pointer of the string header to the data
// pointer of the slice header to make sure the underlying bytes don't
// get GC'ed before the assignment happens.
return *(*[]byte)(unsafe.Pointer(&reflect.SliceHeader{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unsafe package states: "In general, reflect.SliceHeader and reflect.StringHeader should be used only as *reflect.SliceHeader and *reflect.StringHeader pointing at actual slices or strings, never as plain structs. A program should not declare or allocate variables of these struct types." And they offer the following pattern as being a valid way to update a string's Data field:

var s string
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s)) // case 1
hdr.Data = uintptr(unsafe.Pointer(p))              // case 6 (this case)
hdr.Len = n

Consequently, I think we may be better off doing something of the form:

var b []byte
byteHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
byteHeader.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data

l := len(s)
byteHeader.Len = l
byteHeader.Cap = l
return b

This way we:

  1. Address Andrey's comment above by setting the Data field of the byte slice to be the same as the Data field of the string in one expression.
  2. Ensure s is live after updating the Data fields by calling len(s).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment still stays here, since there's a gap between reading the pointer from string header and writing a pointer to slice header, which potentially in the future can be used by the GC to preempt the goroutine. You literally need to read and write in the same expression, e.g.:

return *(*[]byte)(unsafe.Pointer(&reflect.SliceHeader{
    Data: (*reflect.StringHeader)(unsafe.Pointer(&s)).Data,
    Cap: len(s),
    Len: len(s),
}))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'd like to point out that since we dynamically create a slice here, its memory won't be tracked by the GC and if the "underlying" string gets collected at any point in time while the slice is still being used, then the slice will end up having a dangling reference leading to all kinds of weird behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to see how you're comment applies here, why would Data: (*reflect.StringHeader)(unsafe.Pointer(&s)).Data, be okay but not byteHeader.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data? It appears equivalent to your example to me since afterwards we have a byte slice pointing to the memory.

I also don't see why the memory wouldn't be tracked by GC. When the GC looks at the root pointers on the stack it will see a byte slice pointing to the memory that used to be pointed to by the string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean to say is that if the GC can interrupt the load and subsequent store in the expression byteHeader.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data then surely it could do the same when we create the struct and do Data: (*reflect.StringHeader)(unsafe.Pointer(&s)).Data,.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, my comment about dangling pointers applies to all examples and is valid even in the light of the documentation code snippet provided by Jerome since p there (on line 3) is a valid pointer all along, unlike sh.Data in our examples.
My comment about GC tracking memory after the string gets collected applies to the original code, not the proposed changes where we start with allocating a real []byte.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there is a problem in the xxhash implementation, I offered it just to highlight the fact that it takes the approach of allocating a real byte slice (i.e. var b []byte) and not allocating a byte slice through the use of a SliceHeader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree about the possibility of a dangling pointer, but it will only be an issue for a GC that moves memory and updates references. There is already a proposal to introduce a new unsafe representation for slices and strings whose Data field would be an unsafe.Pointer instead of an uintptr. If that proposal lands before a compacting GC is introduced then we can update this package to use those new types and won't run into this issue even with a compacting GC. Needless to say, we'll need to be vigilant during new Go releases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually, on second thought, we should be fine even if the GC begins moving references with this approach because what we are doing is using the reflect Header structs to update the internal unsafe.Pointers in the runtime representation of strings and slices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we'd have a dangling pointer. If GC relocates the string's underlying memory, it'll need to fix up the str field in the stringStruct object on the stack:

type stringStruct struct {
  str unsafe.Pointer
  len int
}

If that's the case, it means it'll automatically fix up the Data pointer in the string header because the string header pointer is literally just reinterpreting cast the stringStruct pointer and points to the exact same location on the stack where the stringStruct variable sits.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.05%) to 85.072% when pulling a6ce31f on xichen-unsafe-string-byteslice-conversion into 2adc966 on master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.05%) to 85.072% when pulling a6ce31f on xichen-unsafe-string-byteslice-conversion into 2adc966 on master.

Copy link
Contributor

@robskillington robskillington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xichen2020 xichen2020 merged commit bf13462 into master Mar 15, 2017
@xichen2020 xichen2020 deleted the xichen-unsafe-string-byteslice-conversion branch March 15, 2017 21:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants