-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34453: [Go] Support Builders for user defined extensions #34454
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look through, thoughts?
go/arrow/datatype_extension.go
Outdated
// this should return array.Builder interface but we cannot import due to cycle import, so we use | ||
// interface{} instead. At least for | ||
NewBuilder(mem memory.Allocator, dt ExtensionType) interface{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we follow the same pattern as ArrayType
and have a BuilderType
method which returns a reflect.Type
that we use to wrap the ExtensionBuilder
with? This also avoids the import cycle.
Another thing to consider is that this is going to break any and all existing Extension types in other consumers' codebases. We should probably make a second interface type which contains the BuilderType
method so that we can just use a type assertion test in NewBuilder
rather than break existing consumers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the first part, we can keep the same pattern, sure. But actually I think this should 1) provide better performance, no reflection. 2) much better type safety, and better developer experience as it is clear what should be the type of this function (I also wanted to return Builder but had to fallback to interface due to some cyclic import that didn't want to fix part of this PR).
Re 2nd thing, not sure I followed, can you share an example of what this going to break? The only place that we use this function now is here and I fallback to the previous builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the second point: Note that you had to add the NewBuilder
method to the existing extension types. By adding a new method to the interface, anyone who has their own extension types defined that doesn't already have this method defined will have a compile error when they upgrade to this version of Arrow which adds the method because their structs will no longer meet the interface for ExtensionType
.
But actually I think this should 1) provide better performance, no reflection. 2) much better type safety, and better developer experience as it is clear what should be the type of this function (I also wanted to return Builder but had to fallback to interface due to some cyclic import that didn't want to fix part of this PR).
In general, I wouldn't expect creating a builder to be a performance bottleneck as consumers shouldn't create builders repeatedly. That said, a couple ideas I've had so far:
- Like was previously done for Arrays, we could shift the definition of the
Builder
interface to thearrow
package directly and then add an alias in thearray
package to ensure we don't break any consumers (with a deprecated message telling people to point atarrow.Builder
instead. - If we're going in this direction, rather than passing the allocator and the extension type, we should pass an
ExtensionBuilder
to the method and have this just wrap it and return the wrapped builder. The consumer can also retrieve the extension type and the allocator from the builder directly if they need to. So perhaps something likeWrapBuilder(ExtensionBuilder) Builder
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re breaking, I don't think this will breaking anything as there is a default in the ExtensionType
that just returns nil
to keep this backward compatible so if this is not implement I call the old NewExtensionBuilder
in the NewBuilder
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeroshade can you please take a look at this comment ^ ? (This is why I didn't re-requested a re-review yet as need your guidance here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I missed that! Sorry! and you do check for the nil in NewBuilder
, okay that's fair. So that avoids the breaking, you're completely correct.
So just need to address the other point in possibly changing the signature / solving the import cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re using Builder
in the signature I think it might be more complex than it seems (I tried it initially). For example newData *Data
function which means we need also to take Data
to arrow
pacakge. In that case it might be even easier to move datatype_extension.go
to array
because we can import from array
the arrow
package but not the otherway around (Also, there are even more complications because the interface has private functions which would cause compilation error if we move it to a different package). Can't think of a much easier way without a big refactor which I think better to avoid atm and just do the runtime check. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Darn, that's really annoying..... I agree it's better to avoid a big refactor atm.
I still want to change the signature though instead of having consumers have to create both the underlying storage builder & wrap it.
Instead of adding this method to the ExtensionType
interface, we could just go with an interface defined in the array
package which would let us use Builder
in the method definition. something like:
type ExtensionTypeCustomBuilder interface {
NewBuilder(ExtensionBuilder) Builder
}
Then in builder.go
you can do:
case arrow.EXTENSION:
typ := dtype.(arrow.ExtensionType)
bldr := NewExtensionBuilder(mem, typ)
if custom, ok := typ.(ExtensionTypeCustomBuilder); ok {
return custom.NewBuilder(bldr)
}
return bldr
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that! Updated.
Blocked by cloudquery/plugin-sdk#724 but ready for initial review and discussion. I can do a short walkthrough for anyone up for review. Apache arrow fork is here (We use it until [this](apache/arrow#34454) is merged): https://github.com/cloudquery/arrow/tree/feat_extension_builder. Some more notes: - Right now we are just migrating the json writer/reader to use apache arrow so we can roll format by format and see if there are any real world issues before we roll this to everywhere instead of our own type system.
@zeroshade Thanks for the initial review! Commented and also linked to a few example on how we use it. |
b6560ef
to
2e0ae3d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just request a re-review once this is ready for a new review, thanks.
4a8a202
to
1d38d96
Compare
1d38d96
to
8f8429a
Compare
Ready now for another review, though looks like some workflows are failing. I think this is not connected though to this PR |
"golang.org/x/xerrors" | ||
) | ||
|
||
type UUIDBuilder struct { | ||
*array.ExtensionBuilder | ||
dtype *UUIDType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the failures are because this field is unused here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok. Fixed. I wasn't able to find it from the action. Where should I look for it next time? (Can I run those linters locally as well?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that was actually found by the Go build failing, and I saw it by looking at the log file for the failing Go workflows.
You can also run the linters (or most of the workflows) by following the instructions here: https://arrow.apache.org/docs/developers/continuous_integration/archery.html to install the archery
utility in the repo and then running archery docker run <job>
as described here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Thanks, I'll try those for next PR! Tests passing now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more nitpicks. We're almost there! Thanks again for all this!
data := make([][]byte, len(v)) | ||
for i, v := range v { | ||
data[i] = v[:] | ||
} | ||
b.ExtensionBuilder.Builder.(*array.FixedSizeBinaryBuilder).AppendValues(data, valid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be done more efficiently if desired via resize + UnsafeAppend
or other methods. Not necessary for this PR but just something to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a single whitespace nitpick haha. Fix that and i'll merge this 😄
Benchmark runs are scheduled for baseline = 71f3c56 and contender = 5219de3. 5219de3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This should serve as discussion as it's a medium change but this should Close #34453 and give users the ability to define custom Builder for their extensions just like they define ExtensionTypes and ExtensionArrays