GH-34453: [Go] Support Builders for user defined extensions #34454

Merged
merged 8 commits into from
Mar 10, 2023

Contributor

@yevgenypats yevgenypats commented Mar 4, 2023

This is meant to open a discussion, since it's a medium-sized change. It should close #34453 and give users the ability to define a custom Builder for their extensions, just as they define ExtensionTypes and ExtensionArrays.

@yevgenypats yevgenypats requested a review from zeroshade as a code owner March 4, 2023 19:45
github-actions bot commented Mar 4, 2023

⚠️ GitHub issue #34453 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-34453: [GO] Support Builders for user defined extensions GH-34453: [Go] Support Builders for user defined extensions Mar 4, 2023
Member

@zeroshade zeroshade left a comment

Took a look through, thoughts?

Comment on lines 127 to 129
// this should return array.Builder interface but we cannot import due to cycle import, so we use
// interface{} instead. At least for
NewBuilder(mem memory.Allocator, dt ExtensionType) interface{}
Member

Should we follow the same pattern as ArrayType and add a BuilderType method that returns a reflect.Type, which we then use to wrap the ExtensionBuilder? This also avoids the import cycle.

Another thing to consider is that this is going to break any and all existing Extension types in other consumers' codebases. We should probably make a second interface type which contains the BuilderType method so that we can just use a type assertion test in NewBuilder rather than break existing consumers?
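A minimal sketch of the two ideas in this comment, using stand-in types rather than the real Arrow ones: a reflect.Type-returning BuilderType method placed on a separate, optional interface, detected with a type assertion. BuilderTyper, LegacyType, and the trimmed-down ExtensionType and ExtensionBuilder below are hypothetical names for illustration only.

```go
package main

import (
	"fmt"
	"reflect"
)

// ExtensionType is a stand-in for the existing interface; it is left unchanged.
type ExtensionType interface {
	ExtensionName() string
}

// ExtensionBuilder is a stand-in for the generic extension builder.
type ExtensionBuilder struct {
	Type ExtensionType
}

// BuilderTyper is the hypothetical optional second interface.
type BuilderTyper interface {
	BuilderType() reflect.Type
}

// LegacyType predates the new interface and implements only ExtensionType.
type LegacyType struct{}

func (LegacyType) ExtensionName() string { return "legacy" }

// UUIDBuilder embeds the generic builder so it can be filled in via reflection.
type UUIDBuilder struct {
	*ExtensionBuilder
}

// UUIDType opts in to a custom builder.
type UUIDType struct{}

func (UUIDType) ExtensionName() string     { return "uuid" }
func (UUIDType) BuilderType() reflect.Type { return reflect.TypeOf(UUIDBuilder{}) }

// NewBuilder wraps the generic builder only when the type opts in.
func NewBuilder(dt ExtensionType) any {
	base := &ExtensionBuilder{Type: dt}
	bt, ok := dt.(BuilderTyper)
	if !ok {
		return base // existing extension types keep compiling and working
	}
	wrapped := reflect.New(bt.BuilderType())           // pointer to a zero custom builder
	wrapped.Elem().Field(0).Set(reflect.ValueOf(base)) // fill the embedded field
	return wrapped.Interface()
}

func main() {
	fmt.Printf("%T\n", NewBuilder(LegacyType{}))
	fmt.Printf("%T\n", NewBuilder(UUIDType{}))
}
```

Because the check is a runtime type assertion, types that never heard of BuilderTyper are unaffected. The cost is that a mistake in the custom builder's layout (here, the embedded field not being first) only shows up at run time.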

Contributor Author

@yevgenypats yevgenypats Mar 7, 2023

On the first point: we can keep the same pattern, sure. But I think this approach should (1) perform better, since there is no reflection, and (2) give much better type safety and a better developer experience, since the type of this function is explicit. (I also wanted to return Builder, but had to fall back to interface{} because of a cyclic import that I didn't want to fix as part of this PR.)

On the second point, I'm not sure I followed. Can you share an example of what this is going to break? The only place we use this function now is here, and I fall back to the previous builder.

Member

For the second point: Note that you had to add the NewBuilder method to the existing extension types. By adding a new method to the interface, anyone who has their own extension types defined that doesn't already have this method defined will have a compile error when they upgrade to this version of Arrow which adds the method because their structs will no longer meet the interface for ExtensionType.

> But actually I think this should 1) provide better performance, no reflection. 2) much better type safety, and better developer experience as it is clear what should be the type of this function (I also wanted to return Builder but had to fallback to interface due to some cyclic import that didn't want to fix part of this PR).

In general, I wouldn't expect creating a builder to be a performance bottleneck as consumers shouldn't create builders repeatedly. That said, a couple ideas I've had so far:

  • As was previously done for Arrays, we could move the definition of the Builder interface into the arrow package directly and then add an alias in the array package so that we don't break any consumers (with a deprecation notice telling people to use arrow.Builder instead).
  • If we're going in this direction, rather than passing the allocator and the extension type, we should pass an ExtensionBuilder to the method and have this just wrap it and return the wrapped builder. The consumer can also retrieve the extension type and the allocator from the builder directly if they need to. So perhaps something like WrapBuilder(ExtensionBuilder) Builder.
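The alias idea in the first bullet can be sketched as follows. This is an illustration only: it lives in a single package so that it is self-contained, whereas the suggestion is for the interface to live in arrow with the alias left behind in array. ArrayBuilder, countRows, and the one-method Builder are hypothetical stand-ins.

```go
package main

import "fmt"

// Builder stands in for the interface after it has moved to the lower-level
// package (arrow, in the suggestion above).
type Builder interface {
	Len() int
}

// ArrayBuilder stands in for the old name left behind in the array package.
// In the two-package layout this would read `type Builder = arrow.Builder`.
//
// Deprecated: use Builder instead.
type ArrayBuilder = Builder

// ExtensionBuilder is a hypothetical minimal builder.
type ExtensionBuilder struct{ n int }

func (b *ExtensionBuilder) Len() int { return b.n }

// countRows was written against the old name and still compiles, because an
// alias denotes the identical type rather than declaring a new one.
func countRows(b ArrayBuilder) int { return b.Len() }

func main() {
	var b Builder = &ExtensionBuilder{n: 3}
	fmt.Println(countRows(b))
}
```

Since an alias is the same type under another name, values flow between old and new call sites without conversion, which is what keeps existing consumers compiling.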

Contributor Author

Re breaking: I don't think this will break anything, because there is a default in the ExtensionType that just returns nil to keep this backward compatible. If the method is not implemented, I call the old NewExtensionBuilder in the NewBuilder function.
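A rough sketch of how such a nil-returning default can keep existing types compiling. It assumes the default lives on a base struct that every extension type embeds; the thread does not spell out where the default is defined, so ExtensionBase, genericBuilder, and the other names here are hypothetical stand-ins.

```go
package main

import "fmt"

// Builder is a stand-in for the builder interface.
type Builder interface{ Kind() string }

// ExtensionType is a stand-in interface that has just gained NewBuilder.
type ExtensionType interface {
	ExtensionName() string
	NewBuilder() Builder
}

// ExtensionBase is assumed to be embedded by every extension type. Its
// NewBuilder returns nil, meaning "no custom builder".
type ExtensionBase struct{}

func (ExtensionBase) NewBuilder() Builder { return nil }

// LegacyType was written before NewBuilder existed; it still satisfies
// ExtensionType through the promoted default method.
type LegacyType struct{ ExtensionBase }

func (LegacyType) ExtensionName() string { return "legacy" }

type genericBuilder struct{}

func (genericBuilder) Kind() string { return "generic" }

type uuidBuilder struct{}

func (uuidBuilder) Kind() string { return "uuid" }

// UUIDType overrides the default with its own builder.
type UUIDType struct{ ExtensionBase }

func (UUIDType) ExtensionName() string { return "uuid" }
func (UUIDType) NewBuilder() Builder   { return uuidBuilder{} }

// NewBuilder falls back to the generic builder when the type returns nil.
func NewBuilder(dt ExtensionType) Builder {
	if b := dt.NewBuilder(); b != nil {
		return b
	}
	return genericBuilder{}
}

func main() {
	fmt.Println(NewBuilder(LegacyType{}).Kind())
	fmt.Println(NewBuilder(UUIDType{}).Kind())
}
```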

Contributor Author

@zeroshade can you please take a look at this comment ^ ? (This is why I haven't requested a re-review yet, as I need your guidance here.)

Member

Ah, I missed that, sorry! And you do check for nil in NewBuilder, so that's fair. That avoids the breakage; you're completely correct.

So just need to address the other point in possibly changing the signature / solving the import cycle.

Contributor Author

Re using Builder in the signature: I think it's more complex than it seems (I tried it initially). For example, there is the newData *Data function, which means we would also need to move Data into the arrow package. In that case it might be even easier to move datatype_extension.go to array, because array can import the arrow package but not the other way around. (There are further complications, too: the interface has private functions, which would cause a compilation error if we moved it to a different package.) I can't think of a much easier way without a big refactor, which I think is better to avoid at the moment, so I'd just do the runtime check. WDYT?

Member

Darn, that's really annoying..... I agree it's better to avoid a big refactor atm.

I'd still like to change the signature, though, so that consumers don't have to both create the underlying storage builder and wrap it.

Instead of adding this method to the ExtensionType interface, we could just go with an interface defined in the array package which would let us use Builder in the method definition. something like:

type ExtensionTypeCustomBuilder interface {
    NewBuilder(ExtensionBuilder) Builder
}

Then in builder.go you can do:

case arrow.EXTENSION:
    typ := dtype.(arrow.ExtensionType)
    bldr := NewExtensionBuilder(mem, typ)
    if custom, ok := typ.(ExtensionTypeCustomBuilder); ok {
        return custom.NewBuilder(bldr)
    }
    return bldr

What do you think?
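To make the proposal concrete from the extension author's side, here is a self-contained sketch using stand-in types instead of the real arrow and array packages. The ExtensionTypeCustomBuilder name and the NewBuilder dispatch follow the comment above; passing the builder by pointer, the storage field, and the Append method are assumptions made for illustration.

```go
package main

import "fmt"

// Builder and ExtensionType are trimmed-down stand-ins for the real interfaces.
type Builder interface{ Len() int }

type ExtensionType interface{ ExtensionName() string }

// ExtensionBuilder is a stand-in for the generic builder created by the library.
type ExtensionBuilder struct {
	typ     ExtensionType
	storage [][]byte
}

func (b *ExtensionBuilder) Len() int { return len(b.storage) }

// ExtensionTypeCustomBuilder is the optional interface from the proposal
// (pointer parameter assumed here).
type ExtensionTypeCustomBuilder interface {
	NewBuilder(*ExtensionBuilder) Builder
}

// UUIDBuilder adds a typed Append on top of the generic builder.
type UUIDBuilder struct{ *ExtensionBuilder }

func (b *UUIDBuilder) Append(v [16]byte) {
	b.storage = append(b.storage, v[:]) // v is a per-call copy, so no aliasing
}

// UUIDType opts in by wrapping the builder it is handed.
type UUIDType struct{}

func (UUIDType) ExtensionName() string { return "uuid" }
func (UUIDType) NewBuilder(b *ExtensionBuilder) Builder {
	return &UUIDBuilder{ExtensionBuilder: b}
}

// NewBuilder mirrors the dispatch shown above: the library creates the
// generic builder, then lets the type wrap it if it wants to.
func NewBuilder(dt ExtensionType) Builder {
	bldr := &ExtensionBuilder{typ: dt}
	if custom, ok := dt.(ExtensionTypeCustomBuilder); ok {
		return custom.NewBuilder(bldr)
	}
	return bldr
}

func main() {
	b := NewBuilder(UUIDType{})
	b.(*UUIDBuilder).Append([16]byte{1, 2, 3})
	fmt.Println(b.Len())
}
```

With this shape the extension author never constructs the storage builder; they only decide how to wrap the one the library already built.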

Contributor Author

I like that! Updated.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 6, 2023
kodiakhq bot pushed a commit to cloudquery/filetypes that referenced this pull request Mar 6, 2023
Blocked by cloudquery/plugin-sdk#724

but ready for initial review and discussion. I can do a short walkthrough for anyone up for review.

Apache arrow fork is here (We use it until [this](apache/arrow#34454) is merged): https://github.com/cloudquery/arrow/tree/feat_extension_builder.

Some more notes:
- Right now we are just migrating the json writer/reader to use apache arrow so we can roll format by format and see if there are any real world issues before we roll this to everywhere instead of our own type system.
@yevgenypats
Contributor Author

> Took a look through, thoughts?

@zeroshade Thanks for the initial review! I've commented and also linked to a few examples of how we use it.

Member

@zeroshade zeroshade left a comment

Just request a re-review once this is ready for a new review, thanks.

@yevgenypats yevgenypats force-pushed the feat/extension_builder branch from 4a8a202 to 1d38d96 Compare March 9, 2023 07:54
@yevgenypats yevgenypats force-pushed the feat/extension_builder branch from 1d38d96 to 8f8429a Compare March 9, 2023 07:58
@yevgenypats yevgenypats requested a review from zeroshade March 9, 2023 08:00
@yevgenypats
Contributor Author

Ready for another review now, though it looks like some workflows are failing. I don't think that is connected to this PR, though.

"golang.org/x/xerrors"
)

type UUIDBuilder struct {
    *array.ExtensionBuilder
    dtype *UUIDType
Member

the failures are because this field is unused here

Contributor Author

Oh ok. Fixed. I wasn't able to find it from the action. Where should I look for it next time? (Can I run those linters locally as well?)

Member

So that was actually found by the Go build failing, and I saw it by looking at the log file for the failing Go workflows.

You can also run the linters (or most of the workflows) by following the instructions here: https://arrow.apache.org/docs/developers/continuous_integration/archery.html to install the archery utility in the repo and then running archery docker run <job> as described here.

Contributor Author

Sounds good. Thanks, I'll try those for next PR! Tests passing now.

@yevgenypats yevgenypats requested a review from zeroshade March 9, 2023 16:58
Member

@zeroshade zeroshade left a comment

A few more nitpicks. We're almost there! Thanks again for all this!

go/arrow/array/extension_builder.go (resolved)
go/arrow/datatype_extension.go (outdated, resolved)
go/arrow/internal/testing/types/extension_test.go (outdated, resolved)
go/arrow/internal/testing/types/extension_types.go (outdated, resolved)
Comment on lines +55 to +59
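data := make([][]byte, len(v))
// Index into v rather than slicing the range variable: if the elements are
// arrays, that variable is reused before Go 1.22 and every data[i] would alias it.
for i := range v {
    data[i] = v[i][:]
}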
b.ExtensionBuilder.Builder.(*array.FixedSizeBinaryBuilder).AppendValues(data, valid)
Member

this could be done more efficiently if desired via resize + UnsafeAppend or other methods. Not necessary for this PR but just something to think about.
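The reserve-then-unsafe-append pattern mentioned here can be sketched with a toy fixed-width builder. This is not the Arrow FixedSizeBinaryBuilder API; fixedBuilder and its Reserve, UnsafeAppend, and AppendValues methods are hypothetical stand-ins that only show the shape of the optimization: grow once up front, then copy each value straight into the flat buffer with no intermediate [][]byte.

```go
package main

import "fmt"

const uuidWidth = 16

// fixedBuilder is a hypothetical stand-in for a fixed-size binary builder.
type fixedBuilder struct {
	data  []byte // values stored back to back, uuidWidth bytes each
	valid []bool
}

// Reserve grows the buffers once so that n more values fit.
func (b *fixedBuilder) Reserve(n int) {
	if need := len(b.data) + n*uuidWidth; need > cap(b.data) {
		grown := make([]byte, len(b.data), need)
		copy(grown, b.data)
		b.data = grown
	}
	if need := len(b.valid) + n; need > cap(b.valid) {
		grown := make([]bool, len(b.valid), need)
		copy(grown, b.valid)
		b.valid = grown
	}
}

// UnsafeAppend assumes Reserve already made room. It reslices instead of
// calling append, so it panics rather than reallocating if room is missing.
func (b *fixedBuilder) UnsafeAppend(v [uuidWidth]byte, ok bool) {
	n := len(b.data)
	b.data = b.data[:n+uuidWidth]
	copy(b.data[n:], v[:])
	b.valid = b.valid[:len(b.valid)+1]
	b.valid[len(b.valid)-1] = ok
}

// AppendValues reserves once, then appends without building a [][]byte.
// An empty valid slice is treated as "all values valid".
func (b *fixedBuilder) AppendValues(vs [][uuidWidth]byte, valid []bool) {
	b.Reserve(len(vs))
	for i := range vs {
		b.UnsafeAppend(vs[i], len(valid) == 0 || valid[i])
	}
}

func main() {
	b := &fixedBuilder{}
	b.AppendValues([][uuidWidth]byte{{1}, {2}}, []bool{true, false})
	fmt.Println(len(b.data), len(b.valid))
}
```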

go/arrow/internal/testing/types/extension_types.go (outdated, resolved)
@yevgenypats yevgenypats requested a review from zeroshade March 10, 2023 07:21
Member

@zeroshade zeroshade left a comment

just a single whitespace nitpick haha. Fix that and i'll merge this 😄

go/arrow/datatype_extension.go (outdated, resolved)
@yevgenypats yevgenypats requested a review from zeroshade March 10, 2023 18:15
@zeroshade zeroshade merged commit 5219de3 into apache:main Mar 10, 2023
ursabot commented Mar 11, 2023

Benchmark runs are scheduled for baseline = 71f3c56 and contender = 5219de3. 5219de3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.24% ⬆️0.09%] test-mac-arm
[Finished ⬇️1.79% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.82% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 5219de33 ec2-t3-xlarge-us-east-2
[Failed] 5219de33 test-mac-arm
[Finished] 5219de33 ursa-i9-9960x
[Finished] 5219de33 ursa-thinkcentre-m75q
[Finished] 71f3c568 ec2-t3-xlarge-us-east-2
[Finished] 71f3c568 test-mac-arm
[Finished] 71f3c568 ursa-i9-9960x
[Finished] 71f3c568 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Successfully merging this pull request may close these issues.

[Go] Extension Builder Interface
3 participants