Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Arrow migration #87

Merged
merged 12 commits into from
Mar 6, 2023
Merged

feat: Arrow migration #87

merged 12 commits into from
Mar 6, 2023

Conversation

yevgenypats
Copy link
Member

@yevgenypats yevgenypats commented Mar 5, 2023

Blocked by cloudquery/plugin-sdk#724

but ready for initial review and discussion. I can do a short walkthrough for anyone up for review.

Apache arrow fork is here (We use it until this is merged): https://github.com/cloudquery/arrow/tree/feat_extension_builder.

Some more notes:

  • Right now we are just migrating the json writer/reader to use apache arrow so we can roll format by format and see if there are any real world issues before we roll this to everywhere instead of our own type system.

internal/cqarrow/arrow.go Outdated Show resolved Hide resolved
Copy link
Member

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good (didn't test). So the TDLR for this PR is that we use Apache arrow type system instead of ours (extending types they don't support)?

A walkthrough will be great, and also seeing them accept some of our extended types as contributions

case schema.TypeIntArray:
typ = arrow.ListOf(arrow.PrimitiveTypes.Int64)
case schema.TypeTimestamp:
typ = arrow.FixedWidthTypes.Timestamp_us
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment that us means micros seconds (and not US timezone)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow type is named badly, should've been arrow.FixedWidthTypes.TimestampMicros

return b.Unmarshal(dec)
}

// UUIDArray is a simple array which is a FixedSizeBinary(16)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: It looks like all the comments in this file that refer to FixedSizeBinary or UUID need to be updated (they were copied from UUID)

kodiakhq bot pushed a commit to cloudquery/plugin-sdk that referenced this pull request Mar 6, 2023
This should unblock cloudquery/filetypes#87

Add supports for timestamp to decode arrow string and for byte to decode base64 (backward compat is saved)
yevgenypats and others added 9 commits March 6, 2023 15:39
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
Co-authored-by: Kemal <223029+disq@users.noreply.github.com>
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
Co-authored-by: Herman Schaaf <hermanschaaf@gmail.com>
@yevgenypats yevgenypats force-pushed the feat/arrow_migration branch from cded2e3 to 2ce1b28 Compare March 6, 2023 13:40
"github.com/google/go-cmp/cmp"
)

func DiffSchema(sc, o *arrow.Schema) (string, bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is some dead code that can be removed?

if diff := cmp.Diff(arrowSchema.String(), expecetdSchema.String()); diff != "" {
t.Errorf(diff)
}
// if diff, _ := DiffSchema(arrowSchema, expecetdSchema); diff != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The only use of DiffSchema is here and it's commented out)

b.AppendNull()
return
}
bytes, err := json.Marshal(v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we need to use json.Encoder here as well, like we had to in plugin-sdk (https://github.com/cloudquery/plugin-sdk/pull/714/files)?

Copy link
Member

@hermanschaaf hermanschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I tested it against the old json filetypes plugin for a couple of sources and the output was the byte-for-byte the same 👍 We may want to consider using json.Encoder but I think it's not a blocker

@yevgenypats yevgenypats added the automerge Add to automerge PRs once requirements are met label Mar 6, 2023
@kodiakhq kodiakhq bot merged commit 135f022 into main Mar 6, 2023
@kodiakhq kodiakhq bot deleted the feat/arrow_migration branch March 6, 2023 18:47
@cq-bot cq-bot mentioned this pull request Mar 6, 2023
kodiakhq bot pushed a commit that referenced this pull request Mar 7, 2023
🤖 I have created a release *beep* *boop*
---


## [1.5.0](v1.4.2...v1.5.0) (2023-03-06)


### Features

* Arrow migration ([#87](#87)) ([135f022](135f022))


### Bug Fixes

* **deps:** Update golang.org/x/xerrors digest to 04be3eb ([#78](#78)) ([5d47135](5d47135))
* **deps:** Update module github.com/apache/thrift to v0.18.0 ([#83](#83)) ([bce51fa](bce51fa))
* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.38.1 ([#72](#72)) ([0eadcfd](0eadcfd))
* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.38.2 ([#74](#74)) ([90bdbba](90bdbba))
* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.39.0 ([#75](#75)) ([b65db2e](b65db2e))
* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.39.1 ([#76](#76)) ([c03025e](c03025e))
* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.40.0 ([#77](#77)) ([7ec1df0](7ec1df0))
* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.41.0 ([#86](#86)) ([5b89f33](5b89f33))
* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.42.0 ([#88](#88)) ([99da3b2](99da3b2))
* **deps:** Update module github.com/klauspost/compress to v1.16.0 ([#84](#84)) ([a9c9da3](a9c9da3))
* **deps:** Update module github.com/pierrec/lz4/v4 to v4.1.17 ([#79](#79)) ([fbc42d4](fbc42d4))
* **deps:** Update module github.com/stretchr/testify to v1.8.2 ([#80](#80)) ([b080141](b080141))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge Add to automerge PRs once requirements are met feat
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants