Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Parquet format arrays #71

Merged
merged 8 commits into from
Feb 19, 2023
Merged

fix: Parquet format arrays #71

merged 8 commits into from
Feb 19, 2023

Conversation

disq
Copy link
Member

@disq disq commented Feb 17, 2023

This PR fixes the array/slice handling in the Parquet filetype by slightly updating the Parquet schema definition.

It's possible to represent UUID types with the proper logical type but it just didn't work with Athena. Also had to do some weird things like defining the uuid as 22-byte FIXED_LEN_BYTE_ARRAY (UUIDs are 16 bytes) but giving it a 16-byte Go slice (not a Go array) and Base64 decoders in reverse transformers etc... And all this didn't even work when it came to UUIDArrays... So I gave up on that for now.

JSON logicaltype seems to somewhat work with DuckDB but doesn't with Athena so I removed that as well.

@github-actions github-actions bot added the fix label Feb 17, 2023
@github-actions github-actions bot added fix and removed fix labels Feb 17, 2023
@disq disq marked this pull request as ready for review February 17, 2023 12:28
@@ -177,6 +177,7 @@ func (ReverseTransformer) ReverseTransformValues(table *schema.Table, values []a
if err := t.Set(v); err != nil {
return nil, fmt.Errorf("failed to convert value %v to type %s: %w", v, table.Columns[i].Type, err)
}
res[i] = t
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, the read-write test panics if there's a type error, instead of failing with a proper (and helpful) error message.

s := pschema.JSONSchemaItemType{
Tag: `name=parquet_go_root, repetitiontype=REQUIRED`,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The item name here doesn't seem to have any significance, parquet_go_root is what the library uses but it's not a const in the parquet world, so I opted to add the table name here so that files are identifiable (with parquet-tools etc.) even if they lost their filenames.

parquet/schema.go Outdated Show resolved Hide resolved
switch col.Type {
case schema.TypeStringArray, schema.TypeIntArray, schema.TypeUUIDArray, schema.TypeCIDRArray, schema.TypeInetArray, schema.TypeMacAddrArray:
opts = append(opts, "repetitiontype=REPEATED")
func arrayElement(t schema.ValueType) schema.ValueType {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use this method in the plugin-sdk in the future.

s.Fields = append(s.Fields, &pschema.JSONSchemaItemType{Tag: tag})

if !isArray(col.Type) { // array types are handled differently, see above
if col.CreationOptions.PrimaryKey || col.CreationOptions.IncrementalKey {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we enforce constraints on incremental keys in other destinations right now, but it seems like a good idea, so I'm fine with this

Copy link
Member

@hermanschaaf hermanschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I tested it out too, looks like it fixes the issue, nice one! 👏

@erezrokah erezrokah added the automerge Add to automerge PRs once requirements are met label Feb 19, 2023
@kodiakhq kodiakhq bot merged commit 4a72a70 into main Feb 19, 2023
@kodiakhq kodiakhq bot deleted the fix/parquet branch February 19, 2023 08:54
kodiakhq bot pushed a commit that referenced this pull request Feb 19, 2023
🤖 I have created a release *beep* *boop*
---


## [1.4.2](v1.4.1...v1.4.2) (2023-02-19)


### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-sdk to v1.38.0 ([#69](#69)) ([b6fa8c8](b6fa8c8))
* Parquet format arrays ([#71](#71)) ([4a72a70](4a72a70))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge Add to automerge PRs once requirements are met fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants