Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HMS Support for Recap #260

Closed
wants to merge 10 commits into from
Closed

HMS Support for Recap #260

wants to merge 10 commits into from

Conversation

cpard
Copy link

@cpard cpard commented Jun 11, 2023

@criccomini hey dude, take a look at this and let me know what you think, if that approach looks good to you. I can continue.

@criccomini
Copy link
Contributor

criccomini commented Jun 12, 2023

Ya, this looks like a good start. My main nit is the package structure.

I've got a recap.readers package. HiveMetastoreReader should live under this package space. If you need multiple files (it looks like that's a must with Thrift), then I suggest recap.readers.hive.metastore as the package. The Thrift stuff can probably live under recap.readers.hive.metastore.thrift.

I left a few dependency comments and a question as well. I've never used Thrift so I'm still feeling my way around it a bit.

pyproject.toml Outdated
@@ -19,6 +19,8 @@ dependencies = [
"frictionless[json,parquet]>=5.6.3",
"tomli>=2.0.1",
"genson>=1.2.2",
"thrift>=0.16.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like this should be in a hive-metastore group.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right! I'll clean up the groups and I'll continue working on adding the rest of the functionality. This was a very quick PR just to share the generated client with you.

pyproject.toml Outdated
@@ -19,6 +19,8 @@ dependencies = [
"frictionless[json,parquet]>=5.6.3",
"tomli>=2.0.1",
"genson>=1.2.2",
"thrift>=0.16.0",
"psycopg2-binary>=2.9.6",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

psycopg2 is already in dbs group below.

recap/hms/hms.py Outdated
def get_fields(self, database, table):
return self.client.get_fields(database, table)

print("hello!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sorry for that! Need some cleaning!

recap/hms/hms.py Outdated
from thrift.transport import TTransport, TSocket
from thrift.protocol.TBinaryProtocol import TBinaryProtocol

class _HMS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this autogenerated?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _HMS no, I added it as a way to reduce the surface area of the thrift client that is exposed through Recap. The client has a lot of functionality that is not necessarily needed, e.g. to create a catalog or to operate on partitions etc.

@cpard
Copy link
Author

cpard commented Jun 12, 2023

Ya, this looks like a good start. My main nit is the package structure.

I've got a recap.readers package. HiveMetastoreReader should live under this package space. If you need multiple files (it looks like that's a must with Thrift), then I suggest recap.readers.hive.metastore as the package. The Thrift stuff can probably live under recap.readers.hive.metastore.thrift.

I left a few dependency comments and a question as well. I've never used Thrift so I'm still feeling my way around it a bit.

That makes total sense, I'll update the package structure.

cpard added 4 commits June 12, 2023 11:25
added hive data types and basic hive metadata objects
added first implementation of fetching hms table metadata in thrift and return it in the internal representation
LIST = "LIST"
UNION = "UNION"

class HType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why you're creating a new HiveType vs. using recap.types directly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great question and something that was bothering and still bothers me. The Thrift types are too low level in many cases, e.g. the type you get back for the PrincipalType is an int that then you use to instantiate the actual type.

It feels to me that there's a lot of low level complexity there that could / should be hidden from anyone who's building on top of the Hive Metastore. That's why I added this modeling layer for the types. This is also inspired by how Trino is doing it btw in their implementation, I trust their implementation because as a federated query engine they always needed to figure out the right abstractions.

Also, I might be misinterpreting something here in terms of the Recap design. We have Readers and converters. I operated with the assumption that a reader provides access to a "domain" and then there's a converter that turns that into the Recap type system. Is this a wrong assumption?

Copy link
Contributor

@criccomini criccomini Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I might be misinterpreting something here in terms of the Recap design. We have Readers and converters. I operated with the assumption that a reader provides access to a "domain" and then there's a converter that turns that into the Recap type system. Is this a wrong assumption?

More or less.

  • Converter: Given a schema, returns Recap types. The formats are usually text-based (proto, avro, json schema, etc).
  • Reader: Reads a schema from a registry/catalog and returns a StructType.

Readers usually use converters to take the schema that it reads and convert it to a Recap StructType before returning it.

The reason for the distinction is that many different registries/catalogs all have the same schema formats (proto, avro, json schema, thrift). This way the DatahubReader, ConfluentRegistryReader, etc can all use the same converters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I think your HType approach is fine. I've been noodling about breaking the ThriftHiveMetastore stuff you're doing out into a separate pypi package, repo, and project anyway. It seems useful beyond Recap to have a Python-based Hive Metastore API that uses the Thrift interface.

Basically, this library:

https://github.com/quintoandar/hive-metastore-client

But with all the Thrift APIs supported (what you're doing).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I agree that there's value in breaking this into a separate package. Let's chat about how it should be done and I'm happy to take that direction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want me to set up a stub project in recap-build/pyhive-metastore ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be awesome! set up the stub and I'll work on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened:

https://github.com/recap-build/pymetastore

You're set up as a maintainer. :)

cpard added 3 commits June 16, 2023 12:34
We need some helper code to parse the types we get back from the metastore through Thrift, the types are returned as strings and we need to parse them and figure out what type to create on our side. Things are a bit more complicated when we try to parse types with parameters.
getting closer to finalizing the type parser.
finished type parsing logic. Needs a lot of testing now to make sure it works as advertised. This will come next.
@cpard
Copy link
Author

cpard commented Jun 23, 2023

@criccomini I think this PR should be deleted, right?

@criccomini
Copy link
Contributor

Yea. I'll close it off.

@criccomini criccomini closed this Jun 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants