HMS Support for Recap #260
Conversation
Ya, this looks like a good start. My main nit is the package structure. I left a few dependency comments and a question as well. I've never used Thrift, so I'm still feeling my way around it a bit.
pyproject.toml
Outdated
@@ -19,6 +19,8 @@ dependencies = [
     "frictionless[json,parquet]>=5.6.3",
     "tomli>=2.0.1",
     "genson>=1.2.2",
+    "thrift>=0.16.0",
Feel like this should be in a hive-metastore group.
Right! I'll clean up the groups and continue working on adding the rest of the functionality. This was a very quick PR, just to share the generated client with you.
pyproject.toml
Outdated
@@ -19,6 +19,8 @@ dependencies = [
     "frictionless[json,parquet]>=5.6.3",
     "tomli>=2.0.1",
     "genson>=1.2.2",
+    "thrift>=0.16.0",
+    "psycopg2-binary>=2.9.6",
psycopg2 is already in the dbs group below.
recap/hms/hms.py
Outdated
def get_fields(self, database, table):
    return self.client.get_fields(database, table)

print("hello!")
rm
yeah, sorry for that! Need some cleaning!
recap/hms/hms.py
Outdated
from thrift.transport import TTransport, TSocket
from thrift.protocol.TBinaryProtocol import TBinaryProtocol


class _HMS:
Was this autogenerated?
The _HMS? No, I added it as a way to reduce the surface area of the Thrift client that is exposed through Recap. The generated client has a lot of functionality that is not necessarily needed, e.g. creating a catalog or operating on partitions.
That makes total sense, I'll update the package structure.
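For reference, here's a minimal sketch of the kind of thin wrapper being described, assuming a Thrift-generated ThriftHiveMetastore client module is importable; the import path, port, and method selection are illustrative, not the code in this PR:

from thrift.transport import TTransport, TSocket
from thrift.protocol.TBinaryProtocol import TBinaryProtocol

# Illustrative import of the Thrift-generated metastore client; the real
# module path depends on how the Thrift stubs were generated.
from hive_metastore import ThriftHiveMetastore


class _HMS:
    """Thin wrapper exposing only the metastore calls Recap needs,
    hiding the rest of the generated Thrift client's surface area."""

    def __init__(self, host="localhost", port=9083):
        socket = TSocket.TSocket(host, port)
        self._transport = TTransport.TBufferedTransport(socket)
        self._client = ThriftHiveMetastore.Client(TBinaryProtocol(self._transport))
        self._transport.open()

    def get_table(self, database, table):
        return self._client.get_table(database, table)

    def get_fields(self, database, table):
        return self._client.get_fields(database, table)

    def close(self):
        self._transport.close()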
added hive data types and basic hive metadata objects
added first implementation of fetching hms table metadata in thrift and return it in the internal representation
LIST = "LIST" | ||
UNION = "UNION" | ||
|
||
class HType: |
Curious why you're creating a new HiveType vs. using recap.types directly?
That's a great question and something that was bothering me and still does. The Thrift types are too low level in many cases, e.g. the type you get back for PrincipalType is an int that you then use to instantiate the actual type.
It feels to me that there's a lot of low-level complexity there that could/should be hidden from anyone who's building on top of the Hive Metastore. That's why I added this modeling layer for the types. This is also inspired by how Trino does it in their implementation, btw; I trust their implementation because, as a federated query engine, they always needed to figure out the right abstractions.
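For illustration only, a minimal sketch of what such a modeling layer could look like; the class shape and field names here are assumptions for the example, not the actual code in this PR:

from dataclasses import dataclass, field


@dataclass(frozen=True)
class HType:
    """Illustrative Hive-side type model: a base type name plus optional parameters."""

    # Base type name, e.g. "int", "varchar", "map".
    name: str
    # Type parameters, e.g. [255] for varchar(255), or nested HTypes for map/array.
    params: list = field(default_factory=list)

    def __str__(self):
        if not self.params:
            return self.name
        return f"{self.name}({', '.join(str(p) for p in self.params)})"


# Example: str(HType("varchar", [255])) == "varchar(255)"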
Also, I might be misinterpreting something here in terms of the Recap design. We have Readers and converters. I operated with the assumption that a reader provides access to a "domain" and then there's a converter that turns that into the Recap type system. Is this a wrong assumption?
> Also, I might be misinterpreting something here in terms of the Recap design. We have Readers and converters. I operated with the assumption that a reader provides access to a "domain" and then there's a converter that turns that into the Recap type system. Is this a wrong assumption?
More or less.
- Converter: Given a schema, returns Recap types. The formats are usually text-based (proto, avro, json schema, etc).
- Reader: Reads a schema from a registry/catalog and returns a StructType.
Readers usually use converters to take the schema they read and convert it to a Recap StructType before returning it.
The reason for the distinction is that many different registries/catalogs all share the same schema formats (proto, avro, json schema, thrift). This way the DatahubReader, ConfluentRegistryReader, etc. can all use the same converters.
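As a rough illustration of that split (the class names and method signatures below are assumptions, not Recap's actual API):

# Hypothetical converter: format-specific parsing into Recap types.
class JsonSchemaConverter:
    def convert(self, schema_str):
        # Parse a text-based schema (JSON schema here) and return a StructType.
        ...


# Hypothetical reader: knows how to fetch schemas from one registry/catalog.
class ConfluentRegistryReader:
    def __init__(self, registry_client, converter):
        self.registry_client = registry_client
        self.converter = converter

    def struct(self, subject):
        # Fetch the raw schema text from the registry (hypothetical client call)...
        schema_str = self.registry_client.get_latest_schema(subject)
        # ...and delegate format-specific parsing to the shared converter.
        return self.converter.convert(schema_str)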
TBH, I think your HType approach is fine. I've been noodling about breaking the ThriftHiveMetastore stuff you're doing out into a separate pypi package, repo, and project anyway. It seems useful beyond Recap to have a Python-based Hive Metastore API that uses the Thrift interface.
Basically, this library:
https://github.com/quintoandar/hive-metastore-client
But with all the Thrift APIs supported (what you're doing).
Yep, I agree that there's value in breaking this into a separate package. Let's chat about how it should be done and I'm happy to take that direction.
Want me to set up a stub project in recap-build/pyhive-metastore?
That would be awesome! Set up the stub and I'll work on that.
We need some helper code to parse the types we get back from the metastore through Thrift. The types are returned as strings, and we need to parse them to figure out what type to create on our side. Things get a bit more complicated when we try to parse types with parameters.
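To make that concrete, here is a rough sketch of what such a parser might do for parameterized type strings; this is an illustration of the problem, not the parser implemented in this PR (struct<field:type,...> fields would need extra handling for field names):

import re


def _split_top_level(s):
    """Split a comma-separated parameter list, ignoring commas nested inside <> or ()."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    parts.append(s[start:].strip())
    return parts


def parse_hive_type(type_str):
    """Parse a Hive type string like 'decimal(10,2)' or 'map<string,int>'
    into a (name, params) tuple, recursing into nested type parameters."""
    type_str = type_str.strip()
    match = re.fullmatch(r"(\w+)\s*[<(](.*)[>)]", type_str)
    if not match:
        return (type_str, [])  # simple type, e.g. "int" or "string"
    name, inner = match.group(1), match.group(2)
    if name in ("decimal", "varchar", "char"):
        # Numeric parameters, e.g. decimal(10,2) or varchar(255).
        return (name, [int(p) for p in _split_top_level(inner)])
    # Complex types carry nested type parameters, e.g. map<string,int>.
    return (name, [parse_hive_type(p) for p in _split_top_level(inner)])


# Example: parse_hive_type("map<string,array<int>>")
# -> ("map", [("string", []), ("array", [("int", [])])])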
getting closer to finalizing the type parser.
finished type parsing logic. Needs a lot of testing now to make sure it works as advertised. This will come next.
@criccomini I think this PR should be deleted, right?
Yea. I'll close it off.
@criccomini hey dude, take a look at this and let me know what you think. If that approach looks good to you, I can continue.