# Added documentation to describe interaction with external Hive Metastores #473

## Conversation
Codecov Report

|          | main   | #473   | +/-    |
|----------|--------|--------|--------|
| Coverage | 80.01% | 80.02% | +0.01% |
| Files    | 31     | 31     |        |
| Lines    | 3172   | 3174   | +2     |
| Branches | 609    | 609    |        |
| Hits     | 2538   | 2540   | +2     |
| Misses   | 486    | 486    |        |
| Partials | 148    | 148    |        |
> ### Current Considerations
> - Integration with external HMS is set up on individual clusters.
> - Theoretically we can integrate separate clusters in a workspace with different HMS repositories.
What do we need for this practically? Please describe.
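Practically, an external HMS is expressed as Spark configuration on each cluster (or in a cluster policy), which is why the integration is per-cluster. A hedged sketch of the relevant keys, assuming a JDBC-backed metastore; the host, database, user, and secret names below are placeholders, not working endpoints:

```python
# Spark conf keys that bind one cluster to an external Hive Metastore.
# All values are illustrative placeholders.
external_hms_conf = {
    # Hive client version and where to load its jars from
    "spark.sql.hive.metastore.version": "3.1.0",
    "spark.sql.hive.metastore.jars": "maven",
    # JDBC connection to the external metastore database
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://hms-host:3306/metastore",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "hms_user",
    # Reference a Databricks secret instead of a literal password
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "{{secrets/hms/password}}",
}

# AWS Glue as the metastore is a single flag instead of JDBC settings:
glue_conf = {"spark.databricks.hive.metastore.glueCatalog.enabled": "true"}
```

Because these settings live on each cluster independently, two clusters in one workspace can in principle point at two different metastores.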
> - Theoretically we can integrate separate clusters in a workspace with different HMS repositories.
> - In reality most customers use a single (internal or external) HMS within a workspace.
> - When migrating from an external HMS we have to consider that it is used by more than one workspace.
> - Integration with external HMS has to be set on all DB Warehouses together.
Can a warehouse configured for an external HMS still access the workspace-local HMS?
> - We should allow overriding cluster settings and the instance profile setting to accommodate novel settings.
>
> ### Challenges
> - We cannot support multiple HMS
But we can store raw Delta tables on DBFS.
> ### Challenges
> - We cannot support multiple HMS
> - Using an external HMS to persist UCX tables will break functionality for a second workspace using UCX
Please suggest a couple of options for addressing this.
For the inventory tables, if we use the workspace ID either in the schema name ("ucx-################") or as a column for uniqueness, then we don't have to worry about conflicts.
It is likely easier to introduce in the schema name, since our teardown scripts drop tables.
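A minimal sketch of that naming scheme, using the Python SDK's `WorkspaceClient.get_workspace_id()`; the `ucx_` prefix and exact format are assumptions for illustration (an underscore also sidesteps the quoting a hyphen would require in schema names):

```python
from databricks.sdk import WorkspaceClient

def inventory_schema_name() -> str:
    """Derive a per-workspace inventory schema name so that two
    workspaces sharing one external HMS never collide."""
    w = WorkspaceClient()
    # get_workspace_id() returns the numeric ID of the current workspace
    return f"ucx_{w.get_workspace_id()}"
```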
> ### Design Decisions
> - We should set up a single HMS for UCX
> - We should suggest copying the setup from an existing cluster/cluster policy
> - We shouldn't override the setup for the DB Warehouses (that may break functionality)
This is a requirement, not a design decision.
> - We should set up a single HMS for UCX
> - We should suggest copying the setup from an existing cluster/cluster policy
> - We shouldn't override the setup for the DB Warehouses (that may break functionality)
> - We should allow overriding cluster settings and the instance profile setting to accommodate novel settings.
This is a requirement, not a design decision.
> 1. Start the installer.
> 2. The installer looks for use of an external HMS by the workspace. We review cluster policies or DBSQL warehouse settings.
> 3. We alert the user that an external HMS is set and ask a YES/NO question on whether to set up the external HMS.
> 4. We alert the user if they opted for the external HMS and the DB Warehouses are not set for the external HMS.
Should be step 3, not 4 :)
Is the concern that a workspace may be connected to an external metastore on standard clusters, but the SQL Warehouse is not? If so, is this common?
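For step 2, a hedged sketch of how the installer could spot external HMS use in cluster policies via the Python SDK; the marker keys assume the metastore is configured either over JDBC or via the Glue flag:

```python
import json
from databricks.sdk import WorkspaceClient

# Spark conf markers that indicate an external HMS; cluster policy
# definitions namespace Spark settings under "spark_conf."
_HMS_MARKERS = (
    "spark_conf.spark.hadoop.javax.jdo.option.ConnectionURL",
    "spark_conf.spark.databricks.hive.metastore.glueCatalog.enabled",
)

def policies_with_external_hms(w: WorkspaceClient) -> list[str]:
    """Return the names of cluster policies that configure an external HMS."""
    matches = []
    for policy in w.cluster_policies.list():
        # A policy definition is a JSON document of constrained settings
        definition = json.loads(policy.definition or "{}")
        if any(marker in definition for marker in _HMS_MARKERS):
            matches.append(policy.name)
    return matches
```

A similar check over DBSQL warehouse data-access configuration would cover the warehouse side of step 2.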
> - We should allow overriding cluster settings and the instance profile setting to accommodate novel settings.
>
> ### Challenges
> - We cannot support multiple HMS
Could we not set up one job task for each unique HMS? Usually workspaces don't use many different HMS instances, but they are likely to have one external HMS and one local HMS.
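To gauge how feasible that is, a purely illustrative sketch that groups clusters by the metastore they point at, which is exactly the grouping one task per unique HMS would need:

```python
from collections import defaultdict
from databricks.sdk import WorkspaceClient

def clusters_by_metastore(w: WorkspaceClient) -> dict[str, list[str]]:
    """Group cluster names by the metastore they point at, so the
    installer could create one crawl task per unique HMS."""
    groups: dict[str, list[str]] = defaultdict(list)
    for cluster in w.clusters.list():
        conf = cluster.spark_conf or {}
        if conf.get("spark.databricks.hive.metastore.glueCatalog.enabled") == "true":
            key = "glue"
        else:
            # Clusters on the workspace-local HMS have no ConnectionURL override
            key = conf.get(
                "spark.hadoop.javax.jdo.option.ConnectionURL", "workspace-local"
            )
        groups[key].append(cluster.cluster_name)
    return groups
```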
> ### Challenges
> - We cannot support multiple HMS
> - Using an external HMS to persist UCX tables will break functionality for a second workspace using UCX
> - We should consider using a pattern similar to our integration testing to rename the target database, to allow persisting from multiple workspaces. For example, WS1 --> UCX_ABC, WS2 --> UCX_DEF.
If multiple workspaces are sharing the same external HMS, we should do the table crawl just once, but should do a separate grants crawl for each workspace.
Changing the code to only crawl the external HMS once is a premature optimization. This operation only takes a few minutes right now.
**Breaking changes** (existing installations need to reinstall UCX and re-run assessment jobs)

* Switched local group migration component to rename groups instead of creating backup groups ([#450](#450)).
* Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together ([#512](#512)).

**New features**

* Added support for the experimental Databricks CLI launcher ([#517](#517)).
* Added support for external Hive Metastores including AWS Glue ([#400](#400)).
* Added more views to assessment dashboard ([#474](#474)).
* Added rate limit for creating backup group to increase stability ([#500](#500)).
* Added deduplication for mount point list ([#569](#569)).
* Added documentation to describe interaction with external Hive Metastores ([#473](#473)).
* Added failure injection for job failure message propagation ([#591](#591)).
* Added uniqueness in the new warehouse name to avoid conflicts on installation ([#542](#542)).
* Added a global init script to collect Hive Metastore lineage ([#513](#513)).
* Added retry set/update permissions when possible and assess the changes in the workspace ([#519](#519)).
* Use `~/.ucx/state.json` to store the state of both dashboards and jobs ([#561](#561)).

**Bug fixes**

* Fixed handling for `OWN` table permissions ([#571](#571)).
* Fixed handling of keys with and without values ([#514](#514)).
* Fixed integration test failures related to concurrent group delete ([#584](#584)).
* Fixed issue with workspace listing process on None type `object_type` ([#481](#481)).
* Fixed missing group entitlement migration bug ([#583](#583)).
* Fixed entitlement application for account-level groups ([#529](#529)).
* Fixed assessment throwing an error when the owner of an object is empty ([#485](#485)).
* Fixed installer to migrate between different configuration file versions ([#596](#596)).
* Fixed cluster policy crawler to be aware of deleted policies ([#486](#486)).
* Improved error message for not null constraints violated ([#532](#532)).
* Improved integration test resiliency ([#597](#597), [#594](#594), [#586](#586)).
* Introduced safer access to workspace objects' properties ([#530](#530)).
* Mitigated permissions loss in Table ACLs by running appliers with a single thread ([#518](#518)).
* Running apply permission task before assessment should display a message ([#487](#487)).
* Split integration tests from blocking the merge queue ([#496](#496)).
* Support more than one dashboard per step ([#472](#472)).
* Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0 ([#505](#505)).
* Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0 ([#575](#575)).
Per our discussion.
Analysis of External HMS integration.