Ensure proper sequencing of view migrations #1157

ericvergnaud · 2024-03-28T16:48:44Z

Changes

add views_migrator and corresponding test cases

Linked issues

Resolves #1132

Functionality

added relevant user documentation
added new CLI command
modified existing command: databricks labs ucx ...
added a new workflow
modified existing workflow: ...
added a new table
modified existing table: ...

Tests

manually tested
added unit tests
added integration tests
verified on staging environment (screenshot attached)

…ergnaud/ucx into view-migrations-sequencing

…ergnaud/ucx into view-migrations-sequencing # Conflicts: # tests/unit/hive_metastore/tables/tables_and_views.json # tests/unit/hive_metastore/test_views.py computes valid sequence of direct views (no sequence required)

…ergnaud/ucx into view-migrations-sequencing

nfx · 2024-03-28T18:22:04Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+                if table.view_text is None:
+                    self.table_dependencies.add(table)
+                else:
+                    self.view_dependencies.add(table)


do we really need both fields? View can depend on both table and a view...

for sequencing, we're only concerned about the view dependencies, and we don't want to filter the entire list on each sequencing pass

and when debugging it's convenient to see both sets, hence why I keep table_dependencies

nfx · 2024-03-28T18:26:45Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+        self.crawler = crawler
+        self.result_view_list: list[ViewToMigrate] = []
+        self.result_tables_set: set[Table] = set()


Suggested change

self.crawler = crawler

self.result_view_list: list[ViewToMigrate] = []

self.result_tables_set: set[Table] = set()

self._crawler = crawler

self._result_view_list: list[ViewToMigrate] = []

self._result_tables_set: set[Table] = set()

need to enforce it in the styleguide :)

nfx · 2024-03-28T18:51:43Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+        self.view_dependencies = set()
+
+    def compute_dependencies(self, all_tables: dict[str, Table]):
+        if len(self.table_dependencies) + len(self.view_dependencies) == 0:


Suggested change

if len(self.table_dependencies) + len(self.view_dependencies) == 0:

if len(self.table_dependencies) + len(self.view_dependencies) > 0:

return

can we invert the condition to reduce nesting?

nfx · 2024-03-28T18:53:53Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+        self.table = table
+        self.table_dependencies = set()
+        self.view_dependencies = set()


Suggested change

self.table = table

self.table_dependencies = set()

self.view_dependencies = set()

self._table = table

self._table_dependencies = set()

self._view_dependencies = set()

nfx · 2024-03-28T18:57:03Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+            view.compute_dependencies(all_tables)
+            not_batched_yet = list(filter(lambda v: v not in self.result_tables_set, view.view_dependencies))
+            if len(not_batched_yet) == 0:
+                result.add(view)


Suggested change

view.compute_dependencies(all_tables)

not_batched_yet = list(filter(lambda v: v not in self.result_tables_set, view.view_dependencies))

if len(not_batched_yet) == 0:

result.add(view)

for dep in view.dependencies(all_tables):

if dep in self._result_tables_set:

continue

result.add(view)

this should be more readable. please hide compute_dependencies as private method there.

Are you sure your version is more readable ? It also adds the view more than once, not great...
Check latest version

nfx · 2024-03-28T19:00:35Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+        self.table_dependencies = set()
+        self.view_dependencies = set()
+
+    def compute_dependencies(self, all_tables: dict[str, Table]):


i'd prefer if this method is hidden and the only method we expose is to iterate over dependencies, while we lazily call this behind the cover.

nfx · 2024-03-28T19:01:22Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+        self.result_tables_set: set[Table] = set()
+
+    def sequence(self) -> list[Table]:
+        # sequencing is achieved using a very simple algorithm:


that's a neat algorithm. shouln't we integrate it with Index here to line up only the views with only migrated dependencies?..

Possibly, but I don't yet know enough details about the migration integration to provide an opinion. Happy to add that suggestion to a ticket.

@nfx , it's not neat, can improve a lot. Please have a look at my below comment.

codecov · 2024-03-29T08:38:30Z

Codecov Report

Attention: Patch coverage is 91.20879% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 89.80%. Comparing base (739c320) to head (679d028).
Report is 1 commits behind head on main.

Files	Patch %	Lines
...tabricks/labs/ucx/hive_metastore/views_migrator.py	91.01%	4 Missing and 4 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1157   +/-   ##
=======================================
  Coverage   89.80%   89.80%           
=======================================
  Files          61       62    +1     
  Lines        7249     7340   +91     
  Branches     1300     1318   +18     
=======================================
+ Hits         6510     6592   +82     
- Misses        475      479    +4     
- Partials      264      269    +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…ergnaud/ucx into view-migrations-sequencing

ericvergnaud · 2024-03-29T09:11:32Z

2 unit tests fail:
tests/unit/hive_metastore/test_mapping.py::test_mismatch_from_table_raises_exception
tests/unit/test_install.py::test_create_database
but they fail in main branch too, these failures are not related to any change in this PR

ganeshgirase · 2024-03-29T09:03:58Z

src/databricks/labs/ucx/hive_metastore/views_migrator.py

+        self.result_tables_set: set[Table] = set()
+
+    def sequence(self) -> list[Table]:
+        # sequencing is achieved using a very simple algorithm:


@nfx , it's not neat, can improve a lot. Please have a look at my below comment.

src/databricks/labs/ucx/hive_metastore/views_migrator.py

Co-authored-by: Ganesh Girase <ganeshgirase@gmail.com>

src/databricks/labs/ucx/hive_metastore/views_migrator.py

* Ensure proper sequencing of view migrations ([#1157](#1157)). In this release, we have introduced a `views_migrator` module and corresponding test cases to ensure proper sequencing of view migrations, addressing issue [#1132](#1132). The module contains two main classes: `ViewToMigrate` and `ViewsMigrator`. The former is responsible for parsing a view's SQL text and identifying its dependencies, while the latter sequences views based on their dependencies. The commit also adds a new method, `__hash__`, to the Table class, which returns a hash value of the key of the table, improving the handling of Table objects. Additionally, we have added unit tests and verified the changes on a staging environment. We have also introduced a new file `tables_and_views.json` for unit testing and added a `views_migrator` module that takes a `TablesCrawler` object and returns a sequence of tables (views) that need to be migrated in the correct order. The commit addresses various scenarios such as no views, direct views, indirect views, deep indirect views, invalid SQL, invalid SQL tables, and circular view references. This release is focused on improving the sequencing of view migrations and is accompanied by appropriate tests. * Experimental support for scanning Delta Tables inside Mount Points ([#1095](#1095)). This commit introduces experimental support for scanning Delta Tables located inside mount points using a new `TablesInMounts` crawler. Users can now scan specific mount points using the `--include-mounts` flag and include Parquet files in the scan results with the `--include-parquet-files` flag. Additionally, the `--filter-paths` flag allows for filtering paths in a mount point and the `--max-depth` flag (currently unimplemented) will filter at a specific sub-folder depth in future development. The project dependencies have been updated to use `databricks-labs-lsql~=0.3.0`. This new feature provides a more granular and flexible way to scan Delta Tables, making the project more user-friendly and adaptable to various use cases. * Fixed `NULL` values in `ucx.views.table_format` to have `UNKNOWN` value instead ([#1156](#1156)). This commit includes a fix for handling NULL values in the `table_format` column of Views in the `ucx.views.table_format` module. Previously, NULL values were displayed as-is, but now they will be replaced with the string "UNKNOWN". This change is part of the fix for issue [#115](#115) * Fixing run_workflow functionality for better error handling ([#1159](#1159)). In this release, the `run_workflow` method in the `workflows.py` file has been updated to improve error handling by waiting for the job to terminate or skip before raising an error, allowing for a more detailed error message to be generated. A new method, `job_initial_run`, has been added to initiate a job run and return the run ID, raising a `NotFound` exception if the job run is not found. The `run_workflow` functionality in the `WorkflowsInstall` module has also been enhanced to handle unexpected error types and improve overall error handling during the installation of products. New test cases have been added and existing ones updated to check how the code handles errors when the run ID is not found or when an `OperationFailed` exception is raised during the installation process. These changes improve the robustness and stability of the system. * Use experimental Permissions Migration API also for Legacy Table ACLs ([#1161](#1161)). This release introduces several changes to the group permissions migration functionality and associated tests. The experimental Permissions Migration API is now being utilized for Legacy Table ACLs, which has led to the removal of the verification step from the experimental group migration job. The `TableAclSupport` import and class have been removed, as they are no longer needed. A new `apply_to_renamed_groups` method has been added for production usage, and a `apply_to_groups_with_different_names` method has been added for integration testing, both of which are part of the Permissions Migration API. Additionally, two tests have been added to support the experimental permissions migration for a group with the same name in the workspace and account. The `permission_manager` parameter has been removed from several test functions in the `test_generic.py` file and replaced with the `MigrationState` class, which is used directly with the `WorkspaceClient` object to apply permissions to groups with different names. The `test_some_entitlements` function in the `test_scim.py` file has also been updated to use the `MigratedGroup` class and the `MigrationState` class's `apply_to_groups_with_different_names` method. Finally, new tests for the Permissions Migration API have been added to the `test_tacl.py` file in the `tests/integration/workspace_access` directory to verify the behavior of the Permissions Migration API when migrating different grants.

ericvergnaud added 9 commits March 27, 2024 17:04

test scaffolding

3f78d98

test scaffolding

6435961

sequences 0 and 1 views

1cbb04f

Merge branch 'view-migrations-sequencing' of https://github.com/ericv…

28e7453

…ergnaud/ucx into view-migrations-sequencing

fix formatting

e2fd873

drop extraneous data

2ec7215

full functional tests

dcc50c9

manage error cases

cf474ed

ericvergnaud requested review from a team and dipankarkush-db March 28, 2024 16:48

ericvergnaud added 3 commits March 28, 2024 17:54

revert unwanted changes

e1290a2

revert unwanted changes

db888a1

prevent infinite loop

67e5c78

ericvergnaud changed the title ~~[WIP] View migrations sequencing~~ [WIP] Ensure proper sequencing of view migrations Mar 28, 2024

ericvergnaud added 3 commits March 28, 2024 18:27

Merge branch 'main' into view-migrations-sequencing

4c755ba

document sequencing algorithm

5356814

Merge branch 'view-migrations-sequencing' of https://github.com/ericv…

b33c3e7

…ergnaud/ucx into view-migrations-sequencing

nfx reviewed Mar 28, 2024

View reviewed changes

rename fields

be076a4

ericvergnaud added 4 commits March 29, 2024 09:50

align code style

671453d

fix pylint issues

17f5419

Merge branch 'databrickslabs:main' into view-migrations-sequencing

6bf9af5

Merge branch 'view-migrations-sequencing' of https://github.com/ericv…

f456f0b

…ergnaud/ucx into view-migrations-sequencing

ganeshgirase suggested changes Mar 29, 2024

View reviewed changes

ericvergnaud and others added 2 commits March 29, 2024 10:24

Update src/databricks/labs/ucx/hive_metastore/views_migrator.py

2bd33f0

Co-authored-by: Ganesh Girase <ganeshgirase@gmail.com>

fix indentation broken by GH

6a33c1c

ericvergnaud changed the title ~~[WIP] Ensure proper sequencing of view migrations~~ Ensure proper sequencing of view migrations Mar 29, 2024

ericvergnaud requested a review from nfx March 29, 2024 09:35

nfx reviewed Mar 29, 2024

View reviewed changes

src/databricks/labs/ucx/hive_metastore/views_migrator.py Show resolved Hide resolved

nfx approved these changes Mar 29, 2024

View reviewed changes

ericvergnaud had a problem deploying to account-admin March 29, 2024 10:18 — with GitHub Actions Failure

ericvergnaud added 2 commits March 29, 2024 12:02

fix pylint warning

83ef5fe

fix typo

679d028

nfx merged commit 8a76764 into databrickslabs:main Mar 29, 2024
6 of 7 checks passed

ericvergnaud deleted the view-migrations-sequencing branch March 29, 2024 11:16

ericvergnaud had a problem deploying to account-admin March 29, 2024 11:40 — with GitHub Actions Failure

ericvergnaud mentioned this pull request Mar 29, 2024

[FEATURE]: Migrate views correctly #1172

Closed

1 task

nfx mentioned this pull request Mar 29, 2024

Release v0.21.0 #1185

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure proper sequencing of view migrations #1157

Ensure proper sequencing of view migrations #1157

ericvergnaud commented Mar 28, 2024

nfx Mar 28, 2024

ericvergnaud Mar 29, 2024

ericvergnaud Mar 29, 2024

nfx Mar 28, 2024

ericvergnaud Mar 29, 2024

nfx Mar 28, 2024

ericvergnaud Mar 29, 2024

nfx Mar 28, 2024

ericvergnaud Mar 29, 2024

nfx Mar 28, 2024

ericvergnaud Mar 29, 2024

nfx Mar 28, 2024

ericvergnaud Mar 29, 2024

nfx Mar 28, 2024

ericvergnaud Mar 29, 2024

ganeshgirase Mar 29, 2024

codecov bot commented Mar 29, 2024 •

edited

Loading

ericvergnaud commented Mar 29, 2024

ganeshgirase Mar 29, 2024

	if len(self.table_dependencies) + len(self.view_dependencies) == 0:
	if len(self.table_dependencies) + len(self.view_dependencies) > 0:
	return

Ensure proper sequencing of view migrations #1157

Ensure proper sequencing of view migrations #1157

Conversation

ericvergnaud commented Mar 28, 2024

Changes

Linked issues

Functionality

Tests

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Mar 29, 2024 • edited Loading

Codecov Report

ericvergnaud commented Mar 29, 2024

Choose a reason for hiding this comment

codecov bot commented Mar 29, 2024 •

edited

Loading