Give all access connectors Storage Blob Data Contributor role #1425

Merged
nfx merged 1 commit into main from feature/give-access-connectors-appropiate-access
Apr 17, 2024

Conversation

JCZuurmond
Contributor

@JCZuurmond JCZuurmond commented Apr 17, 2024

Changes

Give all access connectors STORAGE_BLOB_DATA_CONTRIBUTOR access.

Finer-grained access is configured within Unity Catalog, so we give every access connector (one per storage account) the highest data access, i.e. Storage Blob Data Contributor.
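
For illustration only, a role assignment like the one described above can be expressed as an Azure Resource Manager `roleAssignments` request. The sketch below only builds the resource path and request body; it is not the UCX implementation. The helper name `build_role_assignment` and all sample IDs are hypothetical. The GUID is Azure's built-in role definition ID for Storage Blob Data Contributor.

```python
import uuid

# Built-in Azure role definition ID for "Storage Blob Data Contributor".
STORAGE_BLOB_DATA_CONTRIBUTOR = "ba92f5b4-2d11-453d-a403-e96b0029c9fe"


def build_role_assignment(subscription_id: str, storage_account_id: str, principal_id: str):
    """Return (resource path, request body) for an ARM role assignment.

    Hypothetical helper: `storage_account_id` is the full resource ID of the
    storage account (the assignment scope) and `principal_id` is the object ID
    of the access connector's managed identity.
    """
    # Derive the assignment name deterministically so repeated runs target
    # the same assignment instead of creating a new one each time.
    assignment_name = str(
        uuid.uuid5(uuid.NAMESPACE_URL, f"{storage_account_id}|{principal_id}")
    )
    path = (
        f"{storage_account_id}/providers/Microsoft.Authorization"
        f"/roleAssignments/{assignment_name}"
    )
    body = {
        "properties": {
            "roleDefinitionId": (
                f"/subscriptions/{subscription_id}/providers"
                f"/Microsoft.Authorization/roleDefinitions/{STORAGE_BLOB_DATA_CONTRIBUTOR}"
            ),
            "principalId": principal_id,
            # Managed identities are assigned roles as service principals.
            "principalType": "ServicePrincipal",
        }
    }
    return path, body
```

Sending the body with an authenticated `PUT` to the management endpoint is left out on purpose, since authentication and API version depend on the environment.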

Linked issues

Resolves #1383

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs ucx ...
  • added a new workflow
  • modified existing workflow: ...
  • added a new table
  • modified existing table: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)

@JCZuurmond JCZuurmond marked this pull request as ready for review April 17, 2024 08:31
@JCZuurmond JCZuurmond requested review from a team and dumitrac April 17, 2024 08:31
Contributor

@nfx nfx left a comment

lgtm

@nfx nfx changed the title Give all access connectors STORAGE_BLOB_DATA_CONTRIBUTOR access Give all access connectors Storage Blob Data Contributor role Apr 17, 2024
@nfx nfx merged commit 38e84e5 into main Apr 17, 2024
6 of 7 checks passed
@nfx nfx deleted the feature/give-access-connectors-appropiate-access branch April 17, 2024 14:48
ericvergnaud added a commit to ericvergnaud/ucx that referenced this pull request Apr 18, 2024
* main:
  Give all access connectors `Storage Blob Data Contributor` role (databrickslabs#1425)
  Addressed a bug with AWS UC Role Update. Adding unit tests. (databrickslabs#1429)
  Added integration tests with external HMS & Glue (databrickslabs#1408)
  Modified update existing role to amend the AssumeRole statement rather than rewriting it. (databrickslabs#1423)
  Extend service principal migration with option to create access connectors with managed identity for each storage account (databrickslabs#1417)
  A notebook linter to detect DBFS references within notebook cells (databrickslabs#1393)
  Remove `StaticTablesCrawler` in favor of created database tracking  (databrickslabs#1392)
  Cleaned up integration test suite (databrickslabs#1422)

# Conflicts:
#	tests/unit/source_code/test_notebook_linter.py
@nfx nfx mentioned this pull request Apr 26, 2024
nfx added a commit that referenced this pull request Apr 26, 2024
* A notebook linter to detect DBFS references within notebook cells ([#1393](https://github.com/databrickslabs/ucx/issues/1393)). A new linter identifies references to Databricks File System (DBFS) mount points or folders within SQL and Python cells of notebooks, raising Advisory or Deprecated alerts when detected. This feature, resolving issue [#1108](https://github.com/databrickslabs/ucx/issues/1108), improves maintainability by discouraging DBFS usage and hard-coded DBFS paths. The linter parses the code and searches for Table elements within statements, raising warnings when DBFS references are found. Implementation changes include updates to the `NotebookLinter` class, a new `from_source` class method, and an `original_offset` argument in the `Cell` class. The linter now also supports the `databricks` dialect for SQL code parsing.
* Added CLI commands to trigger table migration workflow ([#1511](https://github.com/databrickslabs/ucx/issues/1511)). A new `migrate_tables` command has been added to the 'databricks.labs.ucx.cli' module, which triggers the `migrate-tables` workflow and, optionally, the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate-tables` workflow is responsible for managing table migrations, while the `migrate-external-hiveserde-tables-in-place-experimental` workflow handles migrations for external hiveserde tables. The new `What` class from the 'databricks.labs.ucx.hive_metastore.tables' module is used to identify hiveserde tables. If hiveserde tables are detected, the user is prompted to confirm running the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate_tables` command requires a WorkspaceClient and Prompts objects and accepts an optional WorkspaceContext object, which is set to the WorkspaceContext of the WorkspaceClient if not provided. Additionally, a new `migrate_external_hiveserde_tables_in_place` command has been added which will run the `migrate-external-hiveserde-tables-in-place-experimental` workflow if it finds any hiveserde tables, making it easier to manage table migrations from the command line.
* Added CSV, JSON and include path in mounts ([#1329](https://github.com/databrickslabs/ucx/issues/1329)). In this release, the TablesInMounts function has been enhanced to support CSV and JSON file formats, along with the existing Parquet and Delta table formats. The new `include_paths_in_mount` parameter has been introduced, enabling users to specify a list of paths to crawl within all mounts. The WorkspaceConfig class in the config.py file has been updated to accommodate these changes. Additionally, a new `_assess_path` method has been introduced to assess the format of a given file and return a `TableInMount` object accordingly. Several existing methods, such as `_find_delta_log_folders`, `_is_parquet`, `_is_csv`, `_is_json`, and `_path_is_delta`, have been updated to reflect these improvements. Furthermore, two new unit tests, `test_mount_include_paths` and `test_mount_listing_csv_json`, have been added to ensure the proper functioning of the TablesInMounts function with the new file formats and the `include_paths_in_mount` parameter. These changes aim to improve the functionality and flexibility of the TablesInMounts library, allowing for more precise crawling and identification of tables based on specific file formats and paths.
* Added CTAS migration workflow for external tables cannot be in place migrated ([#1510](https://github.com/databrickslabs/ucx/issues/1510)). In this release, we have added a new CTAS (Create Table As Select) migration workflow for external tables that cannot be migrated in-place. This feature includes a `MigrateExternalTablesCTAS` class with three tasks to migrate non-SYNC supported and non-HiveSerde external tables, migrate HiveSerde tables, and migrate views from the Hive Metastore to the Unity Catalog. We have also added new methods for managed and external table migration, deprecated old methods, and added a new test function to ensure proper CTAS migration for external tables using HiveSerDe. This change also introduces a new JSON file for external table configurations and a mock backend to simulate the Hive Metastore and test the migration process. Overall, these changes improve the migration capabilities for external tables and ensure a more flexible and reliable migration process.
* Added Python linter for table creation with implicit format ([#1435](https://github.com/databrickslabs/ucx/issues/1435)). A new linter has been added to advise on implicit table formats when the `writeTo`, `table`, `insertInto`, or `saveAsTable` methods are invoked without an explicit format specified in the same chain of calls. This matters because the default table format changed from `parquet` to `delta` in Databricks Runtime (DBR) v8.0. The linter, implemented in `table_creation.py`, uses reusable AST utilities from `python_ast_util.py`; it only advises and does not fix the code automatically. The linter skips linting when a DBR version of 8.0 or higher is passed, as the change in default format only affects code written for versions prior to 8.0. Unit tests have been added for both files as part of the code migration workflow.
* Added Support for Migrating Table ACL of Interactive clusters using SPN ([#1077](https://github.com/databrickslabs/ucx/issues/1077)). This change introduces support for migrating table Access Control Lists (ACLs) of interactive clusters using a Service Principal Name (SPN) for Azure Databricks environments in the UCX project. It includes modifications to the `hive_metastore` and `workspace_access` modules, as well as the addition of new classes, methods, and import statements for handling ACLs and grants. This feature enables more secure and granular control over table permissions when using SPN authentication for interactive clusters in Azure.
* Added Support for migrating Schema/Catalog ACL for Interactive cluster ([#1413](https://github.com/databrickslabs/ucx/issues/1413)). This commit adds support for migrating schema and catalog ACLs for interactive clusters, specifically for AWS and Azure, with partial fixes for issues [#1192](https://github.com/databrickslabs/ucx/issues/1192) and [#1193](https://github.com/databrickslabs/ucx/issues/1193). The changes identify and filter database ACL grants, create mappings from Hive metastore schema to Unity Catalog schema and catalog, and replace Hive metastore actions with equivalent Unity Catalog actions for both schema and catalog. External location permission is not included in this commit and will be addressed separately. New methods for creating mappings, updating principal ACLs, and getting catalog schema grants have been added, and existing functionalities have been modified to handle both AWS and Azure. The code has undergone manual testing and passed unit and integration tests. The changes are targeted towards software engineers who adopt the project.
* Added `databricks labs ucx logs` command ([#1350](https://github.com/databrickslabs/ucx/issues/1350)). A new command, 'databricks labs ucx logs', has been added to the open-source library to enhance logging and debugging capabilities. This command allows developers and administrators to view logs from the latest job run or specify a particular workflow name to display its logs. By default, logs with levels of INFO, WARNING, and ERROR are shown, but the --debug flag can be used for more detailed DEBUG logs. This feature utilizes the relay_logs method from the deployed_workflows object in the WorkspaceContext class and addresses issue [#1282](https://github.com/databrickslabs/ucx/issues/1282). The addition of this command aims to improve the usability and maintainability of the framework, making it easier for users to diagnose and resolve issues.
* Added check for DBFS mounts in SQL code ([#1351](https://github.com/databrickslabs/ucx/issues/1351)). A new feature has been introduced to check for Databricks File System (DBFS) mounts within SQL code. The `dbfsqueries.py` file in the `databricks/labs/ucx/source_code` directory now includes a function that verifies the presence of DBFS mounts in SQL queries and returns appropriate messages. The `__init__` method of the `Languages` class has been updated to incorporate a new class, `FromDbfsFolder`, which replaces the existing `from_table` linter with a new linter, `DBFSUsageLinter`, for handling DBFS usage in SQL code. In addition, new methods in the DBFS usage linter check for deprecated DBFS mounts in SQL code and return deprecation warnings as needed. These enhancements make the handling of DBFS mounts in SQL-based operations more robust.
* Added check for circular view dependency ([#1502](https://github.com/databrickslabs/ucx/issues/1502)). A circular view dependency check has been implemented to prevent issues caused by circular dependencies in views. If any circular dependency is encountered during migration, a `ValueError` with an error message is raised. The existing `_next_batch` method has been changed, and two new methods, `_check_circular_dependency` and `_get_view_instance`, have been introduced. The tests in `test_views_sequencer.py` now include a new test for chained circular dependencies (A->B, B->C, C->A) and an update to the existing circular view dependency test. The `tables_and_views.json` file has been updated with a new view `v12` that depends on `v11`, creating a circular dependency. The changes are covered by the added unit tests.
* Added commands for metastores listing & assignment ([#1489](https://github.com/databrickslabs/ucx/issues/1489)). This commit introduces new commands for handling metastores in the Databricks Labs Unity Catalog (UCX) tool, which enables more efficient management of metastores. The `databricks labs ucx assign-metastore` command automatically assigns a metastore to a specified workspace when possible, while the `databricks labs ucx show-all-metastores` command displays all possible metastores that can be assigned to a workspace. These changes include new methods for handling metastores in the account and workspace classes, as well as new user documentation, manual testing, and unit tests. The new functionality is added to improve the usability and efficiency of the UCX tool in handling metastores. Additional information on the UCX metastore commands is provided in the README.md file.
* Added functionality to migrate external tables using Create Table (No Sync) ([#1432](https://github.com/databrickslabs/ucx/issues/1432)). A new feature has been implemented for migrating external tables in Databricks' Hive metastore using the "Create Table (No Sync)" method. This feature includes the addition of two new methods, `_migrate_non_sync_table` and `_get_create_in_place_sql`, for handling migration and SQL query generation. The existing methods `_migrate_dbfs_root_table` and `_migrate_acl` have also been updated. A test case has been added to demonstrate migration of external tables while preserving their location and properties. This new functionality provides more flexibility in managing migrations for specific use cases. The SQL parsing library sqlglot has been utilized to replace the current table name with the updated catalog and change the CREATE statement to CREATE IF NOT EXISTS. This increases the efficiency and security of migrating external tables in the Databricks' Hive metastore.
* Added initial version of account-level installer ([#1339](https://github.com/databrickslabs/ucx/issues/1339)). A new account-level installer has been added to the UCX library, allowing account administrators to install UCX on all workspaces within an account in a single operation. The installer authenticates to the account, prompts the user for configuration of the first workspace, and then runs the installation and offers to repeat the process for all remaining workspaces. This is achieved through the creation of a new `prompt_for_new_installation` method which saves user responses to a new `InstallationConfig` data class, allowing for reuse in other workspaces. The existing `databricks labs install ucx` command now supports account-level installation when the `UCX_FORCE_INSTALL` environment variable is set to 'account'. The changes have been manually tested and include updates to documentation and error handling for `PermissionDenied`, `NotFound`, and `ValueError` exceptions. Additionally, a new `AccountInstaller` class has been added to manage the installation process at the account level.
* Added linting for DBFS usage ([#1341](https://github.com/databrickslabs/ucx/issues/1341)). A new linter, `DBFSUsageLinter`, has been added to check for deprecated file system paths in Python code, specifically for Databricks File System (DBFS) usage. Implemented as part of the `databricks.labs.ucx.source_code` package in the `languages.py` file, this linter defines a visitor, `DetectDbfsVisitor`, that detects file system paths in the code and checks them against a list of known deprecated paths. If a match is found, it creates a Deprecation or Advisory object with information about the deprecated code, including the line number and column offset, and adds it to a list. This feature will assist in identifying and removing deprecated file system paths from the codebase.
* Added log task to parse logs and store the logs in the ucx database ([#1272](https://github.com/databrickslabs/ucx/issues/1272)). A new log task has been added to parse logs and store them in the ucx database, added as a log crawler task to all workflows after other tasks have completed. The LogRecord has been updated to include all necessary fields, and logs below a certain minimum level will no longer be stored. A new CLI command to retrieve errors and warnings from the latest workflow run has been added, while existing commands and workflows have been modified. User documentation has been updated, and new methods have been added for log parsing and storage. A new table called `logs` has been added to the database, and unit and integration tests have been added to ensure functionality. This change also resolves issues [#1148](https://github.com/databrickslabs/ucx/issues/1148) and [#1283](https://github.com/databrickslabs/ucx/issues/1283), with modifications to existing classes such as RuntimeContext, TaskRunWarningRecorder, and LogRecord, and the addition of new classes and methods including HiveMetastoreLineageEnabler and LogRecord in the logs.py file. The deploy_schema function has been updated to include the new table, and the existing command `databricks labs ucx` has been modified to accommodate the new log functionality. Existing workflows have been updated and a new workflow has been added, all of which are tested through unit tests, integration tests, and manual testing. The `TaskLogger` class and `TaskRunWarningRecorder` class are used to log and record task run data, with the `parse_logs` method used to parse log files into partial log records, which are then used to create snapshot rows in the `logs` table.
* Added migration for non delta dbfs tables using Create Table As Select (CTAS). Convert such tables to Delta tables ([#1434](https://github.com/databrickslabs/ucx/issues/1434)). In this release, we've developed new methods to migrate non-Delta DBFS root tables to managed Delta tables, enhancing compatibility with various table formats and configurations. We've added support for safer SQL statement generation in our Create Table As Select (CTAS) functionality and incorporated new creation methods. Additionally, we've introduced grant assignments during the migration process and updated integration tests. The changes include the addition of a `TablesMigrator` class with an updated `migrate_tables` method, a new `PrincipalACL` parameter, and the `test_dbfs_non_delta_tables_should_produce_proper_queries` function to test the migration of non-Delta DBFS tables to managed Delta tables. These improvements promote safer CTAS functionality and expanded compatibility for non-Delta DBFS root tables.
* Added support for %pip cells ([#1401](https://github.com/databrickslabs/ucx/issues/1401)). A new cell type, %pip, has been introduced to the notebook interface, allowing for the execution of pip commands within the notebook. The new class, PipCell, has been added with several methods, including is_runnable, build_dependency_graph, and migrate_notebook_path, enabling the notebook interface to recognize and handle pip cells differently from other cell types. This allows for the installation of Python packages directly within a notebook setting, enhancing the notebook environment and providing users with the ability to dynamically install necessary packages as they work. The new sample notebook file demonstrates the installation of a package using the %pip install command. The implementation includes modifying the notebook runtime to recognize and execute %pip cells, and installing packages in a manner consistent with standard pip installation processes. Additionally, a new tuple, PIP_NOTEBOOK_SAMPLE, has been added to the existing test notebook sample tuple list, enabling testing the handling of %pip cells during notebook splitting.
* Added support for %sh cells ([#1400](https://github.com/databrickslabs/ucx/issues/1400)). A new `SHELL` CellLanguage has been implemented to support %sh cells, enabling the execution of shell commands directly within the notebook interface. This enhancement, addressing issue [#1400](https://github.com/databrickslabs/ucx/issues/1400) and linked to [#1399](https://github.com/databrickslabs/ucx/issues/1399) and [#1202](https://github.com/databrickslabs/ucx/issues/1202), streamlines the process of running shell scripts in the notebook, eliminating the need for external tools. The new SHELL_NOTEBOOK_SAMPLE tuple, part of the updated test suite, demonstrates the feature's functionality with a shell cell, while the new methods manage the underlying mechanics of executing these shell commands. These changes not only extend the platform's capabilities by providing built-in support for shell commands but also improve productivity and ease-of-use for teams relying on shell commands as part of their data processing and analysis pipelines.
* Added support for migrating Table ACL for interactive cluster in AWS using Instance Profile ([#1285](https://github.com/databrickslabs/ucx/issues/1285)). This change adds support for migrating table access control lists (ACLs) for interactive clusters in AWS using an Instance Profile. A new method `get_iam_role_from_cluster_policy` has been introduced in the `AwsACL` class, which replaces the static method `_get_iam_role_from_cluster_policy`. The `create_uber_principal` method now uses this new method to obtain the IAM role name from the cluster policy. Additionally, the project now includes AWS Role Action and AWS Resource Permissions to handle permissions for migrating table ACLs for interactive clusters in AWS. New methods and classes have been added to support AWS-specific functionality and handle AWS instance profile information. Two new tests have been added to tests/unit/test_cli.py to test various scenarios for interactive clusters with and without ACL in AWS. A new argument `is_gcp` has been added to WorkspaceContext to differentiate between Google Cloud Platform and other cloud providers.
* Added support for views in `table-migration` workflow ([#1325](https://github.com/databrickslabs/ucx/issues/1325)). A new `MigrationStatus` class has been added to track the migration status of tables and views in a Hive metastore, and a `MigrationIndex` class has been added to check if a table or view has been migrated or not. The `MigrationStatusRefresher` class has been updated to use a new approach for migrating tables and views, and is now responsible for refreshing the migration status of tables and indexing it using the `MigrationIndex` class. A `ViewsMigrationSequencer` class has also been introduced to sequence the migration of views based on dependencies. These changes improve the migration process for tables and views in the `table-migration` workflow.
* Added workflow for in-place migrating external Parquet, Orc, Avro hiveserde tables ([#1412](https://github.com/databrickslabs/ucx/issues/1412)). This change introduces a new workflow, `MigrateHiveSerdeTablesInPlace`, for in-place upgrading external Parquet, Orc, and Avro hiveserde tables to the Unity Catalog. The workflow includes new functions to describe the table and extract hiveserde details, update the DDL from `show create table`, and replace the old table name with the migration target and DBFS mount table location if any. A new function `_migrate_external_table_hiveserde` has been added to `table_migrate.py`, and two new arguments, `mounts` and `hiveserde_in_place_migrate`, have been added to the `TablesMigrator` class. These arguments control which hiveserde to migrate and replace the DBFS mnt table location if any, enabling multiple tasks to run in parallel and migrate only one type of hiveserde at a time. This feature does not include user documentation, new CLI commands, or changes to existing commands, but it does add a new workflow and modify the existing `migrate_tables` function in `table_migrate.py`. The changes have been manually tested, but no unit tests, integration tests, or staging environment verification have been provided.
* Build dependency graph for local files ([#1462](https://github.com/databrickslabs/ucx/issues/1462)). This commit refactors dependency classes to distinguish between resolution and loading, and introduces new classes to handle different types of dependencies. A new method, `LocalFileMigrator.build_dependency_graph`, is implemented, following the pattern of `NotebookMigrator`, to build a dependency graph for local files. This resolves issue [#1202](https://github.com/databrickslabs/ucx/issues/1202) and addresses issue [#1360](https://github.com/databrickslabs/ucx/issues/1360). While the refactoring and implementation of new methods improve the accuracy of dependency graphs and ensure that dependencies are correctly registered based on the file's language, there are no user-facing changes, such as new or modified CLI commands, tables, or workflows. Unit tests are added to ensure that the new changes function as expected.
* Build dependency graph for site packages ([#1504](https://github.com/databrickslabs/ucx/issues/1504)). This commit introduces changes to the dependency graph building process for site packages within the ucx project. When a package is not recognized, package files are added as dependencies to prevent errors during import dependency determination, thereby fixing an infinite loop issue when encountering cyclical graphs. This resolves issue [#1427](https://github.com/databrickslabs/ucx/issues/1427) and is related to [#1202](https://github.com/databrickslabs/ucx/issues/1202). The changes include adding new methods for handling package files as dependencies and preventing infinite loops when visiting cyclical graphs. The `SitePackage` class in the `site_packages.py` file has been updated to handle package files more accurately, with the `__init__` method now accepting `module_paths` as a list of Path objects instead of a list of strings. A new method, `module_paths`, has also been introduced. Unit tests have been added to ensure the correct functionality of these changes, and a hack in the PR will be removed once issue [#1421](https://github.com/databrickslabs/ucx/issues/1421) is implemented.
* Build notebook dependency graph for `%run` cells ([#1279](https://github.com/databrickslabs/ucx/issues/1279)). A new `Notebook` class has been developed to parse source code and split it into cells, and a `NotebookDependencyGraph` class with related utilities has been added to discover dependencies in `%run` cells, addressing issue [#1201](https://github.com/databrickslabs/ucx/issues/1201). The new functionality enhances the management and tracking of dependencies within notebooks, improving code organization and efficiency. The commit includes updates to existing notebooks to utilize the new classes and methods, with no impact on existing functionality outside of the `%run` context.
* Create UC External Location, Schema, and Table Grants based on workspace-wide Azure SPN mount points ([#1374](https://github.com/databrickslabs/ucx/issues/1374)). This change adds new functionality to create Unity Catalog (UC) external location, schema, and table grants based on workspace-wide Azure Service Principal Names (SPN) mount points. The majority of the work was completed in a previous pull request. The main change in this pull request is the addition of a new test function, `test_migrate_external_tables_with_principal_acl_azure`, which tests the migration of tables with principal ACLs in an Azure environment. This function includes the creation of a new user with cluster access, another user without cluster access, and a new group with cluster access to validate the migration of table grants to these entities. The `make_cluster_permissions` method now accepts a `service_principal_name` parameter, and after migrating the tables with the `acl_strategy` set to `PRINCIPAL`, the function checks if the appropriate grants have been assigned to the Azure SPN. This change is part of an effort to improve the integration of Unity Catalog with Azure SPNs and is accessible through the UCX CLI command. The changes have been tested through manual testing, unit tests, and integration tests and have been verified in a staging environment.
* Detect DBFS use in SQL statements in notebooks ([#1372](https://github.com/databrickslabs/ucx/issues/1372)). A new linter has been added to detect and discourage the use of DBFS (Databricks File System) in SQL statements within notebooks. This linter raises deprecated advisories for any identified DBFS folder or mount point references in SQL statements, encouraging the use of alternative storage options. The change is implemented in the `NotebookLinter` class of the 'notebook_linter.py' file, and is tested through unit tests to ensure proper functionality. The target audience for this update includes software engineers who use Databricks or similar platforms, as the new linter will help users transition away from using DBFS in their SQL statements and adopt alternative storage methods.
* Detect `sys.path` manipulation ([#1380](https://github.com/databrickslabs/ucx/issues/1380)). A change has been introduced to the Python linter to detect manipulation of `sys.path`. New classes, AbsolutePath and RelativePath, have been added as subclasses of SysPath. The SysPathVisitor class has been implemented to track additions to sys.path and the visit_Call method in SysPathVisitor checks for 'sys.path.append' and 'os.path.abspath' calls. The new functionality includes a new method, collect_appended_sys_paths in PythonLinter, and a static method, list_appended_sys_paths, to retrieve the appended paths. Additionally, new tests have been added to the PythonLinter to detect manipulation of the `sys.path` variable, specifically the `list_appended_sys_paths` method. The new test cases include using aliases for `sys`, `os`, and `os.path`, and using both absolute and relative paths. This improvement will enhance the linter's ability to detect potential issues related to manipulation of the `sys.path` variable. The change resolves issue [#1379](https://github.com/databrickslabs/ucx/issues/1379) and is linked to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202). No user documentation or CLI commands have been added or modified, and no manual testing has been performed. Unit tests for the new functionality have been added.
* Detect direct access to cloud storage and raise a deprecation warning ([#1506](https://github.com/databrickslabs/ucx/issues/1506)). In this release, the Pyspark linter has been enhanced to detect and issue deprecation warnings for direct access to cloud storage. This change, which resolves issue [#1133](https://github.com/databrickslabs/ucx/issues/1133), introduces new classes `AstHelper` and `TableNameMatcher` to determine the fully-qualified name of functions and replace instances of direct cloud storage access with migration index table names. Instances of direct access using 'dbfs:/', 'dbfs://', and default 'dbfs:' references will now be detected and flagged with a deprecation warning. The test file `test_pyspark.py` has been updated to include new tests for detecting direct cloud storage access. Users should be aware of these changes when updating their code to avoid deprecation warnings.
* Detect imported files and packages ([#1362](https://github.com/databrickslabs/ucx/issues/1362)). This commit introduces functionality to parse Python code for `import` and `import from` processing instructions, enabling the detection and management of imported files and packages. It includes a new CLI command, modifications to existing commands, new and updated workflows, and additional tables. The code modifications include new methods for visiting Import and ImportFrom nodes, and the addition of unit tests to ensure correctness. Relevant user documentation has been added, and the new functionality has been tested through manual testing, unit tests, and verification on a staging environment. This comprehensive update enhances dependency management, code organization, and understanding for a more streamlined user experience.
* Enhanced migrate views task to support views created with explicit column list ([#1375](https://github.com/databrickslabs/ucx/issues/1375)). The commit enhances the migrate views task to better support handling of views with an explicit column list, improving overall compatibility. A new lookup based on `SHOW CREATE TABLE` has been added to extract the column list from the create script, ensuring accurate migration. The `_migrate_view_table` method has been refactored, and a new `_sql_migrate_view` method is added to fetch the create statement of the view. The `ViewToMigrate` class has been updated with a new `_view_dependencies` method to determine view dependencies in the new SQL text. Additionally, new methods `safe_sql_key` and `add_table` have been introduced, and the `sqlglot.parse` method is used to parse the code with `databricks` as the read argument. A new test for migrating views with an explicit column list has been added, along with the `upgraded_from` and `upgraded_to` table properties, and the migration status is updated to reflect successful migration. New test functions have also been added to test the migration of views with columns and ACLs. Dependency sqlglot has been updated to version ~=23.9.0, enhancing the overall functionality and compatibility of the migrate views task.
* Ensure that USE statements are recognized and apply to table references without a qualifying schema in SQL and pyspark ([#1433](https://github.com/databrickslabs/ucx/issues/1433)). This commit enhances the library's functionality in handling `USE` statements in both SQL and PySpark by ensuring they are recognized and applied to table references without a qualifying schema. A new `CurrentSessionState` class is introduced to manage the current schema of a session, and existing classes such as `FromTable` and `TableNameMatcher` are updated to use this new class. Additionally, the `lint` and `apply` methods have been updated to handle `USE` statements and improve the precision of table reference handling. These changes are particularly useful when working with tables in different schemas, ensuring the library can manage table references more accurately in SQL and PySpark. A new fixture, 'extended_test_index', has been added to support unit tests, and the test file 'test_notebook.py' has been updated to better reflect the intended schema for each table reference.
* Expand documentation for end to end workflows with external HMS ([#1458](https://github.com/databrickslabs/ucx/issues/1458)). The UCX toolkit has been updated to support integration with an external Hive Metastore (HMS), in addition to the default workspace HMS. This feature allows users to easily set up UCX to work with an existing external HMS, providing greater flexibility in managing and accessing data. During installation, UCX will scan for evidence of an external HMS in the cluster policies and Spark configurations. If found, UCX will prompt the user to connect to the external HMS, create a new policy with the necessary Spark and data access configurations, and set up job clusters accordingly. However, users will need to manually update the data access configuration for SQL Warehouses that are not configured for external HMS. Users can also create a cluster policy with appropriate Spark configurations and data access for external HMS, or edit existing policies in specified UCX workflows. Once set up, the assessment workflow will scan tables and views from the external HMS, and the table migration workflow will upgrade tables and views from the external HMS to the Unity Catalog. Users should note that if the external HMS is shared between multiple workspaces, a different inventory database name should be specified for each UCX installation. It is important to plan carefully when setting up a workspace with multiple external HMS, as the assessment dashboard will fail if the SQL warehouse is not configured correctly. Users can have multiple UCX installations in a workspace, each set up with a different external HMS, or manually modify the cluster policy and SQL data access configuration to point to the correct external HMS after UCX has been installed.
* Extend service principal migration with option to create access connectors with managed identity for each storage account ([#1417](https://github.com/databrickslabs/ucx/issues/1417)). This commit extends the service principal migration feature to create access connectors with managed identities for each storage account, enhancing security and isolation by preventing cross-account access. A new CLI command has been added, and an existing command has been modified. The `create_access_connectors_for_storage_accounts` method creates access connectors with the required permissions for each storage account used in external tables. The `_apply_storage_permission` method has also been updated. New unit and integration tests have been included, covering various scenarios such as secret value decoding, secret read exceptions, and single storage account testing. The necessary permissions for these connectors will be set in a subsequent pull request. Additionally, a new method, `azure_resources_list_access_connectors`, and `azure_resources_get_access_connector` have been introduced to ensure access connectors are returned as expected. This change has been tested manually and through automated tests, ensuring backward compatibility while providing improved security features.
* Fixed UCX policy creation when instance pool is specified ([#1457](https://github.com/databrickslabs/ucx/issues/1457)). In this release, we have made significant improvements to the handling of instance pools in UCX policy creation. The `policy.py` file has been updated to properly handle the case when an instance pool is specified, by setting the `instance_pool_id` attribute and removing the `node_type_id` attribute in the policy definition. Additionally, the availability attribute has been removed for all cloud providers, including AWS, Azure, and GCP, when an instance pool ID is provided. A new `pop` method call has also been added to remove the `gcp_attributes.availability` attribute when an instance pool ID is provided. These changes ensure consistency in the policy definition across all cloud providers. Furthermore, tests for this functionality have been updated in the 'test_policy.py' file, specifically the `test_cluster_policy_instance_pool` function, to check the correct addition of the instance pool to the cluster policy. The purpose of these changes is to improve the reliability and functionality of UCX policy creation, specifically when an instance pool is specified.
* Fixed `migrate-credentials` command on aws ([#1501](https://github.com/databrickslabs/ucx/issues/1501)). In this release, the `migrate-credentials` command for the `labs.yml` configuration file has been updated to include new flags for specifying a subscription ID and AWS profile. This allows users to scan a specific storage account and authenticate using a particular AWS profile when migrating credentials for storage access to UC storage credentials. The `create-account-groups` command remains unchanged. Additionally, several issues related to the `migrate-credentials` command for AWS have been addressed, such as hallucinating the presence of a `--profile` flag, using a monotonically increasing role ID, and not handling cases where there are no IAM roles to migrate. The `run` method of the `AwsUcStorageCredentials` class has been updated to handle these cases, and several test functions have been added or updated to ensure proper functionality. These changes improve the functionality and robustness of the `migrate-credentials` command for AWS.
* Fixed edge case for `RegexSubStrategy` ([#1561](https://github.com/databrickslabs/ucx/issues/1561)). In this release, we have implemented fixes for the `RegexSubStrategy` class within the `GroupMigrationStrategy`, addressing an issue where matching account groups could not be found using the display name. The `generate_migrated_groups` function has been updated to include a check for account groups with matching external IDs when either the display name or regex substitution of the display name fails to yield a match. Additionally, we have expanded testing for the `GroupManager` class, which handles group management. This includes new tests using regular expressions to match groups, and ensuring that the `GroupManager` class can correctly identify and manage groups based on different criteria such as the group's ID, display name, or external ID. These changes improve the robustness of the `GroupMigrationStrategy` and ensure the proper functioning of the `GroupManager` class when using regular expression substitution and matching.
* Fixed table in mount partition scans for JSON and CSV ([#1437](https://github.com/databrickslabs/ucx/issues/1437)). This release introduces a fix for an issue where table scans on partitioned CSV and JSON files were not being correctly identified. The `TablesInMounts` scan function has been updated to accurately detect these files, addressing the problem reported in issue [#1389](https://github.com/databrickslabs/ucx/issues/1389) and linked issue [#1437](https://github.com/databrickslabs/ucx/issues/1437). To ensure functionality, new private methods `_find_partition_file_format` and `_assess_path` have been introduced, with the latter updated to handle partitioned directories. Additionally, unit tests have been added to test partitioned CSVs and JSONs, simulating the file system's response to various calls. These changes provide enhanced detection and handling of partitioned CSVs and JSONs in the `TablesInMounts` scan function.
* Forward remote logs on `run_workflow` and removed `destroy-schema` workflow in favour of `databricks labs uninstall ucx` ([#1349](https://github.com/databrickslabs/ucx/issues/1349)). In this release, the `destroy-schema` workflow has been removed and replaced with the `databricks labs uninstall ucx` command, addressing issue [#1186](https://github.com/databrickslabs/ucx/issues/1186). The `run_workflow` function has been updated to forward remote logs, and the `run_task` function now accepts a new argument `sql_backend`. The `Task` class includes a new method `is_testing()` and has been updated to use `RuntimeBackend` before `SqlBackend` in the `databricks.labs.lsql.backends` module. The `TaskLogger` class has been modified to include a new argument `attempt` and a new class method `log_path()`. The `verify_metastore` method in the `verification.py` file has been updated to handle `PermissionDenied` exceptions more gracefully. The `destroySchema` class and its `destroy_schema` method have been removed. The `workflow_task.py` file has been updated to include a new argument `attempt` in the `task_run_warning_recorder` method. These changes aim to improve the system's efficiency, error handling, and functionality.
* Give all access connectors `Storage Blob Data Contributor` role ([#1425](https://github.com/databrickslabs/ucx/issues/1425)). A new change has been introduced to grant the `Storage Blob Data Contributor` role, which provides the highest level of data access, to all access connectors for each storage account in the system. This adjustment, part of issue [#142](https://github.com/databrickslabs/ucx/issues/142)
* Grant uber principal write permissions so that SYNC command will succeed ([#1505](https://github.com/databrickslabs/ucx/issues/1505)). A change has been implemented to modify the `databricks labs ucx create-uber-principal` command, granting the uber principal write permissions on Azure Blob Storage. This aligns with the existing implementation on AWS where the uber principal has write access to all S3 buckets. The modification includes the addition of a new role, "STORAGE_BLOB_DATA_CONTRIBUTOR", to the `_ROLES` dictionary in the `resources.py` file. A new method, `clean_up_spn`, has also been added to clear ucx uber service principals. This change resolves issue [#939](https://github.com/databrickslabs/ucx/issues/939) and ensures consistent behavior with AWS, enabling the uber principal to have write permissions on all Azure blob containers and ensuring the success of the `SYNC` command. The changes have been manually tested but not yet verified on a staging environment.
* Handled new output format of `SHOW TBLPROPERTIES` command ([#1381](https://github.com/databrickslabs/ucx/issues/1381)). A recent commit has been made to address an issue with the `test_revert_migrated_table` test failing due to the new output format of the `SHOW TBLPROPERTIES` command in the open-source library. Previously, the output was blank if a table property was missing, but now it shows a message indicating that the table does not have the specified property. The commit updates the `is_migrated` method in the `migration_status.py` file to handle this new output format, where the method now uses the `fetch` method to retrieve the `upgraded_to` property for a given schema and table. If the property is missing, the method will continue to the next table. The commit also updates tests for the changes, including a manual test that has not been verified on a staging environment. Changes have been made in the `test_table_migrate.py` file, where rows with table properties have been updated to return new data, and the `timestamp` function now sets the `datetime.datetime` to a `FakeDate`. No new methods have been added, and existing functionality related to `SHOW TBLPROPERTIES` command output handling has been changed in scope.
* Ignore whitelisted imports ([#1367](https://github.com/databrickslabs/ucx/issues/1367)). This commit introduces a new class `DependencyResolver` that filters Python import dependencies based on a whitelist, and updates to the `DependencyGraph` class to support this new resolver. A new optional parameter `resolver` has been added to the `NotebookMigrator` class constructor and the `DependencyGraph` constructor. A new file `whitelist.py` has been added, introducing classes and functions for defining and managing a whitelist of Python packages based on their name and version. These changes aim to improve control over which dependencies are included in the dependency graph, contributing to a more modular and maintainable codebase.
* Increased memory for ucx clusters ([#1366](https://github.com/databrickslabs/ucx/issues/1366)). This release introduces an update to enhance memory configuration for UCX clusters, addressing issue [#1366](https://github.com/databrickslabs/ucx/issues/1366). The main change involves a new method for selecting a node type with a minimum of 16GB of memory and local disk enabled, implemented in the policy.py file of the installer module. This modification results in the `node_type_id` parameter for creating clusters, instance pools, and pipelines now requiring a minimum memory of 16 GB. This change is reflected in the fixtures.py file, `ws.clusters.select_node_type()`, `ws.instance_pools.create()`, and `pipelines.PipelineCluster` method calls, ensuring that any newly created clusters, instance pools, and pipelines benefit from the increased memory allocation. This update aims to improve user experience by offering higher memory configurations out-of-the-box for UCX-related workloads.
* Integrate detection of notebook dependencies ([#1338](https://github.com/databrickslabs/ucx/issues/1338)). In this release, the NotebookMigrator has been updated to integrate dependency graph construction for detecting notebook dependencies, addressing issues 1204, 1286, and 1326. The changes include modifying the NotebookMigrator class to include the dependency graph and updating relevant tests. A new file, python_linter.py, has been added for linting Python code, which now detects calls to "dbutils.notebook.run" with dynamic paths. The linter uses the ast module to parse the code and locate nodes matching the specified criteria. The NotebookMigrator's apply method has been updated to check for ObjectType.NOTEBOOK, loading the notebook using the new _load_notebook method, and incorporating a new _apply method for modifying the code in the notebook based on applicable fixes. A new DependencyGraph class has been introduced to build a graph of dependencies within the notebook, and several new methods have been added, including _load_object, _load_notebook_from_path, and revert. This release is co-authored by Cor and aims to improve dependency management in the notebook system.
* Isolate grants computation when migrating tables ([#1233](https://github.com/databrickslabs/ucx/issues/1233)). In this release, we have implemented a change to improve the reliability of table migrations. Previously, grants to migrate were computed and snapshotted outside the loop that iterates through tables to migrate, which could lead to inconsistencies if the grants or migrated groups changed during migration. Now, grants are re-computed for each table, reducing the chance of such issues. We have introduced a new method `_compute_grants` that takes in the table to migrate, ACL strategy, and snapshots of all grants to migrate, migrated groups, and principal grants. If `acl_strategy` is `None`, it defaults to an empty list. The method checks each strategy in the ACL strategy list, extending the `grants` list if the strategy is `AclMigrationWhat.LEGACY_TACL` or `AclMigrationWhat.PRINCIPAL`. The `migrate_tables` method has been updated to use this new method to compute grants. It first checks if `acl_strategy` is `None`, and if so, sets it to an empty list. It then calls `_compute_grants` with the current table, `acl_strategy`, and the snapshots of all grants to migrate, migrated groups, and principal grants. The computed grants are then used to migrate the table. This change enhances the robustness of the migration process by isolating grants computation for each table.
* Log more often from workflows ([#1348](https://github.com/databrickslabs/ucx/issues/1348)). In this update, the log formatting for the debug log file in `tasks.py` of the `databricks/labs/ucx/framework` module has been modified. The `TimedRotatingFileHandler` now rotates the log file every minute instead of every 10 minutes. Furthermore, the logging format has been enhanced to include the time, level name, name, thread name, and message. These improvements are in response to issue [#1171](https://github.com/databrickslabs/ucx/issues/1171) and the implementation of more frequent logging as per issue [#1348](https://github.com/databrickslabs/ucx/issues/1348), ensuring more detailed and up-to-date logs for debugging and analysis purposes.
* Make `databricks labs ucx assign-metastore` prompt for workspace if no workspace id provided ([#1500](https://github.com/databrickslabs/ucx/issues/1500)). The `databricks labs ucx assign-metastore` command has been updated to allow for an optional `workspace_id` parameter, with a prompt for the workspace ID displayed if it is not provided. Both the `assign-metastore` and `show-all-metastores` commands have been made account-level only. The functionality of the `migrate_local_code` function remains unchanged. Error handling for etag issues related to default catalog settings has been implemented. Unit tests and manual testing have been conducted on a staging environment to verify the changes. The `show_all_metastores` and `assign_metastore` commands have been updated to accept an optional `workspace_id` parameter. The unit tests cover various scenarios, including cases where a user has multiple metastores and needs to select one, as well as cases where a default catalog name is provided and needs to be selected. If no metastore is found, a `ValueError` will be raised. The `metastore_id` and `workspace_id` flags in the yml file have been renamed to `metastore-id` and `workspace-id`, respectively, and a new `default-catalog` flag has been added.
* Modified update existing role to amend the AssumeRole statement rather than rewriting it ([#1423](https://github.com/databrickslabs/ucx/issues/1423)). The `_aws_role_trust_doc` method of the `aws.py` file has been updated to return a dictionary object instead of a JSON string for the AWS IAM role trust policy document. This change allows for more fine-grained control when updating the trust relationships of an existing role in AWS IAM. The `create_uc_role` method has been updated to pass the role trust document to the `_create_role` method using the `_get_json_for_cli` method. The `update_uc_trust_role` method has been refactored to retrieve the existing role's trust policy document, modify its `Statement` field, and replace it with the returned value of the `_aws_role_trust_doc` method with the specified `external_id`. Additionally, the `test_update_uc_trust_role` function in the `test_aws.py` file has been updated to provide more detailed and realistic mocked responses for the `command_call` function, including handling the case where the `iam update-assume-role-policy` command is called and returning a mocked response with a modified assume role policy document that includes a new principal with an external ID condition. These changes improve the testing capabilities of the `test_update_uc_trust_role` function and provide more comprehensive testing of the assume role statement and role update functionality.
* Modifies dependency resolution logic to detect deprecated use of s3fs package ([#1395](https://github.com/databrickslabs/ucx/issues/1395)). In this release, the dependency resolution logic has been enhanced to detect and handle deprecated usage of the s3fs package. A new function, `_download_side_effect`, has been implemented to mock the download behavior of the `workspace_client_mock` function, allowing for more precise control during testing. The `DependencyResolver` class now includes a list of `Advice` objects to inform developers about the use of deprecated dependencies, without modifying the `DependencyGraph` class. This change also introduces a new import statement for the s3fs package, encouraging the adoption of up-to-date packages and practices for improved system compatibility and maintainability. Additionally, a unit test file, test_s3fs.py, has been added with test cases for various import scenarios of s3fs to ensure proper detection and issuance of deprecation warnings.
* Prompt for warehouse choice in uninstall if the original chosen warehouse does not exist anymore ([#1484](https://github.com/databrickslabs/ucx/issues/1484)). In this release, we have added a new method `_check_and_fix_if_warehouse_does_not_exists()` to the `WorkspaceInstaller` class, which checks if the specified warehouse in the configuration still exists. If it doesn't, the method generates a new configuration using a new `WorkspaceInstaller` object, saves it, and updates the `_sql_backend` attribute with the new warehouse ID. This change ensures that if the original chosen warehouse no longer exists, the user will be prompted to choose a new one during uninstallation. Additionally, we have added a new import statement for `ResourceDoesNotExist` exception and introduced a new function `test_uninstallation_after_warehouse_is_deleted`, which simulates a scenario where a warehouse has been manually deleted and checks if the uninstallation process correctly resets the warehouse. The `StatementExecutionBackend` object is initialized with a non-existent warehouse ID, and the configuration and sql_backend objects are updated accordingly. This test case ensures that the uninstallation process handles the scenario where a warehouse has been manually deleted.
* Propagate source location information within the import package dependency graph ([#1431](https://github.com/databrickslabs/ucx/issues/1431)). This change modifies the dependency graph build logic within several modules of the `databricks.labs.ucx` package to propagate source location information within the import package dependency graph. A new `ImportDependency` class now represents import sources, and a `list_import_sources` method returns a list of `ImportDependency` objects, which include import string and original source code file path. A new `IncompatiblePackage` class is added to the `Whitelist` class, returning `UCCompatibility.NONE` when checking for compatibility. The `ImportChecker` class checks for deprecated imports and returns `Advice` or `Deprecation` objects with location information. Unit tests have been added to ensure the correct behavior of these changes. Additionally, the `Location` class and a new test function for invalid processors have been introduced.
* Scan `site-packages` ([#1411](https://github.com/databrickslabs/ucx/issues/1411)). A `SitePackages` scanner has been implemented, enhancing the linkage of module root names with the actual Python code within installed packages using metadata. This development addresses issue [#1410](https://github.com/databrickslabs/ucx/issues/1410) and is connected to [#1202](https://github.com/databrickslabs/ucx/issues/1202). New functionalities include user documentation, a CLI command, a workflow, and a table, accompanied by modifications to an existing command and workflow, as well as alterations to another table. Unit tests have been added to ensure the feature's proper functionality. In the diff, a new unit test file for `site_packages.py` has been added, checking for `databrix` compatibility, which is reported as incompatible. This enhancement aims to bolster the user experience by providing more detailed insights into installed packages.
* Select DISTINCT job_run_id ([#1352](https://github.com/databrickslabs/ucx/issues/1352)). A modification has been implemented to optimize the SQL query for accessing log data, now retrieving distinct job_run_ids instead of a single one, nested in a subquery. The enhanced query selects the message field from the inventory.logs table, filtering based on job_run_id matches with the latest timestamp within the same table. This change enables multiple job_run_ids to correlate with the same timestamp, delivering a more holistic perspective of logs at a given moment. By upgrading the query functionality to accommodate multiple job run IDs, this improvement ensures more precise and detailed retrieval of log data.
* Support table migration to Unity Catalog in Python code ([#1210](https://github.com/databrickslabs/ucx/issues/1210)). This release introduces changes to the Python codebase that enhance the SparkSql linter/fixer to support migrating Spark SQL table references to Unity Catalog. The release includes modifications to existing commands, specifically `databricks labs ucx migrate_local_code`, and the addition of unit tests. The `SparkSql` class has been updated to support a new `index` parameter, allowing for migration support. New classes including `QueryMatcher`, `TableNameMatcher`, `ReturnValueMatcher`, and `SparkMatchers` have been added to hold various matchers for different spark methods. The release also includes modifications to existing methods for caching, creating, getting, refreshing, and un-caching tables, as well as updates to the `listTables` method to reflect the new format. The `saveAsTable` and `register` methods have been updated to handle variable and f-string arguments for the table name. The `databricks labs ucx migrate_local_code` command has been modified to handle spark.sql function calls that include a table name as a parameter and suggest necessary changes to migrate to the new Unity Catalog format. Integration tests are still needed.
* When building dependency graph, raise problems with problematic dependencies ([#1529](https://github.com/databrickslabs/ucx/issues/1529)). A new `DependencyProblem` class has been added to the databricks.labs.ucx.source_code.dependencies module to handle issues encountered during dependency graph construction. This class is used to raise issues when problematic dependencies are encountered during the build of the dependency graph. The `build_dependency_graph` method of the `SourceContainer` abstract class now accepts a `problem_collector` parameter, which is a callable function that collects and handles dependency problems. Instead of raising `ValueError` exceptions, the `DependencyProblem` class is used to collect and store information about the issues. This change improves error handling and diagnostic information during dependency graph construction. Relevant user documentation, a new CLI command, and a new workflow have been added, along with modifications to existing commands and workflows. Unit tests have been added to verify the new functionality.
* WorkspacePath to implement `pathlib.Path` API ([#1509](https://github.com/databrickslabs/ucx/issues/1509)). A new file, `wspath.py`, has been added to the `mixins` directory of the `databricks.labs.ucx` package, implementing the custom Path object `WorkspacePath`. This subclass of `pathlib.Path` provides additional methods and functionality for the Databricks Workspace, including `cwd()`, `home()`, `scandir()`, and `listdir()`. `WorkspacePath` interacts with the Databricks Workspace API for operations such as checking if a file/directory exists, creating and deleting directories, and downloading files. The `WorkspacePath` class has been updated to implement the `pathlib.Path` API for a more intuitive and consistent interface when working with file and directory paths. The class now includes methods like `absolute()`, `exists()`, `joinpath()`, `parent`, and supports the `with` statement for thread-safe code. A new test file `test_wspath.py` has been added for the `WorkspacePath` mixin. New methods like `expanduser()`, `as_fuse()`, `as_uri()`, `replace()`, `write_text()`, `write_bytes()`, `read_text()`, and `read_bytes()` have also been added. `mkdir()` and `rmdir()` now raise errors when called on non-absolute paths and non-empty directories, respectively.
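
As an illustration of the `sys.path` detection entry above, here is a minimal sketch of how an `ast.NodeVisitor` can collect the literal arguments of `sys.path.append(...)` calls. It is not the UCX implementation: the names `SysPathAppendVisitor` and `list_appended_paths` are hypothetical, and the sketch deliberately ignores aliases for `sys` and `os.path.abspath` wrappers, which the release note says the real linter handles.

```python
import ast


class SysPathAppendVisitor(ast.NodeVisitor):
    """Hypothetical helper: records string literals passed to sys.path.append()."""

    def __init__(self) -> None:
        self.appended: list[str] = []

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        # Match the exact attribute chain `sys.path.append`.
        if (
            isinstance(func, ast.Attribute)
            and func.attr == "append"
            and isinstance(func.value, ast.Attribute)
            and func.value.attr == "path"
            and isinstance(func.value.value, ast.Name)
            and func.value.value.id == "sys"
        ):
            for arg in node.args:
                # Only literal strings are collected; dynamic values are skipped.
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    self.appended.append(arg.value)
        # Keep walking so nested calls are visited too.
        self.generic_visit(node)


def list_appended_paths(code: str) -> list[str]:
    """Parse `code` and return the literal paths appended to sys.path."""
    visitor = SysPathAppendVisitor()
    visitor.visit(ast.parse(code))
    return visitor.appended
```

Calling `list_appended_paths` on a snippet that appends one literal path and one variable returns only the literal, since the sketch cannot resolve variables statically.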

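The per-minute log rotation described in the "Log more often from workflows" entry can be sketched with the standard library alone. This is an assumption-laden illustration rather than the code in `tasks.py`: the function name `build_debug_handler` and the exact format string are hypothetical, chosen only to include the fields the entry lists (time, level name, name, thread name, message).

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_debug_handler(log_file: Path) -> TimedRotatingFileHandler:
    """Hypothetical helper: a debug file handler that rolls over every minute."""
    # when="M" with interval=1 asks for a rollover once per minute.
    handler = TimedRotatingFileHandler(str(log_file), when="M", interval=1)
    # Assumed format: time, level name, logger name, thread name, message.
    formatter = logging.Formatter(
        "%(asctime)s %(levelname)s [%(name)s][%(threadName)s] %(message)s"
    )
    handler.setFormatter(formatter)
    handler.setLevel(logging.DEBUG)
    return handler
```

Attaching the returned handler to a logger with `logger.addHandler(...)` would then write debug records to the file, with a rollover due once a minute has elapsed.
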
Dependency updates:

 * Bump actions/checkout from 3 to 4 ([#1191](https://github.com/databrickslabs/ucx/pull/1191)).
 * Bump actions/setup-python from 4 to 5 ([#1189](https://github.com/databrickslabs/ucx/pull/1189)).
 * Bump codecov/codecov-action from 1 to 4 ([#1190](https://github.com/databrickslabs/ucx/pull/1190)).
 * Bump softprops/action-gh-release from 1 to 2 ([#1188](https://github.com/databrickslabs/ucx/pull/1188)).
 * Bump databricks-sdk from 0.23.0 to 0.24.0 ([#1223](https://github.com/databrickslabs/ucx/pull/1223)).
 * Updated databricks-labs-lsql requirement from ~=0.3.0 to >=0.3,<0.5 ([#1387](https://github.com/databrickslabs/ucx/pull/1387)).
 * Updated sqlglot requirement from ~=23.9.0 to >=23.9,<23.11 ([#1409](https://github.com/databrickslabs/ucx/pull/1409)).
 * Updated sqlglot requirement from <23.11,>=23.9 to >=23.9,<23.12 ([#1486](https://github.com/databrickslabs/ucx/pull/1486)).
nfx added a commit that referenced this pull request Apr 26, 2024
* A notebook linter to detect DBFS references within notebook cells
([#1393](https://github.com/databrickslabs/ucx/issues/1393)). A new
linter has been implemented in the open-source library to identify
references to Databricks File System (DBFS) mount points or folders
within SQL and Python cells of Notebooks, raising Advisory or Deprecated
alerts when detected. This feature, resolving issue
[#1108](https://github.com/databrickslabs/ucx/issues/1108), enhances
code maintainability by discouraging DBFS usage, and improves security
by avoiding hard-coded DBFS paths. The linter's functionality includes
parsing the code and searching for Table elements within statements,
raising warnings when DBFS references are found. Implementation changes
include updates to the `NotebookLinter` class, a new `from_source` class
method, and an `original_offset` argument in the `Cell` class. The
linter now also supports the `databricks` dialect for SQL code parsing.
* Added CLI commands to trigger table migration workflow
([#1511](https://github.com/databrickslabs/ucx/issues/1511)). A new
`migrate_tables` command has been added to the 'databricks.labs.ucx.cli'
module, which triggers the `migrate-tables` workflow and, optionally,
the `migrate-external-hiveserde-tables-in-place-experimental` workflow.
The `migrate-tables` workflow is responsible for managing table
migrations, while the
`migrate-external-hiveserde-tables-in-place-experimental` workflow
handles migrations for external hiveserde tables. The new `What` class
from the 'databricks.labs.ucx.hive_metastore.tables' module is used to
identify hiveserde tables. If hiveserde tables are detected, the user is
prompted to confirm running the
`migrate-external-hiveserde-tables-in-place-experimental` workflow. The
`migrate_tables` command requires a WorkspaceClient and Prompts objects
and accepts an optional WorkspaceContext object, which is set to the
WorkspaceContext of the WorkspaceClient if not provided. Additionally, a
new `migrate_external_hiveserde_tables_in_place` command has been added
which will run the
`migrate-external-hiveserde-tables-in-place-experimental` workflow if it
finds any hiveserde tables, making it easier to manage table migrations
from the command line.
* Added CSV, JSON and include path in mounts
([#1329](https://github.com/databrickslabs/ucx/issues/1329)). In this
release, the TablesInMounts function has been enhanced to support CSV
and JSON file formats, along with the existing Parquet and Delta table
formats. The new `include_paths_in_mount` parameter has been introduced,
enabling users to specify a list of paths to crawl within all mounts.
The WorkspaceConfig class in the config.py file has been updated to
accommodate these changes. Additionally, a new `_assess_path` method has
been introduced to assess the format of a given file and return a
`TableInMount` object accordingly. Several existing methods, such as
`_find_delta_log_folders`, `_is_parquet`, `_is_csv`, `_is_json`, and
`_path_is_delta`, have been updated to reflect these improvements.
Furthermore, two new unit tests, `test_mount_include_paths` and
`test_mount_listing_csv_json`, have been added to ensure the proper
functioning of the TablesInMounts function with the new file formats and
the `include_paths_in_mount` parameter. These changes aim to improve the
functionality and flexibility of the TablesInMounts library, allowing
for more precise crawling and identification of tables based on specific
file formats and paths.
* Added CTAS migration workflow for external tables cannot be in place
migrated ([#1510](https://github.com/databrickslabs/ucx/issues/1510)).
In this release, we have added a new CTAS (Create Table As Select)
migration workflow for external tables that cannot be migrated in-place.
This feature includes a `MigrateExternalTablesCTAS` class with three
tasks to migrate non-SYNC supported and non-HiveSerde external tables,
migrate HiveSerde tables, and migrate views from the Hive Metastore to
the Unity Catalog. We have also added new methods for managed and
external table migration, deprecated old methods, and added a new test
function to ensure proper CTAS migration for external tables using
HiveSerDe. This change also introduces a new JSON file for external
table configurations and a mock backend to simulate the Hive Metastore
and test the migration process. Overall, these changes improve the
migration capabilities for external tables and ensure a more flexible
and reliable migration process.
* Added Python linter for table creation with implicit format
([#1435](https://github.com/databrickslabs/ucx/issues/1435)). A new
linter has been added to the Python library to advise on implicit table
formats when the 'writeTo', 'table', 'insertInto', or `saveAsTable`
methods are invoked without an explicit format specified in the same
chain of calls. This feature is useful for software engineers working
with Databricks Runtime (DBR) v8.0 and later, where the default table
format changed from `parquet` to 'delta'. The linter, implemented in
'table_creation.py', utilizes reusable AST utilities from
'python_ast_util.py' and is not automated, providing advice instead of
fixing the code. The linter skips linting when a DBR version of 8.0 or
higher is passed, as those versions already default to `delta` and the
change only affects code written for versions prior to 8.0. Unit tests
have been added for both files as part of the
code migration workflow.
* Added Support for Migrating Table ACL of Interactive clusters using
SPN ([#1077](https://github.com/databrickslabs/ucx/issues/1077)). This
change introduces support for migrating table Access Control Lists
(ACLs) of interactive clusters using a Service Principal Name (SPN) for
Azure Databricks environments in the UCX project. It includes
modifications to the `hive_metastore` and `workspace_access` modules, as
well as the addition of new classes, methods, and import statements for
handling ACLs and grants. This feature enables more secure and granular
control over table permissions when using SPN authentication for
interactive clusters in Azure. This will benefit software engineers
working with interactive clusters in Azure Databricks by enhancing
security and providing more control over data access.
* Added Support for migrating Schema/Catalog ACL for Interactive cluster
([#1413](https://github.com/databrickslabs/ucx/issues/1413)). This
commit adds support for migrating schema and catalog ACLs for
interactive clusters, specifically for AWS and Azure, with partial fixes
for issues [#1192](https://github.com/databrickslabs/ucx/issues/1192)
and [#1193](https://github.com/databrickslabs/ucx/issues/1193). The
changes identify and filter database ACL grants, create mappings from
Hive metastore schema to Unity Catalog schema and catalog, and replace
Hive metastore actions with equivalent Unity Catalog actions for both
schema and catalog. External location permission is not included in this
commit and will be addressed separately. New methods for creating
mappings, updating principal ACLs, and getting catalog schema grants
have been added, and existing functionalities have been modified to
handle both AWS and Azure. The code has undergone manual testing and
passed unit and integration tests. The changes are targeted towards
software engineers who adopt the project.
* Added `databricks labs ucx logs` command
([#1350](https://github.com/databrickslabs/ucx/issues/1350)). A new
command, 'databricks labs ucx logs', has been added to the open-source
library to enhance logging and debugging capabilities. This command
allows developers and administrators to view logs from the latest job
run or specify a particular workflow name to display its logs. By
default, logs with levels of INFO, WARNING, and ERROR are shown, but the
--debug flag can be used for more detailed DEBUG logs. This feature
utilizes the relay_logs method from the deployed_workflows object in the
WorkspaceContext class and addresses issue
[#1282](https://github.com/databrickslabs/ucx/issues/1282). The addition
of this command aims to improve the usability and maintainability of the
framework, making it easier for users to diagnose and resolve issues.
* Added check for DBFS mounts in SQL code
([#1351](https://github.com/databrickslabs/ucx/issues/1351)). A new
feature has been introduced to check for Databricks File System (DBFS)
mounts within SQL code, enhancing data management and accessibility in
the Databricks environment. The `dbfsqueries.py` file in the
`databricks/labs/ucx/source_code` directory now includes a function that
verifies the presence of DBFS mounts in SQL queries and returns
appropriate messages. The `Languages` class in the `__init__` method has
been updated to incorporate a new class, `FromDbfsFolder`, which
replaces the existing `from_table` linter with a new linter,
`DBFSUsageLinter`, for handling DBFS usage in SQL code. In addition, new
methods have been added to the DBFS usage linter to check for deprecated
DBFS mounts in SQL code, returning deprecation warnings as needed. These
enhancements
ensure more robust handling of DBFS mounts throughout the system,
allowing for better integration and management of DBFS-related issues in
SQL-based operations.
* Added check for circular view dependency
([#1502](https://github.com/databrickslabs/ucx/issues/1502)). A circular
view dependency check has been implemented to prevent issues caused by
circular dependencies in views. This includes a new test for chained
circular dependencies (A->B, B->C, C->A) and an update to the existing
circular view dependency test. The checks have been implemented through
modifications to the tests in `test_views_sequencer.py`, including a new
test method and an update to the existing test method. If any circular
dependencies are encountered during migration, a ValueError with an
error message will be raised. These changes include updates to the
`tables_and_views.json` file, with the addition of a new view `v12` that
depends on `v11`, creating a circular dependency. The changes have been
tested through the addition of unit tests and are expected to function
as intended. The existing `_next_batch` method has been changed, and two
new methods, `_check_circular_dependency` and `_get_view_instance`, have
been introduced.
* Added commands for metastores listing & assignment
([#1489](https://github.com/databrickslabs/ucx/issues/1489)). This
commit introduces new commands for handling metastores in the Databricks
Labs UCX tool, which enables more efficient management
of metastores. The `databricks labs ucx assign-metastore` command
automatically assigns a metastore to a specified workspace when
possible, while the `databricks labs ucx show-all-metastores` command
displays all possible metastores that can be assigned to a workspace.
These changes include new methods for handling metastores in the account
and workspace classes, as well as new user documentation, manual
testing, and unit tests. The new functionality is added to improve the
usability and efficiency of the UCX tool in handling metastores.
Additional information on the UCX metastore commands is provided in the
README.md file.
* Added functionality to migrate external tables using Create Table (No
Sync) ([#1432](https://github.com/databrickslabs/ucx/issues/1432)). A
new feature has been implemented for migrating external tables in
Databricks' Hive metastore using the "Create Table (No Sync)" method.
This feature includes the addition of two new methods,
`_migrate_non_sync_table` and `_get_create_in_place_sql`, for handling
migration and SQL query generation. The existing methods
`_migrate_dbfs_root_table` and `_migrate_acl` have also been updated. A
test case has been added to demonstrate migration of external tables
while preserving their location and properties. This new functionality
provides more flexibility in managing migrations for specific use cases.
The SQL parsing library sqlglot has been utilized to replace the current
table name with the updated catalog and change the CREATE statement to
CREATE IF NOT EXISTS. This increases the efficiency and security of
migrating external tables in the Databricks' Hive metastore.
* Added initial version of account-level installer
([#1339](https://github.com/databrickslabs/ucx/issues/1339)). A new
account-level installer has been added to the UCX library, allowing
account administrators to install UCX on all workspaces within an
account in a single operation. The installer authenticates to the
account, prompts the user for configuration of the first workspace, and
then runs the installation and offers to repeat the process for all
remaining workspaces. This is achieved through the creation of a new
`prompt_for_new_installation` method which saves user responses to a new
`InstallationConfig` data class, allowing for reuse in other workspaces.
The existing `databricks labs install ucx` command now supports
account-level installation when the `UCX_FORCE_INSTALL` environment
variable is set to 'account'. The changes have been manually tested and
include updates to documentation and error handling for
`PermissionDenied`, `NotFound`, and `ValueError` exceptions.
Additionally, a new `AccountInstaller` class has been added to manage
the installation process at the account level.
* Added linting for DBFS usage
([#1341](https://github.com/databrickslabs/ucx/issues/1341)). A new
linter, "DBFSUsageLinter", has been added to our open-source library to
check for deprecated file system paths in Python code, specifically for
Databricks File System (DBFS) usage. Implemented as part of the
"databricks.labs.ucx.source_code" package in the "languages.py" file,
this linter defines a visitor, "DetectDbfsVisitor", that detects file
system paths in the code and checks them against a list of known
deprecated paths. If a match is found, it creates a Deprecation or
Advisory object with information about the deprecated code, including
the line number and column offset, and adds it to a list. This feature
will assist in identifying and removing deprecated file system paths
from the codebase, ensuring consistent and proper use of DBFS within the
project.
* Added log task to parse logs and store the logs in the ucx database
([#1272](https://github.com/databrickslabs/ucx/issues/1272)). A new log
task has been added to parse logs and store them in the ucx database,
added as a log crawler task to all workflows after other tasks have
completed. The LogRecord has been updated to include all necessary
fields, and logs below a certain minimum level will no longer be stored.
A new CLI command to retrieve errors and warnings from the latest
workflow run has been added, while existing commands and workflows have
been modified. User documentation has been updated, and new methods have
been added for log parsing and storage. A new table called `logs` has
been added to the database, and unit and integration tests have been
added to ensure functionality. This change also resolves issues
[#1148](https://github.com/databrickslabs/ucx/issues/1148) and
[#1283](https://github.com/databrickslabs/ucx/issues/1283), with
modifications to existing classes such as RuntimeContext,
TaskRunWarningRecorder, and LogRecord, and the addition of new classes
and methods including HiveMetastoreLineageEnabler and LogRecord in the
logs.py file. The deploy_schema function has been updated to include the
new table, and the existing command `databricks labs ucx` has been
modified to accommodate the new log functionality. Existing workflows
have been updated and a new workflow has been added, all of which are
tested through unit tests, integration tests, and manual testing. The
`TaskLogger` class and `TaskRunWarningRecorder` class are used to log
and record task run data, with the `parse_logs` method used to parse log
files into partial log records, which are then used to create snapshot
rows in the `logs` table.
* Added migration for non delta dbfs tables using Create Table As Select
(CTAS). Convert such tables to Delta tables
([#1434](https://github.com/databrickslabs/ucx/issues/1434)). In this
release, we've developed new methods to migrate non-Delta DBFS root
tables to managed Delta tables, enhancing compatibility with various
table formats and configurations. We've added support for safer SQL
statement generation in our Create Table As Select (CTAS) functionality
and incorporated new creation methods. Additionally, we've introduced
grant assignments during the migration process and updated integration
tests. The changes include the addition of a `TablesMigrator` class with
an updated `migrate_tables` method, a new `PrincipalACL` parameter, and
the `test_dbfs_non_delta_tables_should_produce_proper_queries` function
to test the migration of non-Delta DBFS tables to managed Delta tables.
These improvements promote safer CTAS functionality and expanded
compatibility for non-Delta DBFS root tables.
* Added support for %pip cells
([#1401](https://github.com/databrickslabs/ucx/issues/1401)). A new cell
type, %pip, has been introduced to the notebook interface, allowing for
the execution of pip commands within the notebook. The new class,
PipCell, has been added with several methods, including is_runnable,
build_dependency_graph, and migrate_notebook_path, enabling the notebook
interface to recognize and handle pip cells differently from other cell
types. This allows for the installation of Python packages directly
within a notebook setting, enhancing the notebook environment and
providing users with the ability to dynamically install necessary
packages as they work. The new sample notebook file demonstrates the
installation of a package using the %pip install command. The
implementation includes modifying the notebook runtime to recognize and
execute %pip cells, and installing packages in a manner consistent with
standard pip installation processes. Additionally, a new tuple,
PIP_NOTEBOOK_SAMPLE, has been added to the existing test notebook sample
tuple list, enabling testing the handling of %pip cells during notebook
splitting.
* Added support for %sh cells
([#1400](https://github.com/databrickslabs/ucx/issues/1400)). A new
`SHELL` CellLanguage has been implemented to support %sh cells, enabling
the execution of shell commands directly within the notebook interface.
This enhancement, addressing issue
[#1400](https://github.com/databrickslabs/ucx/issues/1400) and linked to
[#1399](https://github.com/databrickslabs/ucx/issues/1399) and
[#1202](https://github.com/databrickslabs/ucx/issues/1202), streamlines
the process of running shell scripts in the notebook, eliminating the
need for external tools. The new SHELL_NOTEBOOK_SAMPLE tuple, part of
the updated test suite, demonstrates the feature's functionality with a
shell cell, while the new methods manage the underlying mechanics of
executing these shell commands. These changes not only extend the
platform's capabilities by providing built-in support for shell commands
but also improve productivity and ease-of-use for teams relying on shell
commands as part of their data processing and analysis pipelines.
* Added support for migrating Table ACL for interactive cluster in AWS
using Instance Profile
([#1285](https://github.com/databrickslabs/ucx/issues/1285)). This
change adds support for migrating table access control lists (ACLs) for
interactive clusters in AWS using an Instance Profile. A new method
`get_iam_role_from_cluster_policy` has been introduced in the `AwsACL`
class, which replaces the static method
`_get_iam_role_from_cluster_policy`. The `create_uber_principal` method
now uses this new method to obtain the IAM role name from the cluster
policy. Additionally, the project now includes AWS Role Action and AWS
Resource Permissions to handle permissions for migrating table ACLs for
interactive clusters in AWS. New methods and classes have been added to
support AWS-specific functionality and handle AWS instance profile
information. Two new tests have been added to tests/unit/test_cli.py to
test various scenarios for interactive clusters with and without ACL in
AWS. A new argument `is_gcp` has been added to WorkspaceContext to
differentiate between Google Cloud Platform and other cloud providers.
* Added support for views in `table-migration` workflow
([#1325](https://github.com/databrickslabs/ucx/issues/1325)). A new
`MigrationStatus` class has been added to track the migration status of
tables and views in a Hive metastore, and a `MigrationIndex` class has
been added to check if a table or view has been migrated or not. The
`MigrationStatusRefresher` class has been updated to use a new approach
for migrating tables and views, and is now responsible for refreshing
the migration status of tables and indexing it using the
`MigrationIndex` class. A `ViewsMigrationSequencer` class has also been
introduced to sequence the migration of views based on dependencies.
These changes improve the migration process for tables and views in the
`table-migration` workflow.
* Added workflow for in-place migrating external Parquet, Orc, Avro
hiveserde tables
([#1412](https://github.com/databrickslabs/ucx/issues/1412)). This
change introduces a new workflow, `MigrateHiveSerdeTablesInPlace`, for
in-place upgrading external Parquet, Orc, and Avro hiveserde tables to
the Unity Catalog. The workflow includes new functions to describe the
table and extract hiveserde details, update the DDL from `show create
table`, and replace the old table name with the migration target and
DBFS mount table location if any. A new function
`_migrate_external_table_hiveserde` has been added to
`table_migrate.py`, and two new arguments, `mounts` and
`hiveserde_in_place_migrate`, have been added to the `TablesMigrator`
class. These arguments control which hiveserde to migrate and replace
the DBFS mnt table location if any, enabling multiple tasks to run in
parallel and migrate only one type of hiveserde at a time. This feature
does not include user documentation, new CLI commands, or changes to
existing commands, but it does add a new workflow and modify the
existing `migrate_tables` function in `table_migrate.py`. The changes
have been manually tested, but no unit tests, integration tests, or
staging environment verification have been provided.
* Build dependency graph for local files
([#1462](https://github.com/databrickslabs/ucx/issues/1462)). This
commit refactors dependency classes to distinguish between resolution
and loading, and introduces new classes to handle different types of
dependencies. A new method, `LocalFileMigrator.build_dependency_graph`,
is implemented, following the pattern of `NotebookMigrator`, to build a
dependency graph for local files. This resolves issue
[#1202](https://github.com/databrickslabs/ucx/issues/1202)
and addresses issue
[#1360](https://github.com/databrickslabs/ucx/issues/1360).
While the refactoring and implementation of new methods improve the
accuracy of dependency graphs and ensure that dependencies are correctly
registered based on the file's language, there are no user-facing
changes, such as new or modified CLI commands, tables, or workflows.
Unit tests are added to ensure that the new changes function as
expected.
* Build dependency graph for site packages
([#1504](https://github.com/databrickslabs/ucx/issues/1504)). This
commit introduces changes to the dependency graph building process for
site packages within the ucx project. When a package is not recognized,
package files are added as dependencies to prevent errors during import
dependency determination, thereby fixing an infinite loop issue when
encountering cyclical graphs. This resolves issue
[#1427](https://github.com/databrickslabs/ucx/issues/1427) and is
related to [#1202](https://github.com/databrickslabs/ucx/issues/1202).
The changes include adding new methods for handling package files as
dependencies and preventing infinite loops when visiting cyclical
graphs. The `SitePackage` class in the `site_packages.py` file has been
updated to handle package files more accurately, with the `__init__`
method now accepting `module_paths` as a list of Path objects instead of
a list of strings. A new method, `module_paths`, has also been
introduced. Unit tests have been added to ensure the correct
functionality of these changes, and a hack in the PR will be removed
once issue [#1421](https://github.com/databrickslabs/ucx/issues/1421) is
implemented.
* Build notebook dependency graph for `%run` cells
([#1279](https://github.com/databrickslabs/ucx/issues/1279)). A new
`Notebook` class has been developed to parse source code and split it
into cells, and a `NotebookDependencyGraph` class with related utilities
has been added to discover dependencies in `%run` cells, addressing
issue [#1201](https://github.com/databrickslabs/ucx/issues/1201). The
new functionality enhances the management and tracking of dependencies
within notebooks, improving code organization and efficiency. The commit
includes updates to existing notebooks to utilize the new classes and
methods, with no impact on existing functionality outside of the `%run`
context.
* Create UC External Location, Schema, and Table Grants based on
workspace-wide Azure SPN mount points
([#1374](https://github.com/databrickslabs/ucx/issues/1374)). This
change adds new functionality to create Unity Catalog (UC) external
location, schema, and table grants based on workspace-wide Azure Service
Principal Names (SPN) mount points. The majority of the work was
completed in a previous pull request. The main change in this pull
request is the addition of a new test function,
`test_migrate_external_tables_with_principal_acl_azure`, which tests the
migration of tables with principal ACLs in an Azure environment. This
function includes the creation of a new user with cluster access,
another user without cluster access, and a new group with cluster access
to validate the migration of table grants to these entities. The
`make_cluster_permissions` method now accepts a `service_principal_name`
parameter, and after migrating the tables with the `acl_strategy` set to
`PRINCIPAL`, the function checks if the appropriate grants have been
assigned to the Azure SPN. This change is part of an effort to improve
the integration of Unity Catalog with Azure SPNs and is accessible
through the UCX CLI command. The changes have been tested through manual
testing, unit tests, and integration tests and have been verified in a
staging environment.
* Detect DBFS use in SQL statements in notebooks
([#1372](https://github.com/databrickslabs/ucx/issues/1372)). A new
linter has been added to detect and discourage the use of DBFS
(Databricks File System) in SQL statements within notebooks. This linter
raises deprecation advisories for any identified DBFS folder or mount
point references in SQL statements, encouraging the use of alternative
storage options. The change is implemented in the `NotebookLinter` class
of the 'notebook_linter.py' file, and is tested through unit tests to
ensure proper functionality. The target audience for this update
includes software engineers who use Databricks or similar platforms, as
the new linter will help users transition away from using DBFS in their
SQL statements and adopt alternative storage methods.
* Detect `sys.path` manipulation
([#1380](https://github.com/databrickslabs/ucx/issues/1380)). A change
has been introduced to the Python linter to detect manipulation of
`sys.path`. New classes, AbsolutePath and RelativePath, have been added
as subclasses of SysPath. The SysPathVisitor class has been implemented
to track additions to sys.path and the visit_Call method in
SysPathVisitor checks for 'sys.path.append' and 'os.path.abspath' calls.
The new functionality includes a new method, collect_appended_sys_paths
in PythonLinter, and a static method, list_appended_sys_paths, to
retrieve the appended paths. Additionally, new tests have been added to
the PythonLinter to detect manipulation of the `sys.path` variable,
specifically the `list_appended_sys_paths` method. The new test cases
include using aliases for `sys`, `os`, and `os.path`, and using both
absolute and relative paths. This improvement will enhance the linter's
ability to detect potential issues related to manipulation of the
`sys.path` variable. The change resolves issue
[#1379](https://github.com/databrickslabs/ucx/issues/1379) and is linked
to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202). No
user documentation or CLI commands have been added or modified, and no
manual testing has been performed. Unit tests for the new functionality
have been added.
* Detect direct access to cloud storage and raise a deprecation warning
([#1506](https://github.com/databrickslabs/ucx/issues/1506)). In this
release, the PySpark linter has been enhanced to detect and issue
deprecation warnings for direct access to cloud storage. This change,
which resolves issue
[#1133](https://github.com/databrickslabs/ucx/issues/1133), introduces
new classes `AstHelper` and `TableNameMatcher` to determine the
fully-qualified name of functions and replace instances of direct cloud
storage access with migration index table names. Instances of direct
access using 'dbfs:/', 'dbfs://', and default 'dbfs:' references will
now be detected and flagged with a deprecation warning. The test file
`test_pyspark.py` has been updated to include new tests for detecting
direct cloud storage access. Users should be aware of these changes when
updating their code to avoid deprecation warnings.
* Detect imported files and packages
([#1362](https://github.com/databrickslabs/ucx/issues/1362)). This
commit introduces functionality to parse Python code for `import` and
`from ... import` statements, enabling the detection and
management of imported files and packages. It includes a new CLI
command, modifications to existing commands, new and updated workflows,
and additional tables. The code modifications include new methods for
visiting Import and ImportFrom nodes, and the addition of unit tests to
ensure correctness. Relevant user documentation has been added, and the
new functionality has been tested through manual testing, unit tests,
and verification on a staging environment. This comprehensive update
enhances dependency management, code organization, and understanding for
a more streamlined user experience.
* Enhanced migrate views task to support views created with explicit
column list
([#1375](https://github.com/databrickslabs/ucx/issues/1375)). The commit
enhances the migrate views task to better support handling of views with
an explicit column list, improving overall compatibility. A new lookup
based on `SHOW CREATE TABLE` has been added to extract the column list
from the create script, ensuring accurate migration. The
`_migrate_view_table` method has been refactored, and a new
`_sql_migrate_view` method is added to fetch the create statement of the
view. The `ViewToMigrate` class has been updated with a new
`_view_dependencies` method to determine view dependencies in the new
SQL text. Additionally, new methods `safe_sql_key` and `add_table` have
been introduced, and the `sqlglot.parse` method is used to parse the
code with `databricks` as the read argument. A new test for migrating
views with an explicit column list has been added, along with the
`upgraded_from` and `upgraded_to` table properties, and the migration
status is updated to reflect successful migration. New test functions
have also been added to test the migration of views with columns and
ACLs. The sqlglot dependency requirement has been updated to ~=23.9.0,
enhancing the overall functionality and compatibility of the migrate
views task.
* Ensure that USE statements are recognized and apply to table
references without a qualifying schema in SQL and pyspark
([#1433](https://github.com/databrickslabs/ucx/issues/1433)). This
commit enhances the library's functionality in handling `USE` statements
in both SQL and PySpark by ensuring they are recognized and applied to
table references without a qualifying schema. A new
`CurrentSessionState` class is introduced to manage the current schema
of a session, and existing classes such as `FromTable` and
`TableNameMatcher` are updated to use this new class. Additionally, the
`lint` and `apply` methods have been updated to handle `USE` statements
and improve the precision of table reference handling. These changes are
particularly useful when working with tables in different schemas,
ensuring the library can manage table references more accurately in SQL
and PySpark. A new fixture, `extended_test_index`, has been added to
support unit tests, and the test file `test_notebook.py` has been
updated to better reflect the intended schema for each table reference.
* Expand documentation for end to end workflows with external HMS
([#1458](https://github.com/databrickslabs/ucx/issues/1458)). The UCX
toolkit has been updated to support integration with an external Hive
Metastore (HMS), in addition to the default workspace HMS. This feature
allows users to easily set up UCX to work with an existing external HMS,
providing greater flexibility in managing and accessing data. During
installation, UCX will scan for evidence of an external HMS in the
cluster policies and Spark configurations. If found, UCX will prompt the
user to connect to the external HMS, create a new policy with the
necessary Spark and data access configurations, and set up job clusters
accordingly. However, users will need to manually update the data access
configuration for SQL Warehouses that are not configured for external
HMS. Users can also create a cluster policy with appropriate Spark
configurations and data access for external HMS, or edit existing
policies in specified UCX workflows. Once set up, the assessment
workflow will scan tables and views from the external HMS, and the table
migration workflow will upgrade tables and views from the external HMS
to the Unity Catalog. Users should note that if the external HMS is
shared between multiple workspaces, a different inventory database name
should be specified for each UCX installation. It is important to plan
carefully when setting up a workspace with multiple external HMS, as the
assessment dashboard will fail if the SQL warehouse is not configured
correctly. Users can have multiple UCX installations in a workspace,
each set up with a different external HMS, or manually modify the
cluster policy and SQL data access configuration to point to the correct
external HMS after UCX has been installed.
* Extend service principal migration with option to create access
connectors with managed identity for each storage account
([#1417](https://github.com/databrickslabs/ucx/issues/1417)). This
commit extends the service principal migration feature to create access
connectors with managed identities for each storage account, enhancing
security and isolation by preventing cross-account access. A new CLI
command has been added, and an existing command has been modified. The
`create_access_connectors_for_storage_accounts` method creates access
connectors with the required permissions for each storage account used
in external tables. The `_apply_storage_permission` method has also been
updated. New unit and integration tests have been included, covering
various scenarios such as secret value decoding, secret read exceptions,
and single storage account testing. The necessary permissions for these
connectors will be set in a subsequent pull request. Additionally, two
new methods, `azure_resources_list_access_connectors` and
`azure_resources_get_access_connector`, have been introduced to ensure
access connectors are returned as expected. This change has been tested
manually and through automated tests, ensuring backward compatibility
while providing improved security features.
* Fixed UCX policy creation when instance pool is specified
([#1457](https://github.com/databrickslabs/ucx/issues/1457)). In this
release, we have made significant improvements to the handling of
instance pools in UCX policy creation. The `policy.py` file has been
updated to properly handle the case when an instance pool is specified,
by setting the `instance_pool_id` attribute and removing the
`node_type_id` attribute in the policy definition. Additionally, the
availability attribute has been removed for all cloud providers,
including AWS, Azure, and GCP, when an instance pool ID is provided. A
new `pop` method call has also been added to remove the
`gcp_attributes.availability` attribute when an instance pool ID is
provided. These changes ensure consistency in the policy definition
across all cloud providers. Furthermore, tests for this functionality
have been updated in the `test_policy.py` file, specifically the
`test_cluster_policy_instance_pool` function, to check the correct
addition of the instance pool to the cluster policy. The purpose of
these changes is to improve the reliability and functionality of UCX
policy creation, specifically when an instance pool is specified.
* Fixed `migrate-credentials` command on aws
([#1501](https://github.com/databrickslabs/ucx/issues/1501)). In this
release, the `migrate-credentials` command entry in the `labs.yml`
configuration file has been updated to include new flags for specifying
a subscription ID and AWS profile. This allows users to scan a specific
storage account and authenticate using a particular AWS profile when
migrating credentials for storage access to UC storage credentials. The
`create-account-groups` command remains unchanged. Additionally, several
issues related to the `migrate-credentials` command for AWS have been
addressed, such as hallucinating the presence of a `--profile` flag,
using a monotonically increasing role ID, and not handling cases where
there are no IAM roles to migrate. The `run` method of the
`AwsUcStorageCredentials` class has been updated to handle these cases,
and several test functions have been added or updated to ensure proper
functionality. These changes improve the functionality and robustness of
the `migrate-credentials` command for AWS.
* Fixed edge case for `RegexSubStrategy`
([#1561](https://github.com/databrickslabs/ucx/issues/1561)). In this
release, we have implemented fixes for the `RegexSubStrategy` class
within the `GroupMigrationStrategy`, addressing an issue where matching
account groups could not be found using the display name. The
`generate_migrated_groups` function has been updated to include a check
for account groups with matching external IDs when either the display
name or regex substitution of the display name fails to yield a match.
Additionally, we have expanded testing for the `GroupManager` class,
which handles group management. This includes new tests using regular
expressions to match groups, and ensuring that the `GroupManager` class
can correctly identify and manage groups based on different criteria
such as the group's ID, display name, or external ID. These changes
improve the robustness of the `GroupMigrationStrategy` and ensure the
proper functioning of the `GroupManager` class when using regular
expression substitution and matching.
* Fixed table in mount partition scans for JSON and CSV
([#1437](https://github.com/databrickslabs/ucx/issues/1437)). This
release introduces a fix for an issue where table scans on partitioned
CSV and JSON files were not being correctly identified. The
`TablesInMounts` scan function has been updated to accurately detect
these files, addressing the problem reported in issue
[#1389](https://github.com/databrickslabs/ucx/issues/1389) and linked
issue [#1437](https://github.com/databrickslabs/ucx/issues/1437). To
ensure functionality, new private methods `_find_partition_file_format`
and `_assess_path` have been introduced, with the latter updated to
handle partitioned directories. Additionally, unit tests have been added
to test partitioned CSVs and JSONs, simulating the file system's
response to various calls. These changes provide enhanced detection and
handling of partitioned CSVs and JSONs in the `TablesInMounts` scan
function.
* Forward remote logs on `run_workflow` and removed `destroy-schema`
workflow in favour of `databricks labs uninstall ucx`
([#1349](https://github.com/databrickslabs/ucx/issues/1349)). In this
release, the `destroy-schema` workflow has been removed and replaced
with the `databricks labs uninstall ucx` command, addressing issue
[#1186](https://github.com/databrickslabs/ucx/issues/1186). The
`run_workflow` function has been updated to forward remote logs, and the
`run_task` function now accepts a new argument `sql_backend`. The `Task`
class includes a new method `is_testing()` and has been updated to use
`RuntimeBackend` before `SqlBackend` in the
`databricks.labs.lsql.backends` module. The `TaskLogger` class has been
modified to include a new argument `attempt` and a new class method
`log_path()`. The `verify_metastore` method in the `verification.py`
file has been updated to handle `PermissionDenied` exceptions more
gracefully. The `destroySchema` class and its `destroy_schema` method
have been removed. The `workflow_task.py` file has been updated to
include a new argument `attempt` in the `task_run_warning_recorder`
method. These changes aim to improve the system's efficiency, error
handling, and functionality.
* Give all access connectors `Storage Blob Data Contributor` role
([#1425](https://github.com/databrickslabs/ucx/issues/1425)). A new
change has been introduced to grant the `Storage Blob Data Contributor`
role, which provides the highest level of data access, to all access
connectors for each storage account in the system. This adjustment, part
of issue [#142](https://github.com/databrickslabs/ucx/issues/142)
* Grant uber principal write permissions so that SYNC command will
succeed ([#1505](https://github.com/databrickslabs/ucx/issues/1505)). A
change has been implemented to modify the `databricks labs ucx
create-uber-principal` command, granting the uber principal write
permissions on Azure Blob Storage. This aligns with the existing
implementation on AWS where the uber principal has write access to all
S3 buckets. The modification includes the addition of a new role,
`STORAGE_BLOB_DATA_CONTRIBUTOR`, to the `_ROLES` dictionary in the
`resources.py` file. A new method, `clean_up_spn`, has also been added
to clear ucx uber service principals. This change resolves issue
[#939](https://github.com/databrickslabs/ucx/issues/939) and ensures
consistent behavior with AWS, enabling the uber principal to have write
permissions on all Azure blob containers and ensuring the success of the
`SYNC` command. The changes have been manually tested but not yet
verified on a staging environment.
* Handled new output format of `SHOW TBLPROPERTIES` command
([#1381](https://github.com/databrickslabs/ucx/issues/1381)). A recent
commit has been made to address an issue with the
`test_revert_migrated_table` test failing due to the new output format
of the `SHOW TBLPROPERTIES` command in the open-source library.
Previously, the output was blank if a table property was missing, but
now it shows a message indicating that the table does not have the
specified property. The commit updates the `is_migrated` method in the
`migration_status.py` file to handle this new output format, where the
method now uses the `fetch` method to retrieve the `upgraded_to`
property for a given schema and table. If the property is missing, the
method will continue to the next table. The commit also updates tests
for the changes, including a manual test that has not been verified on a
staging environment. Changes have been made in the
`test_table_migrate.py` file, where rows with table properties have been
updated to return new data, and the `timestamp` function now sets the
`datetime.datetime` to a `FakeDate`. No new methods have been added; only
the existing handling of `SHOW TBLPROPERTIES` command output has
changed.
* Ignore whitelisted imports
([#1367](https://github.com/databrickslabs/ucx/issues/1367)). This
commit introduces a new class `DependencyResolver` that filters Python
import dependencies based on a whitelist, and updates to the
`DependencyGraph` class to support this new resolver. A new optional
parameter `resolver` has been added to the `NotebookMigrator` class
constructor and the `DependencyGraph` constructor. A new file
`whitelist.py` has been added, introducing classes and functions for
defining and managing a whitelist of Python packages based on their name
and version. These changes aim to improve control over which
dependencies are included in the dependency graph, contributing to a
more modular and maintainable codebase.
* Increased memory for ucx clusters
([#1366](https://github.com/databrickslabs/ucx/issues/1366)). This
release introduces an update to enhance memory configuration for UCX
clusters, addressing issue
[#1366](https://github.com/databrickslabs/ucx/issues/1366). The main
change involves a new method for selecting a node type with a minimum of
16 GB of memory and local disk enabled, implemented in the `policy.py`
file of the installer module. As a result, the `node_type_id` used when
creating clusters, instance pools, and pipelines now requires at least
16 GB of memory. This change is reflected in the
`ws.clusters.select_node_type()`, `ws.instance_pools.create()`, and
`pipelines.PipelineCluster` calls in the `fixtures.py` file, ensuring
that any newly created clusters, instance pools, and
pipelines benefit from the increased memory allocation. This update aims
to improve user experience by offering higher memory configurations
out-of-the-box for UCX-related workloads.
* Integrate detection of notebook dependencies
([#1338](https://github.com/databrickslabs/ucx/issues/1338)). In this
release, the `NotebookMigrator` has been updated to integrate dependency
graph construction for detecting notebook dependencies, addressing
issues 1204, 1286, and 1326. The changes include modifying the
`NotebookMigrator` class to include the dependency graph and updating
relevant tests. A new file, `python_linter.py`, has been added for
linting Python code, which now detects calls to `dbutils.notebook.run`
with dynamic paths. The linter uses the `ast` module to parse the code
and locate nodes matching the specified criteria. The
`NotebookMigrator`'s `apply` method has been updated to check for
`ObjectType.NOTEBOOK`, loading the notebook using the new
`_load_notebook` method, and incorporating a new `_apply` method for
modifying the code in the notebook based on applicable fixes. A new
`DependencyGraph` class has been introduced to build a graph of
dependencies within the notebook, and several new methods have been
added, including `_load_object`, `_load_notebook_from_path`, and
`revert`. This release is co-authored by Cor and aims to improve
dependency management in the notebook system.
* Isolate grants computation when migrating tables
([#1233](https://github.com/databrickslabs/ucx/issues/1233)). In this
release, we have implemented a change to improve the reliability of
table migrations. Previously, grants to migrate were computed and
snapshotted outside the loop that iterates through tables to migrate,
which could lead to inconsistencies if the grants or migrated groups
changed during migration. Now, grants are re-computed for each table,
reducing the chance of such issues. We have introduced a new method
`_compute_grants` that takes in the table to migrate, ACL strategy, and
snapshots of all grants to migrate, migrated groups, and principal
grants. If `acl_strategy` is `None`, it defaults to an empty list. The
method checks each strategy in the ACL strategy list, extending the
`grants` list if the strategy is `AclMigrationWhat.LEGACY_TACL` or
`AclMigrationWhat.PRINCIPAL`. The `migrate_tables` method has been
updated to use this new method to compute grants. It first checks if
`acl_strategy` is `None`, and if so, sets it to an empty list. It then
calls `_compute_grants` with the current table, `acl_strategy`, and the
snapshots of all grants to migrate, migrated groups, and principal
grants. The computed grants are then used to migrate the table. This
change enhances the robustness of the migration process by isolating
grants computation for each table.
* Log more often from workflows
([#1348](https://github.com/databrickslabs/ucx/issues/1348)). In this
update, the log formatting for the debug log file in the `tasks.py` file
of the `databricks/labs/ucx/framework` module has been modified. The
`TimedRotatingFileHandler` has been configured to rotate the log file
every minute instead of every 10 minutes. Furthermore, the logging
format has been enhanced to
include the time, level name, name, thread name, and message. These
improvements are in response to issue
[#1171](https://github.com/databrickslabs/ucx/issues/1171) and the
implementation of more frequent logging as per issue
[#1348](https://github.com/databrickslabs/ucx/issues/1348), ensuring
more detailed and up-to-date logs for debugging and analysis purposes.
* Make `databricks labs ucx assign-metastore` prompt for workspace if no
workspace id provided
([#1500](https://github.com/databrickslabs/ucx/issues/1500)). The
`databricks labs ucx assign-metastore` command has been updated to allow
for an optional `workspace_id` parameter, with a prompt for the workspace
ID displayed if it is not provided. Both the `assign-metastore` and
`show-all-metastores` commands have been made account-level only. The
functionality of the `migrate_local_code` function remains unchanged.
Error handling for etag issues related to default catalog settings has
been implemented. Unit tests and manual testing have been conducted on a
staging environment to verify the changes. The `show_all_metastores` and
`assign_metastore` commands have been updated to accept an optional
`workspace_id` parameter. The unit tests cover various scenarios,
including cases where a user has multiple metastores and needs to select
one, as well as cases where a default catalog name is provided and needs
to be selected. If no metastore is found, a `ValueError` will be raised.
The `metastore_id` and `workspace_id` flags in the yml file have been
renamed to `metastore-id` and `workspace-id`, respectively, and a new
`default-catalog` flag has been added.
* Modified update existing role to amend the AssumeRole statement rather
than rewriting it
([#1423](https://github.com/databrickslabs/ucx/issues/1423)). The
`_aws_role_trust_doc` method of the `aws.py` file has been updated to
return a dictionary object instead of a JSON string for the AWS IAM role
trust policy document. This change allows for more fine-grained control
when updating the trust relationships of an existing role in AWS IAM.
The `create_uc_role` method has been updated to pass the role trust
document to the `_create_role` method using the `_get_json_for_cli`
method. The `update_uc_trust_role` method has been refactored to
retrieve the existing role's trust policy document, modify its
`Statement` field, and replace it with the returned value of the
`_aws_role_trust_doc` method with the specified `external_id`.
Additionally, the `test_update_uc_trust_role` function in the
`test_aws.py` file has been updated to provide more detailed and
realistic mocked responses for the `command_call` function, including
handling the case where the `iam update-assume-role-policy` command is
called and returning a mocked response with a modified assume role
policy document that includes a new principal with an external ID
condition. These changes improve the testing capabilities of the
`test_update_uc_trust_role` function and provide more comprehensive
testing of the assume role statement and role update functionality.
* Modifies dependency resolution logic to detect deprecated use of s3fs
package ([#1395](https://github.com/databrickslabs/ucx/issues/1395)). In
this release, the dependency resolution logic has been enhanced to
detect and handle deprecated usage of the s3fs package. A new function,
`_download_side_effect`, has been implemented to mock the download
behavior of the `workspace_client_mock` function, allowing for more
precise control during testing. The `DependencyResolver` class now
includes a list of `Advice` objects to inform developers about the use
of deprecated dependencies, without modifying the `DependencyGraph`
class. This change also introduces a new import statement for the s3fs
package, encouraging the adoption of up-to-date packages and practices
for improved system compatibility and maintainability. Additionally, a
unit test file, `test_s3fs.py`, has been added with test cases for various
import scenarios of s3fs to ensure proper detection and issuance of
deprecation warnings.
* Prompt for warehouse choice in uninstall if the original chosen
warehouse does not exist anymore
([#1484](https://github.com/databrickslabs/ucx/issues/1484)). In this
release, we have added a new method
`_check_and_fix_if_warehouse_does_not_exists()` to the
`WorkspaceInstaller` class, which checks if the specified warehouse in
the configuration still exists. If it doesn't, the method generates a
new configuration using a new `WorkspaceInstaller` object, saves it, and
updates the `_sql_backend` attribute with the new warehouse ID. This
change ensures that if the original chosen warehouse no longer exists,
the user will be prompted to choose a new one during uninstallation.
Additionally, we have added a new import statement for
`ResourceDoesNotExist` exception and introduced a new function
`test_uninstallation_after_warehouse_is_deleted`, which simulates a
scenario where a warehouse has been manually deleted and checks if the
uninstallation process correctly resets the warehouse. The
`StatementExecutionBackend` object is initialized with a non-existent
warehouse ID, and the configuration and sql_backend objects are updated
accordingly. This test case ensures that the uninstallation process
handles the scenario where a warehouse has been manually deleted.
* Propagate source location information within the import package
dependency graph
([#1431](https://github.com/databrickslabs/ucx/issues/1431)). This
change modifies the dependency graph build logic within several modules
of the `databricks.labs.ucx` package to propagate source location
information within the import package dependency graph. A new
`ImportDependency` class now represents import sources, and a
`list_import_sources` method returns a list of `ImportDependency`
objects, which include import string and original source code file path.
A new `IncompatiblePackage` class is added to the `Whitelist` class,
returning `UCCompatibility.NONE` when checking for compatibility. The
`ImportChecker` class checks for deprecated imports and returns `Advice`
or `Deprecation` objects with location information. Unit tests have been
added to ensure the correct behavior of these changes. Additionally, the
`Location` class and a new test function for invalid processors have
been introduced.
* Scan `site-packages`
([#1411](https://github.com/databrickslabs/ucx/issues/1411)). A
SitePackages scanner has been implemented, enhancing the linkage of
module root names with the actual Python code within installed packages
using metadata. This development addresses issue
[#1410](https://github.com/databrickslabs/ucx/issues/1410) and is
connected to [#1202](https://github.com/databrickslabs/ucx/issues/1202).
New functionalities include user documentation, a CLI command, a
workflow, and a table, accompanied by modifications to an existing
command and workflow, as well as alterations to another table. Unit
tests have been added to ensure the feature's proper functionality. In
the diff, a new unit test file for `site_packages.py` has been added,
checking `databrix` compatibility, which is reported as incompatible.
This enhancement aims to bolster the user experience by providing more
detailed insights into installed packages.
* Select DISTINCT job_run_id
([#1352](https://github.com/databrickslabs/ucx/issues/1352)). A
modification has been implemented to optimize the SQL query for
accessing log data, now retrieving distinct `job_run_id` values instead
of a single one, nested in a subquery. The enhanced query selects the
`message` field from the `inventory.logs` table, filtering based on
`job_run_id` matches with the latest timestamp within the same table.
This change
enables multiple job_run_ids to correlate with the same timestamp,
delivering a more holistic perspective of logs at a given moment. By
upgrading the query functionality to accommodate multiple job run IDs,
this improvement ensures more precise and detailed retrieval of log
data.
* Support table migration to Unity Catalog in Python code
([#1210](https://github.com/databrickslabs/ucx/issues/1210)). This
release introduces changes to the Python codebase that enhance the
SparkSql linter/fixer to support migrating Spark SQL table references to
Unity Catalog. The release includes modifications to existing commands,
specifically `databricks labs ucx migrate_local_code`, and the addition
of unit tests. The `SparkSql` class has been updated to support a new
`index` parameter, allowing for migration support. New classes including
`QueryMatcher`, `TableNameMatcher`, `ReturnValueMatcher`, and
`SparkMatchers` have been added to hold various matchers for different
spark methods. The release also includes modifications to existing
methods for caching, creating, getting, refreshing, and un-caching
tables, as well as updates to the `listTables` method to reflect the new
format. The `saveAsTable` and `register` methods have been updated to
handle variable and f-string arguments for the table name. The
`databricks labs ucx migrate_local_code` command has been modified to
handle `spark.sql` function calls that include a table name as a parameter
and suggest necessary changes to migrate to the new Unity Catalog
format. Integration tests are still needed.
* When building dependency graph, raise problems with problematic
dependencies
([#1529](https://github.com/databrickslabs/ucx/issues/1529)). A new
`DependencyProblem` class has been added to the
`databricks.labs.ucx.source_code.dependencies` module to report
problematic dependencies encountered while building the dependency
graph. The `build_dependency_graph` method of
the `SourceContainer` abstract class now accepts a `problem_collector`
parameter, which is a callable function that collects and handles
dependency problems. Instead of raising `ValueError` exceptions, the
`DependencyProblem` class is used to collect and store information about
the issues. This change improves error handling and diagnostic
information during dependency graph construction. Relevant user
documentation, a new CLI command, and a new workflow have been added,
along with modifications to existing commands and workflows. Unit tests
have been added to verify the new functionality.
* WorkspacePath to implement `pathlib.Path` API
([#1509](https://github.com/databrickslabs/ucx/issues/1509)). A new
file, `wspath.py`, has been added to the `mixins` directory of the
`databricks.labs.ucx` package, implementing the custom Path object
`WorkspacePath`. This subclass of `pathlib.Path` provides additional
methods and functionality for the Databricks Workspace, including
`cwd()`, `home()`, `scandir()`, and `listdir()`. `WorkspacePath`
interacts with the Databricks Workspace API for operations such as
checking if a file/directory exists, creating and deleting directories,
and downloading files. The `WorkspacePath` class has been updated to
implement the `pathlib.Path` API for a more intuitive and consistent
interface when working with file and directory paths. The class now
includes methods like `absolute()`, `exists()`, `joinpath()`, `parent`,
and supports the `with` statement for thread-safe code. A new test file
`test_wspath.py` has been added for the `WorkspacePath` mixin. New methods
like `expanduser()`, `as_fuse()`, `as_uri()`, `replace()`,
`write_text()`, `write_bytes()`, `read_text()`, and `read_bytes()` have
also been added. `mkdir()` and `rmdir()` now raise errors when called on
non-absolute paths and non-empty directories, respectively.
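
As an illustration of the logging change described in the #1348 entry
above, here is a minimal sketch of a debug log handler that rotates every
minute and formats records with time, level name, logger name, thread
name, and message. It uses only the Python standard library. The function
name `build_debug_handler` and the exact separators in the format string
are assumptions made for this example, not the code in `tasks.py`.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_debug_handler(log_path: str) -> TimedRotatingFileHandler:
    """Create a file handler that rolls the log file over once per minute."""
    # when="M" with interval=1 asks for a rollover every minute.
    handler = TimedRotatingFileHandler(log_path, when="M", interval=1)
    # Fields follow the changelog entry: time, level name, logger name,
    # thread name, message. The brackets and spacing are assumed.
    formatter = logging.Formatter(
        "%(asctime)s %(levelname)s [%(name)s] {%(threadName)s} %(message)s"
    )
    handler.setFormatter(formatter)
    handler.setLevel(logging.DEBUG)
    return handler
```

Attaching the handler with `logging.getLogger(name).addHandler(...)` sends
that logger's debug records to the rotating file. A shorter rotation
interval means each rotated file covers a smaller time window.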

Dependency updates:

* Bump actions/checkout from 3 to 4
([#1191](https://github.com/databrickslabs/ucx/pull/1191)).
* Bump actions/setup-python from 4 to 5
([#1189](https://github.com/databrickslabs/ucx/pull/1189)).
* Bump codecov/codecov-action from 1 to 4
([#1190](https://github.com/databrickslabs/ucx/pull/1190)).
* Bump softprops/action-gh-release from 1 to 2
([#1188](https://github.com/databrickslabs/ucx/pull/1188)).
* Bump databricks-sdk from 0.23.0 to 0.24.0
([#1223](https://github.com/databrickslabs/ucx/pull/1223)).
* Updated databricks-labs-lsql requirement from ~=0.3.0 to >=0.3,<0.5
([#1387](https://github.com/databrickslabs/ucx/pull/1387)).
* Updated sqlglot requirement from ~=23.9.0 to >=23.9,<23.11
([#1409](https://github.com/databrickslabs/ucx/pull/1409)).
* Updated sqlglot requirement from <23.11,>=23.9 to >=23.9,<23.12
([#1486](https://github.com/databrickslabs/ucx/pull/1486)).
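
To show what the `sqlglot` requirement changes above admit, the following
sketch compares a few version numbers against the three ranges. It is a
simplified illustration, not how pip resolves requirements: it handles
only purely numeric release versions and ignores pre-releases,
post-releases, and epochs. The helper names and the candidate versions
are hypothetical.

```python
def release_tuple(version: str, width: int = 3) -> tuple[int, ...]:
    """Parse a purely numeric version like '23.10.1' into a zero-padded tuple."""
    parts = [int(piece) for piece in version.split(".")]
    # Pad so that '23.9' and '23.9.0' compare as equal.
    return tuple(parts + [0] * (width - len(parts)))


def in_half_open_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper, compared segment by segment."""
    return release_tuple(lower) <= release_tuple(version) < release_tuple(upper)


# Hypothetical version numbers, used only to probe the ranges.
candidates = ["23.9.0", "23.10.1", "23.11.0"]

# `~=23.9.0` means `>=23.9.0, ==23.9.*`, i.e. the range [23.9.0, 23.10).
old_pin = [v for v in candidates if in_half_open_range(v, "23.9.0", "23.10")]
# `>=23.9,<23.11`, the requirement after #1409.
after_1409 = [v for v in candidates if in_half_open_range(v, "23.9", "23.11")]
# `>=23.9,<23.12`, the requirement after #1486.
after_1486 = [v for v in candidates if in_half_open_range(v, "23.9", "23.12")]
```

Segments are compared as integers, so `23.10` sorts after `23.9`. A string
or float comparison would get that order wrong.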
Development

Successfully merging this pull request may close these issues.

[FEATURE]: Assign appropriate permission to UCX created access connectors